Sentiment Indicator by Brain measures public sentiment around US Equities. The data covers 4,500 US Equities, starts in August 2016, and is delivered at a daily frequency. The dataset is created by analyzing financial news with Natural Language Processing techniques while accounting for the similarity and repetition of news on the same topic. Each stock is assigned a sentiment score ranging from -1 (most negative) to +1 (most positive), updated daily. The score corresponds to the average sentiment across each piece of news and is available on two time scales: 7 days and 30 days.
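As a hypothetical illustration of the scoring scheme (the values below are made up, not from the dataset), a stock's daily score can be thought of as the average of per-article sentiment values, bounded in [-1, +1]:

```python
# Assumed per-article sentiment values for one stock on one day
news_scores = [0.4, -0.1, 0.25, 0.6]

# Average across articles, clipped to the documented [-1, +1] range
daily_score = sum(news_scores) / len(news_scores)
daily_score = max(-1.0, min(1.0, daily_score))
print(round(daily_score, 4))  # 0.2875
```

The actual Brain methodology additionally weighs the similarity and repetition of news on the same topic, which this sketch omits.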
In the following cell, we create a DatasetAnalyzer
to analyze a factor from the Brain Sentiment Indicator dataset. We select the sentiment
factor from the dataset and define a value function to transform the raw values into a factor we want to study. In this case, the value function returns the raw value of the dataset unchanged, since the data is already processed. We select our universe to be the constituents of SPY as of 2021-08-31. As our target label, we analyze the return over the 21 days following each observation of the factor values. The in-sample period we review is 2018-01-01 to 2021-01-01 and the out-of-sample period is 2021-01-01 to 2021-08-01.
The table displayed after the cell contains the raw factor values of the dataset before the value function is applied. After this cell, the factor values are transformed using the value function, and the raw factor values are no longer used.
from dataset_analyzer import *
# (1) Select the dataset to analyze
universe = ETFUniverse("SPY", datetime(2021, 8, 31))
# (2) Select the tickers that are linked to the dataset
# For linked datasets, this should be equal to `universe`.
# For unlinked datasets, provide 1 link, for ex: ['REG']
dataset_tickers = universe
# (3) Define the factors to study
factors = [
    Factor('sentiment', 'Sentiment', 'continuous', None)
]
# (4) Create the DatasetAnalyzer instance
dataset_analyzer = DatasetAnalyzer(dataset=BrainSentimentIndicator30Day,
                                   dataset_tickers=dataset_tickers,
                                   universe=universe,
                                   factors=factors,
                                   sparse_data=True,
                                   dataset_start_date=datetime(2018, 1, 1),
                                   in_sample_end_date=datetime(2021, 1, 1),
                                   out_of_sample_end_date=datetime(2021, 8, 1),
                                   return_prediction_period=21)
sentiment (truncated to 7 of 490 columns; each column is keyed by symbol and security ID, e.g. AA.BrainSentimentIndicator30Day R735QTJ8XC9W)

| time | AA | AAL | AAP | AAPL | ... | ZMH | ZTS |
|---|---|---|---|---|---|---|---|
| 2018-01-01 12:00:00 | 0.1862 | 0.1045 | 0.1781 | 0.0464 | ... | 0.1528 | 0.275 |
| 2018-01-02 12:00:00 | 0.1746 | 0.0964 | 0.1361 | 0.0452 | ... | 0.1748 | 0.275 |
| 2018-01-03 12:00:00 | 0.0853 | 0.1092 | 0.1361 | 0.0483 | ... | 0.1786 | 0.275 |
| 2018-01-04 12:00:00 | 0.0853 | 0.1450 | 0.1361 | 0.0478 | ... | 0.2731 | 0.275 |
| 2018-01-05 12:00:00 | 0.0853 | 0.1623 | 0.1293 | 0.0427 | ... | 0.3221 | 0.275 |

5 rows × 490 columns
The plot below shows the values of the sentiment
factor after applying the value function defined above. The factor values of 10 securities in the universe are presented, but the number of securities can be adjusted by changing the num_securities
argument.
dataset_analyzer.plot_data_shape(num_securities=10,
y_axis_title='Sentiment score',
subplot_title_extension='of Brain Sentiment Indicator')
The following table shows statistical attributes for the sentiment
factor values. The four central moments and the p-value of a normality test are presented for the factor. The null hypothesis of the normality test is that the factor values are normally distributed.
A small p-value (i.e., <= 0.05) provides statistical evidence to reject the null hypothesis. To test normality, the Jarque-Bera test is used when the number of data points exceeds 2,000 [1]; otherwise, the Shapiro-Wilk test is used. To remove outliers from the data before calculating these statistics, the factor values can be winsorized by providing the winsorize_limits
argument, which ignores the bottom x% and top y% of factor values.
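The test-selection logic described above can be sketched with SciPy. This is an assumed reconstruction, not the DatasetAnalyzer internals:

```python
import numpy as np
from scipy import stats

def normality_pvalue(values):
    """Assumed logic: Jarque-Bera when n > 2,000, else Shapiro-Wilk."""
    values = np.asarray(values)
    if len(values) > 2000:
        return stats.jarque_bera(values).pvalue
    return stats.shapiro(values).pvalue

rng = np.random.default_rng(42)
p_small = normality_pvalue(rng.normal(size=500))   # Shapiro-Wilk branch
p_large = normality_pvalue(rng.normal(size=5000))  # Jarque-Bera branch
print(p_small, p_large)  # both p-values lie in [0, 1]
```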
dataset_analyzer.calculate_statistics(winsorize_limits=(0.025, 0.025))
| Universe Statistic | Sentiment |
|---|---|
| Mean | 0.134592 |
| Standard deviation | 0.112330 |
| Skewness | -0.027931 |
| Kurtosis | -0.084097 |
| Normality test P-value | 0.000000 |
The coefficient of determination, R2, quantifies how much variation of the dependent variable is explained by the independent variables in a linear regression model. An R2 of 1 means changes in the dependent variable can be perfectly explained by changes in the independent variables. R2 of 0 means changes in the dependent variable can't be explained by changes in the independent variables.
When analyzing the R2 of a linear regression model, the p-value of the t-test should also be considered. The t-test tests the null hypothesis that the coefficient of a factor in the model is zero. A p-value below a level of significance (traditionally, 0.05) suggests the null hypothesis can be rejected. As a result, a simple linear regression model with predictive power and a statistically significant factor exhibits a high R2 and a low t-test p-value.
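For a single security, both quantities can be read from a simple regression fit. The data below is synthetic, generated only to illustrate the interpretation:

```python
import numpy as np
from scipy import stats

# Synthetic single-security example: forward returns with a linear
# dependence on a factor, plus noise
rng = np.random.default_rng(7)
factor = rng.normal(0.1, 0.1, 250)                  # stand-in factor values
returns = 0.5 * factor + rng.normal(0, 0.05, 250)   # stand-in 21-day returns

fit = stats.linregress(factor, returns)
r_squared = fit.rvalue ** 2   # share of return variance explained by the factor
p_value = fit.pvalue          # t-test p-value on the slope coefficient
print(r_squared > 0.3, p_value < 0.05)  # True True: informative and significant
```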
The plot below shows the R2 and t-test p-value that result from training a simple linear regression model for each security in the universe, using the sentiment
factor as the independent variable. An informative factor that forecasts the return over the following 21 days extremely well across the entire universe would place all of the dots in the bottom-right corner of the plot.
dataset_analyzer.measure_significance()
The table below shows the correlation between the sentiment
factor in the Brain Sentiment Indicator dataset and factors from other datasets. In addition to the correlation coefficient, the p-value of the correlation coefficient is displayed in parentheses. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a correlation at least as extreme as the one computed from these factors [2, 3]. To analyze the correlation between the sentiment
factor and additional factors from other datasets, clone this notebook and extend the DemoCorrelationDatasets
class in the factor.py file.
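Each cell of the table amounts to a Pearson correlation with its p-value, which SciPy computes directly. The series below are synthetic stand-ins for two aligned factor histories:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two aligned factor series
rng = np.random.default_rng(3)
sentiment = rng.normal(size=1000)
other_factor = 0.2 * sentiment + rng.normal(size=1000)  # weakly related factor

corr, p = stats.pearsonr(sentiment, other_factor)
print(f"{corr:.4f} ({p:.4f})")  # same "coefficient (p-value)" layout as the table
```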
dataset_analyzer.calculate_factor_correlations(DemoCorrelationDatasets().other_dataset_factor_by_class)
| Factor | Correlation Coefficient (P-Value) |
|---|---|
| Sentiment | 1.0000 (0.0000) |
| QuiverWallStreetBets.rank | 0.1804 (0.0000) |
| QuiverWallStreetBets.sentiment | 0.0208 (0.0000) |
| QuiverQuantTwitterFollowers.followers | -0.2191 (0.0000) |
| USTreasuryYieldCurveRate.onemonth | -0.0209 (0.0000) |
In the cell below, we'll analyze the performance of a variety of supervised machine learning models using the sentiment
factor as the feature and the return over the following 21 days as the label. To limit the number of inputs to each model, each security in the universe will have its own model to train. Once the models are fit, we'll test the model of each security on out-of-sample data and produce a distribution of the accuracy of the models.
Regression models have a continuous dependent variable. In this example, we train the regression model to predict the return over the next 21 days after each observation of the sentiment
factors. To determine the exposure of the positions in our portfolio when using regression models, we take the sign of the prediction. That is, if the future return prediction is >0, we take a long position in the security. On the other hand, if the return prediction is <= 0, we take a short position in the security.
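The sign-of-prediction rule above reduces to a one-liner (the prediction values here are illustrative):

```python
import numpy as np

# Illustrative 21-day return predictions from a regression model
predicted_returns = np.array([0.012, -0.004, 0.0, 0.03])

# Prediction > 0 -> long (+1); prediction <= 0 -> short (-1)
positions = np.where(predicted_returns > 0, 1, -1)
print(positions.tolist())  # [1, -1, -1, 1]
```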
Linear regression models find a line of best fit through a set of data points. To fit a simple linear regression model with one independent variable, a scatter plot is created using the independent variable on the x-axis and the dependent variable on the y-axis. A straight line is then drawn through the scatter plot that minimizes the vertical distance (the mean squared error) between the line and all of the data points. The best-fit line can then be used to make predictions when given new observations for the sentiment
factors. Depending on the number of independent variables, single or multiple linear regression can be performed.
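A minimal sketch of fitting such a best-fit line with NumPy (the data points are made up; `np.polyfit` with degree 1 minimizes the squared vertical distances described above):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # assumed factor values
y = np.array([0.1, 1.9, 4.1, 5.9])   # assumed future returns, roughly 2x

# Degree-1 polynomial fit = ordinary least squares line
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 2), round(intercept, 2))  # 1.96 0.06

# Predict for a new factor observation
prediction = slope * 1.5 + intercept
```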
Classification models have a discrete dependent variable. In this example, we train the classification models to predict the sign of the return over the next 21 days after each observation of the sentiment
factors. When the models predict a positive future return, we take a long position in the security. On the other hand, if the return prediction is negative, we take a short position in the security.
Logistic regression models accept continuous or categorical data points as the independent variables and output a binary dependent variable. These models work by assigning a likelihood value to each of the two binary outputs when given values for the independent variables. The binary value with the largest likelihood is selected as the output for the observation of independent variables.
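A toy example of this likelihood-based classification with scikit-learn (the factor values and labels are fabricated to be separable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated factor values and return-sign labels (0 = down, 1 = up)
X = np.array([[-0.5], [-0.3], [-0.1], [0.1], [0.3], [0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
probs = model.predict_proba([[0.4]])[0]       # likelihood of each class
print(model.predict([[0.4]])[0])              # 1: the more likely class wins
```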
Support vector machine classifiers separate observations into two classes by finding the point on the line of factor values that maximizes the margin between the two classes. To reduce model variance, support vector machines use cross-validation to find the optimal point on the line of factor values.
Decision trees classify observations into multiple classes by building a tree of decision nodes and leaf nodes. The decision nodes compare the factor values of the observation to determine which direction the classifier should traverse further down the tree. When the classifier reaches a leaf node, it classifies the observation according to the class of the leaf node.
Random forests classify observations into multiple classes by training many decision tree classifiers on subsets of the training data. When an observation needs to be classified, the random forest classifier weighs the decisions made by the collection of decision trees to determine which class the observation most likely belongs to.
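The per-security train-and-score loop can be sketched on synthetic data (fabricated factor and labels; the cell below does the same for every security in the universe):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Fabricated factor values and noisy return-sign labels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_in, y_in = X[:200], y[:200]     # in-sample split
X_out, y_out = X[200:], y[200:]   # out-of-sample split

for model in (DecisionTreeClassifier(max_depth=3, min_samples_leaf=20),
              RandomForestClassifier(n_estimators=10, max_depth=2,
                                     random_state=1990)):
    model.fit(X_in, y_in)
    print(type(model).__name__, round(model.score(X_out, y_out), 2))
```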
dataset_analyzer.run_ml_models(
    regression_models=[LinearRegression()],
    classifier_models=[LogisticRegression(),
                       SVC(),
                       DecisionTreeClassifier(max_depth=3, min_samples_leaf=20),
                       RandomForestClassifier(n_estimators=10, max_depth=2,
                                              random_state=1990, min_samples_leaf=20)])
| Model | In-Sample Mean | In-Sample Std Dev | Out-of-Sample Mean | Out-of-Sample Std Dev |
|---|---|---|---|---|
| LinearRegression | 0.5946 | 0.0560 | 0.5974 | 0.1493 |
| LogisticRegression | 0.6006 | 0.0518 | 0.6071 | 0.1510 |
| SVC | 0.6287 | 0.0406 | 0.5919 | 0.1544 |
| DecisionTreeClassifier | 0.6468 | 0.0399 | 0.5758 | 0.1438 |
| RandomForestClassifier | 0.6384 | 0.0383 | 0.5822 | 0.1495 |
The Brain Sentiment Indicator dataset enables researchers to incorporate sentiment from financial news sources into their strategies. Possible approaches to be tested:
License the Brain Sentiment Indicator dataset to begin further research and start live trading.