Sentiment Indicator by Brain measures public sentiment around US Equities. The data covers 4,500 US Equities, starts in August 2016, and is delivered at a daily frequency. The dataset is created by analyzing financial news with Natural Language Processing techniques while accounting for the similarity and repetition of news on the same topic. Each stock is assigned a sentiment score ranging from -1 (most negative) to +1 (most positive), updated daily. The score corresponds to the average sentiment across each piece of news and is available on two time scales: 7 days and 30 days.
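As a hypothetical illustration of the scoring scheme (the values below are made up, not from the dataset), a stock's daily score can be thought of as the average of per-article sentiment values, bounded in [-1, +1]:

```python
# Assumed per-article sentiment values for one stock on one day
news_scores = [0.4, -0.1, 0.25, 0.6]

# Average across articles, clipped to the documented [-1, +1] range
daily_score = sum(news_scores) / len(news_scores)
daily_score = max(-1.0, min(1.0, daily_score))
print(round(daily_score, 4))  # 0.2875
```

The actual Brain methodology additionally weighs the similarity and repetition of news on the same topic, which this sketch omits.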
In the following cell, we create a DatasetAnalyzer
to analyze a factor from the Brain Sentiment Indicator dataset. We select the sentiment
factor from the dataset and define a value function to transform the raw values into a factor we want to study. In this case, the value function returns the raw value of the dataset unchanged, since the data is already processed. We select our universe to be the constituents of SPY as of 2021-08-31. As our target label, we analyze the return over the 21 days following each observation of the factor values. The in-sample period we review is 2018-01-01 to 2021-01-01 and the out-of-sample period is 2021-01-01 to 2021-08-01.
The table displayed after the cell contains the raw factor values of the dataset before the value function is applied. After this cell, the factor values are transformed using the value function, and the raw factor values are no longer used.
from dataset_analyzer import *
# (1) Select the dataset to analyze
universe = ETFUniverse("SPY", datetime(2021, 8, 31))
# (2) Select the tickers that are linked to the dataset
# For linked datasets, this should be equal to `universe`.
# For unlinked datasets, provide 1 link, for ex: ['REG']
dataset_tickers = universe
# (3) Define the factors to study
factors = [
    Factor('sentiment', 'Sentiment', 'continuous', None)
]
# (4) Create the DatasetAnalyzer instance
dataset_analyzer = DatasetAnalyzer(dataset=BrainSentimentIndicator30Day,
                                   dataset_tickers=dataset_tickers,
                                   universe=universe,
                                   factors=factors,
                                   sparse_data=True,
                                   dataset_start_date=datetime(2018, 1, 1),
                                   in_sample_end_date=datetime(2021, 1, 1),
                                   out_of_sample_end_date=datetime(2021, 8, 1),
                                   return_prediction_period=21)
sentiment (truncated to 7 of 490 columns; each column is keyed by symbol and security ID, e.g. AA.BrainSentimentIndicator30Day R735QTJ8XC9W)

| time | AA | AAL | AAP | AAPL | ... | ZMH | ZTS |
|---|---|---|---|---|---|---|---|
| 2018-01-01 12:00:00 | 0.1862 | 0.1045 | 0.1781 | 0.0464 | ... | 0.1528 | 0.275 |
| 2018-01-02 12:00:00 | 0.1746 | 0.0964 | 0.1361 | 0.0452 | ... | 0.1748 | 0.275 |
| 2018-01-03 12:00:00 | 0.0853 | 0.1092 | 0.1361 | 0.0483 | ... | 0.1786 | 0.275 |
| 2018-01-04 12:00:00 | 0.0853 | 0.1450 | 0.1361 | 0.0478 | ... | 0.2731 | 0.275 |
| 2018-01-05 12:00:00 | 0.0853 | 0.1623 | 0.1293 | 0.0427 | ... | 0.3221 | 0.275 |

5 rows × 490 columns
The plot below shows the values of the sentiment
factor after applying the value function defined above. The factor values of 10 securities in the universe are presented, but the number of securities can be adjusted by changing the num_securities
argument.
dataset_analyzer.plot_data_shape(num_securities=10,
y_axis_title='Sentiment score',
subplot_title_extension='of Brain Sentiment Indicator')
The following table shows statistical attributes for the sentiment
factor values. The four central moments and the p-value of a normality test are presented for the factor. The null hypothesis of the normality test is that the factor values are normally distributed.
A small p-value (i.e., <= 0.05) provides statistical evidence to reject the null hypothesis. To test normality, the Jarque-Bera test is used when the number of data points exceeds 2,000 [1]; otherwise, the Shapiro-Wilk test is used. To remove outliers from the data before calculating these statistics, the factor values can be winsorized by providing the winsorize_limits
argument, which ignores the bottom x% and top y% of factor values.
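The test-selection logic described above can be sketched with SciPy. This is an assumed reconstruction, not the DatasetAnalyzer internals:

```python
import numpy as np
from scipy import stats

def normality_pvalue(values):
    """Assumed logic: Jarque-Bera when n > 2,000, else Shapiro-Wilk."""
    values = np.asarray(values)
    if len(values) > 2000:
        return stats.jarque_bera(values).pvalue
    return stats.shapiro(values).pvalue

rng = np.random.default_rng(42)
p_small = normality_pvalue(rng.normal(size=500))   # Shapiro-Wilk branch
p_large = normality_pvalue(rng.normal(size=5000))  # Jarque-Bera branch
print(p_small, p_large)  # both p-values lie in [0, 1]
```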
dataset_analyzer.calculate_statistics(winsorize_limits=(0.025, 0.025))
| Universe Statistic | Sentiment |
|---|---|
| Mean | 0.134592 |
| Standard deviation | 0.112330 |
| Skewness | -0.027931 |
| Kurtosis | -0.084097 |
| Normality test P-value | 0.000000 |
The coefficient of determination, R2, quantifies how much variation of the dependent variable is explained by the independent variables in a linear regression model. An R2 of 1 means changes in the dependent variable can be perfectly explained by changes in the independent variables. R2 of 0 means changes in the dependent variable can't be explained by changes in the independent variables.
When analyzing the R2 of a linear regression model, the p-value of the t-test should also be considered. The t-test tests the null hypothesis that the coefficient of a factor in the model is zero. A p-value below a level of significance (traditionally, 0.05) suggests the null hypothesis can be rejected. As a result, a simple linear regression model with predictive power and a statistically significant factor exhibits a high R2 and a low t-test p-value.
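For a single security, both quantities can be read from a simple regression fit. The data below is synthetic, generated only to illustrate the interpretation:

```python
import numpy as np
from scipy import stats

# Synthetic single-security example: forward returns with a linear
# dependence on a factor, plus noise
rng = np.random.default_rng(7)
factor = rng.normal(0.1, 0.1, 250)                  # stand-in factor values
returns = 0.5 * factor + rng.normal(0, 0.05, 250)   # stand-in 21-day returns

fit = stats.linregress(factor, returns)
r_squared = fit.rvalue ** 2   # share of return variance explained by the factor
p_value = fit.pvalue          # t-test p-value on the slope coefficient
print(r_squared > 0.3, p_value < 0.05)  # True True: informative and significant
```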
The plot below shows the R2 and t-test p-value that result from training a simple linear regression model for each security in the universe, using the sentiment
factor as the independent variable. An informative factor that forecasts the return over the following 21 days extremely well across the entire universe would place all of the dots in the bottom-right corner of the plot.
dataset_analyzer.measure_significance()
The table below shows the correlation between the sentiment
factor in the Brain Sentiment Indicator dataset and factors from other datasets. In addition to the correlation coefficient, the p-value of the correlation coefficient is displayed in parentheses. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a correlation at least as extreme as the one computed from these factors [2, 3]. To analyze the correlation between the sentiment
factor and additional factors from other datasets, clone this notebook and extend the DemoCorrelationDatasets
class in the factor.py file.
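Each cell of the table amounts to a Pearson correlation with its p-value, which SciPy computes directly. The series below are synthetic stand-ins for two aligned factor histories:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two aligned factor series
rng = np.random.default_rng(3)
sentiment = rng.normal(size=1000)
other_factor = 0.2 * sentiment + rng.normal(size=1000)  # weakly related factor

corr, p = stats.pearsonr(sentiment, other_factor)
print(f"{corr:.4f} ({p:.4f})")  # same "coefficient (p-value)" layout as the table
```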
dataset_analyzer.calculate_factor_correlations(DemoCorrelationDatasets().other_dataset_factor_by_class)
| Factor | Correlation Coefficient (P-Value) |
|---|---|
| Sentiment | 1.0000 (0.0000) |
| QuiverWallStreetBets.rank | 0.1804 (0.0000) |
| QuiverWallStreetBets.sentiment | 0.0208 (0.0000) |
| QuiverQuantTwitterFollowers.followers | -0.2191 (0.0000) |
| USTreasuryYieldCurveRate.onemonth | -0.0209 (0.0000) |
In the cell below, we'll analyze the performance of a variety of supervised machine learning models using the sentiment
factor as the feature and the return over the following 21 days as the label. To limit the number of inputs to each model, each security in the universe will have its own model to train. Once the models are fit, we'll test the model of each security on out-of-sample data and produce a distribution of the accuracy of the models.
Regression models have a continuous dependent variable. In this example, we train the regression model to predict the return over the next 21 days after each observation of the sentiment
factors. To determine the exposure of the positions in our portfolio when using regression models, we take the sign of the prediction. That is, if the future return prediction is >0, we take a long position in the security. On the other hand, if the return prediction is <= 0, we take a short position in the security.
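The sign-of-prediction rule above reduces to a one-liner (the prediction values here are illustrative):

```python
import numpy as np

# Illustrative 21-day return predictions from a regression model
predicted_returns = np.array([0.012, -0.004, 0.0, 0.03])

# Prediction > 0 -> long (+1); prediction <= 0 -> short (-1)
positions = np.where(predicted_returns > 0, 1, -1)
print(positions.tolist())  # [1, -1, -1, 1]
```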
Linear regression models find a line of best fit through a set of data points. To fit a simple linear regression model with one independent variable, a scatter plot is created using the independent variable on the x-axis and the dependent variable on the y-axis. A straight line is then drawn through the scatter plot that minimizes the vertical distance (the mean squared error) between the line and all of the data points. The best-fit line can then be used to make predictions when given new observations for the sentiment
factors. Depending on the number of independent variables, single or multiple linear regression can be performed.
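A minimal sketch of fitting such a best-fit line with NumPy (the data points are made up; `np.polyfit` with degree 1 minimizes the squared vertical distances described above):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # assumed factor values
y = np.array([0.1, 1.9, 4.1, 5.9])   # assumed future returns, roughly 2x

# Degree-1 polynomial fit = ordinary least squares line
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 2), round(intercept, 2))  # 1.96 0.06

# Predict for a new factor observation
prediction = slope * 1.5 + intercept
```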
Classification models have a discrete dependent variable. In this example, we train the classification models to predict the sign of the return over the next 21 days after each observation of the sentiment
factors. When the models predict a positive future return, we take a long position in the security. On the other hand, if the return prediction is negative, we take a short position in the security.
Logistic regression models accept continuous or categorical data points as the independent variables and output a binary dependent variable. These models work by assigning a likelihood value to each of the two binary outputs when given values for the independent variables. The binary value with the largest likelihood is selected as the output for the observation of independent variables.
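A toy example of this likelihood-based classification with scikit-learn (the factor values and labels are fabricated to be separable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated factor values and return-sign labels (0 = down, 1 = up)
X = np.array([[-0.5], [-0.3], [-0.1], [0.1], [0.3], [0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
probs = model.predict_proba([[0.4]])[0]       # likelihood of each class
print(model.predict([[0.4]])[0])              # 1: the more likely class wins
```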
Support vector machine classifiers separate observations into two classes by finding the point on the line of factor values that maximizes the margin between the two classes. To reduce model variance, support vector machines use cross-validation to find the optimal point on the line of factor values.
Decision trees classify observations into multiple classes by building a tree of decision nodes and leaf nodes. The decision nodes compare the factor values of the observation to determine which direction the classifier should traverse further down the tree. When the classifier reaches a leaf node, it classifies the observation according to the class of the leaf node.
Random forests classify observations into multiple classes by training many decision tree classifiers on subsets of the training data. When an observation needs to be classified, the random forest classifier weighs the decisions made by the collection of decision trees to determine which class the observation most likely belongs to.
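The per-security train-and-score loop can be sketched on synthetic data (fabricated factor and labels; the cell below does the same for every security in the universe):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Fabricated factor values and noisy return-sign labels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_in, y_in = X[:200], y[:200]     # in-sample split
X_out, y_out = X[200:], y[200:]   # out-of-sample split

for model in (DecisionTreeClassifier(max_depth=3, min_samples_leaf=20),
              RandomForestClassifier(n_estimators=10, max_depth=2,
                                     random_state=1990)):
    model.fit(X_in, y_in)
    print(type(model).__name__, round(model.score(X_out, y_out), 2))
```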
dataset_analyzer.run_ml_models(
    regression_models=[LinearRegression()],
    classifier_models=[LogisticRegression(),
                       SVC(),
                       DecisionTreeClassifier(max_depth=3, min_samples_leaf=20),
                       RandomForestClassifier(n_estimators=10, max_depth=2,
                                              random_state=1990, min_samples_leaf=20)])
| Model | In-Sample Mean | In-Sample Std Dev | Out-of-Sample Mean | Out-of-Sample Std Dev |
|---|---|---|---|---|
| LinearRegression | 0.5946 | 0.0560 | 0.5974 | 0.1493 |
| LogisticRegression | 0.6006 | 0.0518 | 0.6071 | 0.1510 |
| SVC | 0.6287 | 0.0406 | 0.5919 | 0.1544 |
| DecisionTreeClassifier | 0.6468 | 0.0399 | 0.5758 | 0.1438 |
| RandomForestClassifier | 0.6384 | 0.0383 | 0.5822 | 0.1495 |
The Brain Sentiment Indicator dataset enables researchers to incorporate sentiment from financial news sources into their strategies. Possible approaches to be tested:
License the Brain Sentiment Indicator dataset to begin further research and start live trading.