Abstract

Naïve Bayes models have become popular for their success in spam email filtering. In this tutorial, we train Gaussian Naïve Bayes (GNB) classifiers to forecast the direction of the daily returns of stocks in the technology sector given the historical returns of the sector. In the results reported in the Relative Performance section, the strategy has lower variance than the SPY ETF over a 5 year backtest and during the 2020 stock market crash, and a slightly higher Sharpe ratio during the crash; its Sharpe ratio over the full 5 year backtest is below the benchmark's. The algorithm we build here follows the research done by Lu (2016) and Imandoust & Bolandraftar (2014).

Background

Naïve Bayes models classify observations into a set of classes by utilizing Bayes’ Theorem

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

In symbols, this translates to

$$P(c_i \mid x_1, \ldots, x_n) = \frac{P(c_i)\, P(x_1, \ldots, x_n \mid c_i)}{P(x_1, \ldots, x_n)}$$

where $c_i$ represents one of the $m$ classes and $x_1, \ldots, x_n$ are the features.

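The Naïve Bayes model assumes the features are conditionally independent given the class, so the joint likelihood factors into a product of per-feature likelihoods and

$$P(c_i \mid x_1, \ldots, x_n) = \frac{P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)}{P(x_1, \ldots, x_n)} \propto P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)$$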

The class that is most probable given the observation is then determined by solving

$$\hat{c} = \underset{i \in \{1, \ldots, m\}}{\arg\max}\; P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)$$
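
The decision rule above can be illustrated with a small sketch. The priors and likelihoods below are made-up numbers chosen only for illustration, and the function name `log_score` is hypothetical. The sketch sums logarithms instead of multiplying probabilities, which selects the same class while avoiding numerical underflow when there are many features.

```python
import math

# Hypothetical priors P(c_i) for the three classes used in this tutorial.
priors = {"negative": 0.3, "flat": 0.1, "positive": 0.6}

# Hypothetical per-feature likelihoods P(x_j | c_i) for one observation
# with n = 2 features.
likelihoods = {
    "negative": [0.2, 0.5],
    "flat": [0.4, 0.4],
    "positive": [0.3, 0.1],
}

def log_score(cls):
    """log P(c_i) + sum_j log P(x_j | c_i).

    The evidence term is omitted because it is identical for every class.
    """
    return math.log(priors[cls]) + sum(math.log(p) for p in likelihoods[cls])

# The predicted class maximizes the (log) prior-times-likelihood score.
predicted = max(priors, key=log_score)
```

With these numbers the "positive" class has the largest prior, but the per-feature likelihoods outweigh it and "negative" is selected.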

In our use case, the classes in the model are positive, negative, or flat future return for a security. The features are the last 4 daily returns of each universe constituent. Since we are dealing with continuous data, we extend the model to a GNB model by replacing $P(x_j \mid c_i)$ in the equation above. First, we find the mean $\mu_j$ and variance $\sigma_j^2$ of the $x_j$ feature vector over the training samples labeled class $c_i$. A normal distribution parameterized by $\mu_j$ and $\sigma_j^2$ is then used to determine the likelihood of the observations. If $o$ is the observation for the $j$th feature, the likelihood of the observation given the class $c_i$ is

$$P(x_j = o \mid c_i) = \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(o - \mu_j)^2}{2\sigma_j^2}}$$
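
As a rough sketch of this likelihood, the snippet below estimates the per-class mean and variance of a single feature and evaluates the normal density at a new observation. The return values are invented for illustration, the helper names `fit_feature` and `gaussian_likelihood` are hypothetical, and the population variance is an assumption about the estimator rather than something the text specifies.

```python
import math

def gaussian_likelihood(o, mean, variance):
    """Density of a normal distribution N(mean, variance) evaluated at o."""
    exponent = -((o - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)

def fit_feature(values):
    """Mean and population variance of one feature within one class."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

# Hypothetical daily returns of one feature, grouped by training label.
returns_by_class = {
    "positive": [0.010, 0.014, 0.006, 0.012],
    "negative": [-0.008, -0.012, -0.004, -0.010],
}
params = {c: fit_feature(v) for c, v in returns_by_class.items()}

o = 0.009  # new observation for this feature
likelihood = {c: gaussian_likelihood(o, *params[c]) for c in params}
```

Here the observation sits close to the mean of the "positive" samples, so its likelihood under that class is far larger than under "negative".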

The mechanics of the GNB model can be seen visually in this video. Note that the GNB model has 2 underlying assumptions: within each class, the features are independent and normally distributed. We do not test for these properties, but rather leave that as an area of future research.

Method

Universe Selection

Following Lu (2016), we implement a custom universe selection model to select the largest stocks from the technology sector. We restrict our universe to have a size of 10, but this can be easily customized via the fine_size parameter in the constructor.

    from AlgorithmImports import *

    class BigTechUniverseSelectionModel(FundamentalUniverseSelectionModel):
        def __init__(self, fine_size=10):
            self.fine_size = fine_size
            self.month = -1
            super().__init__(True)

        def select_coarse(self, algorithm, coarse):
            # Refresh the universe at most once per month
            if algorithm.time.month == self.month:
                return Universe.UNCHANGED
            return [ x.symbol for x in coarse if x.has_fundamental_data ]

        def select_fine(self, algorithm, fine):
            self.month = algorithm.time.month
            # Keep technology stocks, then take the largest by market cap
            tech_stocks = [ f for f in fine if f.asset_classification.morningstar_sector_code == MorningstarSectorCode.TECHNOLOGY ]
            sorted_by_market_cap = sorted(tech_stocks, key=lambda x: x.market_cap, reverse=True)
            return [ x.symbol for x in sorted_by_market_cap[:self.fine_size] ]
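
The filter, sort, and truncate logic of the fine selection step can be tried outside the framework with a minimal sketch. The `Stock` dataclass, the `select_largest` function, and the tickers and market caps below are all hypothetical stand-ins, not part of the algorithm.

```python
from dataclasses import dataclass

@dataclass
class Stock:
    """Hypothetical stand-in for a fundamental data object."""
    symbol: str
    sector: str
    market_cap: float

def select_largest(stocks, sector="TECHNOLOGY", size=10):
    """Keep one sector, sort by market cap descending, return the top symbols."""
    in_sector = [s for s in stocks if s.sector == sector]
    ranked = sorted(in_sector, key=lambda s: s.market_cap, reverse=True)
    return [s.symbol for s in ranked[:size]]

# Made-up tickers and market caps, for illustration only.
candidates = [
    Stock("AAA", "TECHNOLOGY", 900.0),
    Stock("BBB", "ENERGY", 1200.0),
    Stock("CCC", "TECHNOLOGY", 1500.0),
    Stock("DDD", "TECHNOLOGY", 300.0),
]
selected = select_largest(candidates, size=2)
```

The energy stock is dropped despite its large market cap, and only the two largest technology names remain.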

Alpha Construction

The GaussianNaiveBayesAlphaModel predicts the direction each security will move from a given day’s open to the next day’s open. When constructing this Alpha model, we set up a dictionary to hold a SymbolData object for each symbol in the universe and a flag to show the universe has changed.

    class GaussianNaiveBayesAlphaModel(AlphaModel):
        symbol_data_by_symbol = {}
        new_securities = False

Alpha Securities Management

When a new security is added to the universe, we create a SymbolData object for it to store information unique to the security. The SymbolData objects are managed in the Alpha model's on_securities_changed method. Because we train the Gaussian Naïve Bayes classifier on the historical returns of the securities in the universe, we set a flag to retrain the model every time the universe changes.

    class GaussianNaiveBayesAlphaModel(AlphaModel):
        ...
        def on_securities_changed(self, algorithm, changes):
            for security in changes.added_securities:
                self.symbol_data_by_symbol[security.symbol] = SymbolData(security, algorithm)

            for security in changes.removed_securities:
                symbol_data = self.symbol_data_by_symbol.pop(security.symbol, None)
                if symbol_data:
                    symbol_data.dispose()

            self.new_securities = True

SymbolData Class

The SymbolData class is used to store training data for the GaussianNaiveBayesAlphaModel and manage a consolidator subscription. In the constructor, we specify the training parameters, set up the consolidator, and warm up the training data.

    class SymbolData:
        def __init__(self, security, algorithm, num_days_per_sample=4, num_samples=100):
            self.exchange = security.exchange
            self.symbol = security.symbol
            self.algorithm = algorithm
            self.num_days_per_sample = num_days_per_sample
            self.num_samples = num_samples
            self.previous_open = 0
            self.model = None

            # Set up consolidators
            self.consolidator = TradeBarConsolidator(timedelta(days=1))
            self.consolidator.data_consolidated += self.custom_daily_handler
            algorithm.subscription_manager.add_consolidator(self.symbol, self.consolidator)

            # Warm up ROC lookback
            self.roc_window = np.array([])
            self.labels_by_day = pd.Series(dtype=float)
            data = {f'{self.symbol.id}_(t-{i})' : [] for i in range(1, num_days_per_sample + 1)}
            self.features_by_day = pd.DataFrame(data)
            lookback = num_days_per_sample + num_samples + 1
            history = algorithm.history(self.symbol, lookback, Resolution.DAILY)
            if history.empty or 'close' not in history:
                algorithm.log(f"Not enough history for {self.symbol} yet")
                return

            history = history.loc[self.symbol]
            history['open_close_return'] = (history.close - history.open) / history.open

            # Label for each day: return from the next open to the open after that
            start = history.shift(-1).open
            end = history.shift(-2).open
            history['future_return'] = (end - start) / start

            for day, row in history.iterrows():
                self.previous_open = row.open
                if self.update_features(day, row.open_close_return) and not pd.isnull(row.future_return):
                    row = pd.Series([np.sign(row.future_return)], index=[day])
                    # pd.concat replaces Series.append, which was removed in pandas 2.0
                    self.labels_by_day = pd.concat([self.labels_by_day, row])[-self.num_samples:]
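
The labeling step in the warm-up loop shifts the open prices by one and two days to measure the return from the next open to the open after that. A minimal sketch of the same idea with plain lists follows; the prices are made up and the function names `sign` and `direction_labels` are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def direction_labels(opens):
    """Label for day t: sign of the return from open[t+1] to open[t+2].

    The last two days get no label because their future opens are unknown.
    """
    labels = []
    for t in range(len(opens) - 2):
        start, end = opens[t + 1], opens[t + 2]
        labels.append(sign((end - start) / start))
    return labels

opens = [100.0, 101.0, 99.0, 99.0, 102.0]  # hypothetical daily opens
labels = direction_labels(opens)
```

Five opens yield three labels, one each for a negative, flat, and positive future return.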

The update_features method is called to update our training features with the latest data passed to the algorithm. It returns True once enough returns have been collected to form a full feature row, which signals that the training labels can start being updated, and False otherwise.

    class SymbolData:
        ...
        def update_features(self, day, open_close_return):
            self.roc_window = np.append(open_close_return, self.roc_window)[:self.num_days_per_sample]

            if len(self.roc_window) < self.num_days_per_sample:
                return False

            self.features_by_day.loc[day] = self.roc_window
            self.features_by_day = self.features_by_day[-(self.num_samples+2):]
            return True
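
The rolling window used by update_features keeps the newest return first and discards anything older than the window size. A small sketch of that behavior with a plain list is shown below; `update_window` is a hypothetical helper and the returns are invented.

```python
def update_window(window, new_return, size=4):
    """Prepend the newest return and keep only the `size` most recent values."""
    return ([new_return] + window)[:size]

window = []
ready_flags = []
for r in [0.01, -0.02, 0.03, 0.00, 0.05]:  # hypothetical open-to-close returns
    window = update_window(window, r)
    # A feature row can only be formed once the window is full.
    ready_flags.append(len(window) == 4)
```

The window is not full until the fourth return arrives, which corresponds to the point where the method starts returning True.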

Model Training

The GNB model is trained each day the universe has changed. By default, it uses 100 samples to train. The features are the historical open-to-close returns of the universe constituents. The label for each security at each time step is the direction (sign) of the return from the open at T+1 to the open at T+2.

    from sklearn.naive_bayes import GaussianNB

    class GaussianNaiveBayesAlphaModel(AlphaModel):
        ...
        def train(self):
            features = pd.DataFrame()
            labels_by_symbol = {}

            # Gather training data
            for symbol, symbol_data in self.symbol_data_by_symbol.items():
                if symbol_data.is_ready:
                    features = pd.concat([features, symbol_data.features_by_day], axis=1)
                    labels_by_symbol[symbol] = symbol_data.labels_by_day

            # Train the GNB model; the last 2 feature rows have no labels yet
            for symbol, symbol_data in self.symbol_data_by_symbol.items():
                if symbol_data.is_ready:
                    symbol_data.model = GaussianNB().fit(features.iloc[:-2], labels_by_symbol[symbol])
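
The training step places every symbol's lagged returns side by side in one wide row per day, then drops the two most recent rows because their labels are not known yet. The sketch below mimics that alignment with plain dictionaries and lists. The symbols, values, and variable names are hypothetical, and it only uses the days common to all symbols, which is a simplification of the DataFrame concatenation.

```python
# Hypothetical per-symbol feature rows keyed by day index (2 lagged returns each).
features_by_symbol = {
    "AAA": {1: [0.01, 0.02], 2: [0.03, 0.01], 3: [-0.01, 0.03], 4: [0.02, -0.01]},
    "BBB": {1: [0.00, 0.01], 2: [0.02, 0.00], 3: [0.01, 0.02], 4: [-0.02, 0.01]},
}
# Direction labels exist only for days whose T+1 and T+2 opens are known.
labels_by_symbol = {"AAA": [1, -1], "BBB": [-1, 1]}

# Days for which every symbol has a feature row.
days = sorted(set.intersection(*(set(rows) for rows in features_by_symbol.values())))

# One wide row per day: every symbol's lagged returns side by side.
wide = [sum((features_by_symbol[s][d] for s in features_by_symbol), []) for d in days]

# Drop the two most recent rows, which have no label yet.
X = wide[:-2]

# Every symbol shares the same feature matrix but has its own labels.
training_sets = {s: (X, y) for s, y in labels_by_symbol.items()}
```

Each symbol's classifier therefore sees the returns of the whole universe as inputs while predicting only its own direction.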

Alpha Update

As new TradeBars are provided to the Alpha model's update method, we collect the open-to-close return of the latest TradeBar for each security in the universe. We then predict the direction of each security using the security’s corresponding GNB model, and return Insight objects accordingly.

    class GaussianNaiveBayesAlphaModel(AlphaModel):
        ...
        def update(self, algorithm, data):
            if self.new_securities:
                self.train()
                self.new_securities = False

            tradable_symbols = {}
            features = [[]]
            for symbol, symbol_data in self.symbol_data_by_symbol.items():
                if data.contains_key(symbol) and data[symbol] is not None and symbol_data.is_ready:
                    tradable_symbols[symbol] = symbol_data
                    features[0].extend(symbol_data.features_by_day.iloc[-1].values)

            insights = []
            if len(tradable_symbols) == 0:
                return []
            weight = 1 / len(tradable_symbols)
            for symbol, symbol_data in tradable_symbols.items():
                direction = symbol_data.model.predict(features)
                if direction:
                    insights.append(Insight.price(symbol, data.time + timedelta(days=1, seconds=-1),
                                                  direction, None, None, None, weight))
            return insights
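
The weighting and expiry arithmetic in update can be checked in isolation. The sketch below is an illustration with hypothetical names (`insight_plan`, the tickers, and the timestamp) and returns plain tuples rather than Insight objects.

```python
from datetime import datetime, timedelta

def insight_plan(predictions, now):
    """Map {symbol: direction} to (symbol, direction, weight, expiry) tuples."""
    if not predictions:
        return []
    # Equal weight across all tradable symbols, including flat predictions.
    weight = 1 / len(predictions)
    # Each entry expires one second before the same time on the next day.
    expiry = now + timedelta(days=1, seconds=-1)
    # A direction of 0 (flat) produces no entry, mirroring `if direction:`.
    return [(s, d, weight, expiry) for s, d in predictions.items() if d]

now = datetime(2020, 10, 13, 9, 31)
plan = insight_plan({"AAA": 1, "BBB": 0, "CCC": -1, "DDD": 1}, now)
```

Because the weight is divided by the number of tradable symbols rather than the number of non-flat predictions, a flat prediction leaves its share of the portfolio unallocated in this sketch.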

Portfolio Construction & Trade Execution

We utilize the InsightWeightingPortfolioConstructionModel and the ImmediateExecutionModel.

Relative Performance

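| Period Name | Start Date | End Date | Series | Sharpe Ratio | Variance |
|---|---|---|---|---|---|
| 5 Year Backtest | 10/1/2015 | 10/13/2020 | Strategy | 0.011 | 0.013 |
| 5 Year Backtest | 10/1/2015 | 10/13/2020 | Benchmark | 0.729 | 0.024 |
| 2020 Crash | 2/19/2020 | 3/23/2020 | Strategy | -1.433 | 0.236 |
| 2020 Crash | 2/19/2020 | 3/23/2020 | Benchmark | -1.467 | 0.416 |
| 2020 Recovery | 3/23/2020 | 6/8/2020 | Strategy | -0.156 | 0.028 |
| 2020 Recovery | 3/23/2020 | 6/8/2020 | Benchmark | 4.497 | 0.072 |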
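
The tutorial does not state how the Sharpe ratio and variance figures were computed. As a rough sketch of one common convention, the snippet below annualizes both from daily returns, assuming a zero risk-free rate, 252 trading days per year, and population statistics. All of these are assumptions, the function names are hypothetical, and the returns are invented.

```python
import math
from statistics import mean, pstdev, pvariance

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Mean over standard deviation of daily returns, scaled by sqrt(periods).

    Assumes a zero risk-free rate.
    """
    return mean(daily_returns) / pstdev(daily_returns) * math.sqrt(periods_per_year)

def annualized_variance(daily_returns, periods_per_year=252):
    """Population variance of daily returns, scaled linearly by periods."""
    return pvariance(daily_returns) * periods_per_year

returns = [0.01, -0.01, 0.02, 0.00]  # hypothetical daily returns
sharpe = annualized_sharpe(returns)
variance = annualized_variance(returns)
```

Under this convention a strategy can show lower variance than a benchmark and still have a lower Sharpe ratio if its mean return is small.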

Author

Derek Melchin

November 2020