Gaussian Naive Bayes Model

Abstract

Naïve Bayes models have become popular for their success in spam email filtering. In this tutorial, we train Gaussian Naïve Bayes (GNB) classifiers to forecast the daily returns of stocks in the technology sector given the historical returns of the sector. Our implementation shows the strategy has lower variance than the SPY ETF over a 5 year backtest, during the 2020 stock market crash, and during the 2020 recovery. Its Sharpe ratio is marginally greater than SPY's during the crash but lower over the full backtest and the recovery. The algorithm we build here follows the research done by Lu (2016) and Imandoust & Bolandraftar (2014).

Background

Naïve Bayes models classify observations into a set of classes by utilizing Bayes’ Theorem

$\text{posterior} = \frac{ \text{prior } \times \text{ likelihood} } {\text{evidence}}$

In symbols, this translates to

$P(c_i | x_1, ..., x_n) = \frac{P(c_i)P(x_1, ..., x_n | c_i)}{P(x_1, ..., x_n)}$

where $c_i$ represents one of the $m$ classes and $x_1, ..., x_n$ are the features.

The Naïve Bayes model assumes the features are conditionally independent given the class, so that

$\begin{aligned} P(c_i | x_1, ..., x_n) & = \frac{P(c_i)\prod_{j=1}^{n} P(x_j | c_i)}{P(x_1, ..., x_n)} \\ & \propto P(c_i)\prod_{j=1}^{n} P(x_j|c_i) \end{aligned}$

The class that is most probable given the observation is then determined by solving

$\hat{c} = \arg\max_{i \in \{1, ..., m\}} P(c_i) \prod_{j=1}^{n} P(x_j | c_i)$
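
A product of many small probabilities can underflow in floating-point arithmetic. Because the logarithm is monotonically increasing, the same class is selected by maximizing the sum of log-probabilities instead, which is a common way to evaluate this rule numerically:

$\hat{c} = \arg\max_{i \in \{1, ..., m\}} \left[ \log P(c_i) + \sum_{j=1}^{n} \log P(x_j | c_i) \right]$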

In our use case, the classes in the model are: positive, negative, or flat future return for a security. The features are the last 4 daily returns of the universe constituents. Since we are dealing with continuous data, we extend the model to a GNB model by replacing $P(x_j|c_i)$ in the equation above. First, we find the mean $\mu_j$ and variance $\sigma_j^2$ of the $x_j$ feature vector over the training samples labeled class $c_i$. A normal distribution parameterized by $\mu_j$ and $\sigma_j^2$ is then used to determine the likelihood of the observations. If $o$ is the observation for the $j$th feature, the likelihood of the observation given the class $c_i$ is

$P(x_j = o | c_i) = \frac{1}{\sqrt{2 \pi \sigma_j^2}} e^{-\frac{(o - \mu_j)^2}{2 \sigma_j^2}}$
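
The following is a minimal from-scratch sketch of the GNB decision rule described above, written with only the Python standard library. The function names (`fit_gnb`, `gaussian_likelihood`, `predict`) and the toy data are hypothetical and are not part of the algorithm in this tutorial. It uses the population variance without any smoothing term, which is a simplifying assumption.

```python
import math
from collections import defaultdict

def fit_gnb(samples, labels):
    """Estimate the prior and the per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for x, c in zip(samples, labels):
        rows_by_class[c].append(x)
    params = {}
    for c, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(columns, means)]
        params[c] = (n / len(samples), means, variances)
    return params

def gaussian_likelihood(o, mu, var):
    """Normal density of observation o given mean mu and variance var."""
    return math.exp(-(o - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(params, x):
    """Return the class that maximizes log prior plus summed log likelihoods."""
    scores = {}
    for c, (prior, means, variances) in params.items():
        score = math.log(prior)
        for o, mu, var in zip(x, means, variances):
            score += math.log(gaussian_likelihood(o, mu, var))
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training set: two return features per sample, labels are +1 / -1.
samples = [[0.010, 0.020], [0.015, 0.010], [0.020, 0.015],
           [-0.010, -0.020], [-0.020, -0.010], [-0.015, -0.015]]
labels = [1, 1, 1, -1, -1, -1]

params = fit_gnb(samples, labels)
up = predict(params, [0.012, 0.018])
down = predict(params, [-0.020, -0.010])
print(up, down)
```

Scores are accumulated in log space so that the product of several small densities does not underflow.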

The mechanics of the GNB model can be seen visually in this video. Note that the GNB model has 2 underlying assumptions: conditional on the class, the features are independent and normally distributed. We do not test for these properties, but rather leave them as an area of future research.

Video Walkthrough

Method

Universe Selection

Following Lu (2016), we implement a custom universe selection model to select the largest stocks from the technology sector. We restrict our universe to have a size of 10, but this can be easily customized via the fine_size parameter in the constructor.

class BigTechUniverseSelectionModel(FundamentalUniverseSelectionModel):
    def __init__(self, fine_size=10):
        self.fine_size = fine_size
        self.month = -1
        super().__init__(True)
    def select_coarse(self, algorithm, coarse):
        if algorithm.time.month == self.month:
            return Universe.UNCHANGED
        return [ x.symbol for x in coarse if x.has_fundamental_data ]
    def select_fine(self, algorithm, fine):
        self.month = algorithm.time.month
        tech_stocks = [ f for f in fine if f.asset_classification.morningstar_sector_code == MorningstarSectorCode.TECHNOLOGY ]
        sorted_by_market_cap = sorted(tech_stocks, key=lambda x: x.market_cap, reverse=True)
        return [ x.symbol for x in sorted_by_market_cap[:self.fine_size] ]
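
The fine selection step can be sketched outside of the framework as a filter, a sort, and a slice. In the sketch below, `MockFundamental`, `select_largest`, and the sample tickers are hypothetical stand-ins for illustration; they are not LEAN classes or real data.

```python
from dataclasses import dataclass

@dataclass
class MockFundamental:
    # Hypothetical stand-in for a fundamental data object (not the LEAN class).
    symbol: str
    sector: str
    market_cap: float

def select_largest(fine, sector, size):
    """Keep one sector, sort by market cap descending, return the top `size` symbols."""
    in_sector = [f for f in fine if f.sector == sector]
    ranked = sorted(in_sector, key=lambda f: f.market_cap, reverse=True)
    return [f.symbol for f in ranked[:size]]

fine = [
    MockFundamental("AAA", "technology", 900.0),
    MockFundamental("BBB", "energy", 1200.0),
    MockFundamental("CCC", "technology", 1500.0),
    MockFundamental("DDD", "technology", 300.0),
]
top_two = select_largest(fine, "technology", 2)
print(top_two)  # ['CCC', 'AAA']
```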

Alpha Construction

The GaussianNaiveBayesAlphaModel predicts the direction each security will move from a given day’s open to the next day’s open. When constructing this Alpha model, we set up a dictionary to hold a SymbolData object for each symbol in the universe and a flag to show the universe has changed.

class GaussianNaiveBayesAlphaModel(AlphaModel):
    symbol_data_by_symbol = {}
    new_securities = False

Alpha Securities Management

When a new security is added to the universe, we create a SymbolData object for it to store information unique to the security. The management of the SymbolData objects occurs in the Alpha model's on_securities_changed method. In this algorithm, since we train the Gaussian Naive Bayes classifier using the historical returns of the securities in the universe, we set a flag to retrain the model every time the universe changes.

class GaussianNaiveBayesAlphaModel(AlphaModel):
    ...
    def on_securities_changed(self, algorithm, changes):
        for security in changes.added_securities:
            self.symbol_data_by_symbol[security.symbol] = SymbolData(security, algorithm)
        for security in changes.removed_securities:
            symbol_data = self.symbol_data_by_symbol.pop(security.symbol, None)
            if symbol_data:
                symbol_data.dispose()
        self.new_securities = True

SymbolData Class

The SymbolData class is used to store training data for the GaussianNaiveBayesAlphaModel and manage a consolidator subscription. In the constructor, we specify the training parameters, set up the consolidator, and warm up the training data.

class SymbolData:
    def __init__(self, security, algorithm, num_days_per_sample=4, num_samples=100):
        self.exchange = security.exchange
        self.symbol = security.symbol
        self.algorithm = algorithm
        self.num_days_per_sample = num_days_per_sample
        self.num_samples = num_samples
        self.previous_open = 0
        self.model = None
        # Setup consolidators
        self.consolidator = TradeBarConsolidator(timedelta(days=1))
        self.consolidator.data_consolidated += self.custom_daily_handler
        algorithm.subscription_manager.add_consolidator(self.symbol, self.consolidator)
        # Warm up ROC lookback
        self.roc_window = np.array([])
        self.labels_by_day = pd.Series(dtype=float)
        data = {f'{self.symbol.id}_(t-{i})' : [] for i in range(1, num_days_per_sample + 1)}
        self.features_by_day = pd.DataFrame(data)
        lookback = num_days_per_sample + num_samples + 1
        history = algorithm.history(self.symbol, lookback, Resolution.DAILY)
        if history.empty or 'close' not in history:
            algorithm.log(f"Not enough history for {self.symbol} yet")    
            return
        history = history.loc[self.symbol]
        history['open_close_return'] = (history.close - history.open) / history.open
        start = history.shift(-1).open
        end = history.shift(-2).open
        history['future_return'] = (end - start) / start
        for day, row in history.iterrows():
            self.previous_open = row.open
            if self.update_features(day, row.open_close_return) and not pd.isnull(row.future_return):
                row = pd.Series([np.sign(row.future_return)], index=[day])
                # Series.append was removed in pandas 2.0; pd.concat is the replacement
                self.labels_by_day = pd.concat([self.labels_by_day, row])[-self.num_samples:]
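
To make the label construction in the constructor concrete, here is a minimal sketch on made-up prices. The price values and dates are hypothetical. It shifts the open column directly, which selects the same values as shifting the whole frame and then taking the open column.

```python
import numpy as np
import pandas as pd

# Made-up daily bars for illustration only.
history = pd.DataFrame(
    {"open": [100.0, 102.0, 101.0, 101.0, 104.0],
     "close": [101.0, 101.5, 101.0, 103.0, 105.0]},
    index=pd.date_range("2020-01-06", periods=5, freq="B"),
)

# Feature: return from the open to the close of the same day.
history["open_close_return"] = (history["close"] - history["open"]) / history["open"]

# Label: direction of the return from the next open (T+1) to the open after it (T+2).
start = history["open"].shift(-1)
end = history["open"].shift(-2)
history["future_return"] = (end - start) / start
history["label"] = np.sign(history["future_return"])

# The last two rows have no T+2 open yet, so they carry no label.
labels = history["label"].dropna()
print(labels.tolist())
```

The two trailing rows without labels are the reason the training features are later trimmed by two rows before fitting.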

The update_features method is called to update our training features with the latest data passed to the algorithm. It returns a Boolean indicating whether the features are in place to start updating the training labels.

class SymbolData:
    ...
    def update_features(self, day, open_close_return):
        self.roc_window = np.append(open_close_return, self.roc_window)[:self.num_days_per_sample]
        if len(self.roc_window) < self.num_days_per_sample:
            return False
        self.features_by_day.loc[day] = self.roc_window
        self.features_by_day = self.features_by_day[-(self.num_samples+2):]
        return True
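
The windowing step in update_features can be sketched in isolation. The helper name `push` and the sample returns are hypothetical. The newest value is placed first, so index 0 of the window lines up with the `(t-1)` feature column and the last index with `(t-4)`.

```python
import numpy as np

def push(window, value, size):
    """Prepend the newest value and keep at most `size` entries."""
    return np.append(value, window)[:size]

window = np.array([])
lengths = []
for daily_return in [0.01, -0.02, 0.03, 0.00, 0.05]:
    window = push(window, daily_return, size=4)
    lengths.append(len(window))

print(window.tolist())  # newest first; the oldest return (0.01) has dropped out
```

Until the window holds `size` values, the method in the class above returns False and no feature row is written.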

Model Training

The GNB model is trained each day the universe has changed. By default, it uses 100 samples to train. The features are the historical open-to-close returns of the universe constituents. The labels are the signs of the returns from the open at $T+1$ to the open at $T+2$ at each time step for each security.

class GaussianNaiveBayesAlphaModel(AlphaModel):
    ...
    def train(self):
        features = pd.DataFrame()
        labels_by_symbol = {}
        # Gather training data
        for symbol, symbol_data in self.symbol_data_by_symbol.items():
            if symbol_data.is_ready:
                features = pd.concat([features, symbol_data.features_by_day], axis=1)
                labels_by_symbol[symbol] = symbol_data.labels_by_day
        # Train the GNB model
        for symbol, symbol_data in self.symbol_data_by_symbol.items():
            if symbol_data.is_ready:
                symbol_data.model = GaussianNB().fit(features.iloc[:-2], labels_by_symbol[symbol])
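
The training step can be sketched on synthetic data as follows. The tickers, the random returns, and the random labels are all made up, so the fitted models carry no predictive meaning; the point is only the shape of the inputs. The second half shows that column-wise concatenation of frames whose indexes differ leaves NaN entries, which GaussianNB does not accept.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
days = pd.date_range("2020-01-01", periods=60, freq="B")

# Hypothetical per-symbol feature frames: 4 lagged returns each.
features_by_symbol = {
    s: pd.DataFrame(rng.normal(0, 0.01, size=(60, 4)), index=days,
                    columns=[f"{s}_(t-{i})" for i in range(1, 5)])
    for s in ["AAA", "BBB"]
}
# Hypothetical labels: the sign of a random "future return".
labels_by_symbol = {
    s: pd.Series(np.sign(rng.normal(0, 0.01, size=60)), index=days)
    for s in features_by_symbol
}

# Every model sees the lagged returns of all constituents, side by side.
features = pd.concat(list(features_by_symbol.values()), axis=1)

# One classifier per symbol, all trained on the same feature matrix.
models = {s: GaussianNB().fit(features, y) for s, y in labels_by_symbol.items()}
prediction = models["AAA"].predict(features.iloc[[-1]])

# If one frame is missing a day, the outer join fills that row with NaN.
short = features_by_symbol["BBB"].iloc[1:]
misaligned = pd.concat([features_by_symbol["AAA"], short], axis=1)
n_missing = int(misaligned.isna().sum().sum())
```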

Alpha Update

As new TradeBars are provided to the Alpha model's update method, we collect the open-to-close return of the latest TradeBar for each security in the universe. We then predict the direction of each security using the security’s corresponding GNB model, and return Insight objects accordingly.

class GaussianNaiveBayesAlphaModel(AlphaModel):
    ...
    def update(self, algorithm, data):
        if self.new_securities:
            self.train()
            self.new_securities = False
        tradable_symbols = {}
        features = [[]]
        for symbol, symbol_data in self.symbol_data_by_symbol.items():
            if data.contains_key(symbol) and data[symbol] is not None and symbol_data.is_ready:
                tradable_symbols[symbol] = symbol_data
                features[0].extend(symbol_data.features_by_day.iloc[-1].values)
        insights = []
        if len(tradable_symbols) == 0:
            return []
        weight = 1 / len(tradable_symbols)
        for symbol, symbol_data in tradable_symbols.items():
            direction = symbol_data.model.predict(features)
            if direction:
                insights.append(Insight.price(symbol, data.time + timedelta(days=1, seconds=-1), 
                                              direction, None, None, None, weight))
        return insights
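
The weighting logic in the method above can be sketched without the framework. The function name `build_signals` and the sample predictions are hypothetical. As in the method, the weight is computed over all tradable symbols, and symbols with a flat prediction (direction 0) produce no signal, so part of the portfolio can stay unallocated.

```python
def build_signals(predictions):
    """Map {symbol: predicted direction} to (symbol, direction, weight) tuples."""
    if not predictions:
        return []
    weight = 1 / len(predictions)  # equal weight across all tradable symbols
    return [(symbol, direction, weight)
            for symbol, direction in predictions.items()
            if direction != 0]     # flat predictions emit nothing

signals = build_signals({"AAA": 1, "BBB": 0, "CCC": -1, "DDD": 1})
total_weight = sum(weight for _, _, weight in signals)
print(signals, total_weight)
```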

Portfolio Construction & Trade Execution

We utilize the InsightWeightingPortfolioConstructionModel and the ImmediateExecutionModel.

Relative Performance

Period Name	Start Date	End Date	Strategy	Sharpe	Variance
5 Year Backtest	10/1/2015	10/13/2020	Strategy	0.011	0.013
5 Year Backtest	10/1/2015	10/13/2020	Benchmark	0.729	0.024
2020 Crash	2/19/2020	3/23/2020	Strategy	-1.433	0.236
2020 Crash	2/19/2020	3/23/2020	Benchmark	-1.467	0.416
2020 Recovery	3/23/2020	6/8/2020	Strategy	-0.156	0.028
2020 Recovery	3/23/2020	6/8/2020	Benchmark	4.497	0.072
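
The text does not state the conventions used to compute the Sharpe and Variance columns. As a rough sketch of how such statistics are commonly derived from a series of daily returns, the code below assumes a zero risk-free rate and 252 trading days per year. Both are assumptions, and the return values are made up.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization factor

def annualized_variance(daily_returns):
    """Sample variance of daily returns scaled to one year."""
    return statistics.variance(daily_returns) * TRADING_DAYS

def annualized_sharpe(daily_returns, risk_free_daily=0.0):
    """Mean excess return over its standard deviation, scaled to one year."""
    excess = [r - risk_free_daily for r in daily_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(TRADING_DAYS)

daily_returns = [0.010, -0.005, 0.007, 0.002, -0.004]  # made-up values
variance = annualized_variance(daily_returns)
sharpe = annualized_sharpe(daily_returns)
print(round(variance, 6), round(sharpe, 3))
```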


Hi Derek,

Thanks for developing and sharing this code. I cloned it and ran a backtest on it. I got the following error when I ran it. Append works on lists and not series in python so this may be an easy fix. Any thoughts?

8 | 2:17:03: Runtime Error: 'Series' object has no attribute 'append'
at __getattr__
return object.__getattribute__(self, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
in generic.py: line 6203
at __init__
self.labels_by_day = self.labels_by_day.append(row)[-self.num_samples:]
^^^^^^^^^^^^^^^^^^^^^^^^^
in symbol_data.py: line 60
at OnSecuritiesChanged
self.symbol_data_by_symbol[security.Symbol] = SymbolData(security, algorithm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
in alpha.py: line 65 (Open Stack Trace)

Derek Melchin INVESTOR

QuantConnect | May 2024

See the attached backtest for an updated version of the algorithm in PEP8 style.

Lou Pino INVESTOR

June 2024

Derek Melchin

QuantConnect | June 2024

Hi Lou, please run the PEP8 version of the algorithm above. It fixes this error.

Serhii INVESTOR

December 2024

Hi Derek.

Thanks. I tried to clone both versions (original + PEP8), but I get this error:

10 | 12:12:39: Runtime Error: Input X contains NaN.
GaussianNB does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
  at _assert_all_finite_element_wise
    raise ValueError(msg_err)
 in validation.py: line 174
  at _assert_all_finite
    _assert_all_finite_element_wise(
 in validation.py: line 125
  at check_array
    _assert_all_finite(
 in validation.py: line 1048
  at check_X_y
    X = check_array(
        ^^^^^^^^^^^^
 in validation.py: line 1262
  at _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 in base.py: line 649
  at _partial_fit
    X, y = self._validate_data(X, y, reset=first_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 in naive_bayes.py: line 422
  at fit
    return self._partial_fit(
           ^^^^^^^^^^^^^^^^^^
 in naive_bayes.py: line 262
  at wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 in base.py: line 1473
  at _train
    symbol_data.model = GaussianNB().fit(features.iloc[:-2], labels_by_symbol[symbol])
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 in alpha.py: line 88
  at update
    self._train()
 in alpha.py: line 30

QuantConnect | December 2024

Hi Serhii,

After this algorithm was published, we've switched the default value of the daily_precise_end_time setting in LEAN from False to True. So to fix the error, change the setting back to False.

self.settings.daily_precise_end_time = False

See the attached backtest for reference.

Best,
Derek Melchin
