XGBoost Regression Model Prediction of Future Returns based on Indicator and Market Percent Change Data

Hello, below is an example of a machine learning algorithm built with the XGBoost library. I am running a regression model on future returns with equity data, using indicator data and market data as features: the RSI (Relative Strength Index) and ATR (Average True Range) indicators, plus the market (SPY) ROCP (Rate Of Change Percent). The hypothesis is that these indicators give the regressor enough information to make reasonably accurate predictions of price movements, so the algorithm can trade the symbols it predicts will move a significant amount. A notebook is attached for a more fundamental look at the model. I used RandomizedSearchCV() to save computation, but feel free to switch to GridSearchCV() for an exhaustive, deterministic search.

Conclusion:

This is a good first step into ML on the QC platform. I would love to see improved versions posted as a fun challenge for anyone who is interested. I will link a Medium article about this algorithm as soon as it is published. Happy coding!

Disclaimer:

This is for entertainment purposes and is not financial advice. It is also not advisable to trade this algorithm in its current state, as it still needs improvement and further review. The algorithm makes many assumptions that could be cleaned up by refactoring it with more dynamic code, such as the prediction threshold that is currently hardcoded as 0.5 for longs and -0.5 for shorts. There is other room for improvement as well. For instance, a quick review of the overview tab reveals that the algorithm had a win rate of only 40% and a loss rate of 60%; even though the wins were larger on average, this could be a red flag. If you see any problems, please comment below to let me know what is wrong, and I will try to review and fix them. Also, I can only get this backtest to produce output similar to this backtest with v2 of the QuantConnect development environment. For some reason, the v2 IDE and v3 IDE backtest results are drastically different. I reached out to QuantConnect, and they couldn't find a good reason for this other than possibly switching to GridSearchCV(), which I tried with no luck. It has me stumped, so if anyone figures out why, be sure to share 😊 Here is the code for the GridSearchCV() that I tried, in case anyone is interested.

# GridSearchCV takes param_grid and has no n_iter argument (those belong to RandomizedSearchCV).
self.models[symbol] = GridSearchCV(estimator=self.models[symbol], param_grid=parameters, scoring='neg_mean_squared_error', cv=4, verbose=1)

So, if you want similar results, you will have to run this on the v2 IDE, which, as of the date of this post, you can switch to in your account settings.
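For anyone comparing the two search classes: GridSearchCV takes `param_grid` and evaluates every combination, while RandomizedSearchCV takes `param_distributions` plus `n_iter` and samples candidates, so it needs `random_state` to be repeatable. Below is a minimal sketch of both, assuming scikit-learn is installed. The synthetic data, the parameter values, and the use of `GradientBoostingRegressor` as a stand-in for the XGBoost regressor are all assumptions for illustration, not the notebook's code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data: 200 rows, 3 feature columns (hypothetical, not the algorithm's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Hypothetical search space with 2 * 2 * 2 = 8 combinations.
parameters = {
    "n_estimators": [10, 20],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# Exhaustive search: every combination in the grid is evaluated.
grid = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=parameters,
    scoring="neg_mean_squared_error",
    cv=4,
)
grid.fit(X, y)

# Randomized search: samples n_iter candidates; random_state makes the sampling repeatable.
rand = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=parameters,
    n_iter=4,
    scoring="neg_mean_squared_error",
    cv=4,
    random_state=42,
)
rand.fit(X, y)

print("grid candidates:", len(grid.cv_results_["params"]), grid.best_params_)
print("random candidates:", len(rand.cv_results_["params"]), rand.best_params_)
```

The grid search cost grows with the product of the list lengths, which is why the randomized search is cheaper on larger search spaces.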


Hey Dan,

Thanks for sharing this and it looks like a great first attempt.

Couple of thoughts here:

In the model training, it looks like you are using only the last 24 hourly indicator values as input features. (In other words your input features are shape (24, 3) with columns [ROCP, RSI, ATR]?) What about market data, and/or lagged features? This seems like a very small amount of data to be able to train the ML model effectively without overfitting.
Shouldn't the target variable be future returns? It looks like the target here is `scaled_rocp` which means you are learning the mapping from indicators at time t to the return between [t-24, t].
With 10 symbols in the universe (refreshed monthly), you are training 10 separate models, one per symbol. This would become unnecessarily computationally expensive if the universe were larger, and you would also be discarding potentially very useful information in the cross-section.
The v2 and v3 difference in results might be because the random operations weren't seeded. Try adding `random_state` in the `RandomizedSearchCV`.

indicator_df = pd.DataFrame(np.hstack((scaled_market_rocp, scaled_rsi, scaled_atr)))
self.Log(f'Data Shape: {indicator_df.shape}')
x_train, x_valid, y_train, y_valid = train_test_split(indicator_df, scaled_rocp, test_size=0.35, random_state=42)
self.Log(f'X (Train): {x_train.shape}')
self.Log(f'X (val): {x_valid.shape}')
>>> Data Shape: (24, 3)
>>> X (Train): (15, 3)
>>> X (val): (9, 3)
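One way to read the lagged-features suggestion above, as a minimal sketch in plain Python: each training row holds the last few values of an indicator instead of a single value. The helper name `make_lagged_rows` and the RSI numbers are made up for illustration and are not part of the algorithm.

```python
def make_lagged_rows(series, n_lags):
    """Build one row per time index t, holding the n_lags most recent
    values up to and including t (oldest first)."""
    rows, end_indices = [], []
    for t in range(n_lags - 1, len(series)):
        rows.append(series[t - n_lags + 1 : t + 1])
        end_indices.append(t)
    return rows, end_indices


# Hypothetical hourly RSI readings.
rsi = [48.0, 52.0, 55.0, 61.0, 58.0, 50.0]

rows, end_indices = make_lagged_rows(rsi, n_lags=3)
for t, row in zip(end_indices, rows):
    print(t, row)
```

Each row only uses values available at or before its own time index, so the features stay free of look-ahead as long as the target is taken from after that index.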

Dan Root INVESTOR

May 2022

Here is the article about this algorithm that got published on the Medium platform.


Adam W INVESTOR

HamCap | May 2022

June 2022

Hello Adam,

First, thank you for the comment!

Okay so, the Market ROCP Data is one of the indicators? In the SymbolData class the market_rocp is created with only data on the SPY index. Also, the 24 shape is deceiving as that is the same size I made my rolling windows, so that is 24 windows of 24 hours. This means there is quite a bit of input data. You can see that in the SymbolData class as well.

indicator_df = pd.DataFrame(np.hstack((scaled_market_rocp, scaled_rsi, scaled_atr)))

This is a simple ML algorithm designed to help people understand the fundamentals of implementing ML on the QC platform; it is not intended to be production-ready. I am also posting it as a community challenge, to see the different modified versions of this strategy that other quants choose to post. As for the small amount of data, as I mentioned above, I think 24 windows of 24 hours is sufficient for this simple strategy. I experimented with using the unscaled ROCP or a scaled ROCP as the target and decided on the latter because of the performance increase. Scaling has a quantile-like effect on the targets, so the algorithm only trades the ones valued higher or lower after scaling. I then did some tuning and exploring, which IMO is part of the experience, and in this case I preferred scaled_rocp as the target. As for seeding the random state, I tried it after reading your reply, and it had no effect. The results were the same, with the two environments behaving the same on the backtests. Thanks for the suggestion though! If you figure it out, let me know. I appreciate the thoughtful feedback! I would love to see a modified version from you with your other changes, as the point of this is to make the algorithm better with the help of the community as a shared learning exercise. So, posting shared results is always encouraged! Happy coding! 😊

HamCap | June 2022

Hey Dan, totally get that it's a simple algo for learning/getting familiar with QC and I'm always glad to see more people working with ML on here!

So I think there might be a small misunderstanding here of how `RollingWindow` works in QC, or perhaps an unintended bug in the code? You can run for instance:

indicator_df = pd.DataFrame(np.hstack((scaled_market_rocp, scaled_rsi, scaled_atr)))
self.Log(f'Data Shape: {indicator_df.shape}')
x_train, x_valid, y_train, y_valid = train_test_split(indicator_df, scaled_rocp, test_size=0.35, random_state=42)
self.Log(f'X (Train): {x_train.shape}')
self.Log(f'X (val): {x_valid.shape}')
>>> Data Shape: (24, 3)
>>> X (Train): (15, 3)
>>> X (val): (9, 3)

So the RSI rolling window you defined would represent the previous 12 hr RSI at timepoints { t-24, t-23, …, t-1, t } where t is the current time that you are training the model.

Now for the target, you define it as (and possibly scaled) `rocp = algorithm.ROCP(symbol, 24)` which is the %-return over the period [t-24, t]. In other words, this is ( price_t - price_{t-24} ) / price_{t-24}. I am guessing that what you intended is future returns, i.e. the %-return over the period [t, t+24] or something like that, since in principle price_t is already known at time t.
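To make the distinction concrete, here is a minimal sketch in plain Python with made-up prices and a horizon of 2 bars instead of 24. The function names are hypothetical. It shows that the forward return for time t is the same number as the trailing return observed at t + horizon, so the label for features at t only becomes available `horizon` bars later.

```python
def forward_returns(prices, horizon):
    """Fractional return over [t, t + horizon], for every t whose future price exists."""
    return [
        (prices[t + horizon] - prices[t]) / prices[t]
        for t in range(len(prices) - horizon)
    ]


def trailing_returns(prices, horizon):
    """Fractional return over [t - horizon, t], which is what a ROCP-style indicator reports at t."""
    return [
        (prices[t] - prices[t - horizon]) / prices[t - horizon]
        for t in range(horizon, len(prices))
    ]


prices = [100.0, 102.0, 101.0, 105.0, 110.0]  # hypothetical prices
horizon = 2

fwd = forward_returns(prices, horizon)    # aligned with t = 0, 1, 2
trail = trailing_returns(prices, horizon)  # aligned with t = 2, 3, 4

# Features at time t pair with fwd[t]; the last `horizon` feature rows have no label yet.
for t, target in enumerate(fwd):
    print(f"features at t={t} -> return over [{t}, {t + horizon}] = {target:.4f}")
```

In a live setting this means the most recent `horizon` rows can be used for prediction but not for training.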

Also one thing I missed initially - in the Predict function you generally wouldn't want to be scaling the data differently from the model training. If the RSI during training is [ 0.45, 0.5, 0.55 ], this gets scaled to [ -1, 0, 1 ]. If the new live RSI is [ 0.05, 0.1, 0.15 ], this also gets scaled to [ -1, 0, 1 ] if you re-fit the scaler, so the ML model will treat these two sets of features as identical when they are not.
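The RSI example above can be reproduced with a small sketch. `MinMaxToRange` is a hypothetical hand-rolled stand-in for a min-max scaler with a (-1, 1) output range, written in plain Python so the arithmetic is visible; it is not the scaler object used in the algorithm.

```python
class MinMaxToRange:
    """Tiny stand-in for a min-max scaler that maps the fitted min and max to -1 and 1."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = self.hi - self.lo  # assumes the fitted values are not all identical
        return [2.0 * (v - self.lo) / span - 1.0 for v in values]


train_rsi = [0.45, 0.50, 0.55]
live_rsi = [0.05, 0.10, 0.15]

# Fit once on the training data and keep the fitted scaler.
scaler = MinMaxToRange().fit(train_rsi)
train_scaled = scaler.transform(train_rsi)

# Reusing the fitted scaler keeps the live values distinguishable from the training values.
reused = scaler.transform(live_rsi)

# Re-fitting on the live data maps it onto the same range as the training data.
refit = MinMaxToRange().fit(live_rsi).transform(live_rsi)

print("train :", [round(v, 6) for v in train_scaled])
print("reused:", [round(v, 6) for v in reused])
print("refit :", [round(v, 6) for v in refit])
```

With the reused scaler the live values fall well outside [-1, 1], which is the signal that the model is seeing data unlike its training set; with the re-fitted scaler that information is lost.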

Hey Alex,

Thank you for the constructive criticism!

Splitting the data in training and not in the prediction function makes sense. I guess I assumed the eval_set would fix that, but it makes sense to me given the shape issues. That is a good example of a bug that needs to be squashed, and I love to see quants like yourself finding them! Good catch! As for the other recommendations and fixes, I am only using the last item from the scaled indicator_df when making a prediction, so just the prior day predicting the next day. Using the .pct_change() method could be a better implementation though. The predicted indicators are scaled the same way as the trained ones, using MinMaxScaler() with a range of -1 to 1 for each value in the numpy array. I understand that the scaling issue could and would occur with a small number of input features, but for features with a length of 24, wouldn't it be very rare? This also seems like a drawback of scaling data in general. I do see your point though! This is exactly why you should post an example of a modified version. I would love to see the improvements to help my understanding of your feedback! It would also help illustrate your recommendations and fixes for anyone reading. Thanks for the valuable input either way! 😊

Correction, sorry I meant Adam not Alex on that last message. 😊

Axist INVESTOR

Axist Capital | December 2022

This used to run, but there must have been a recent change to QuantConnect, as the algo now seems to hang indefinitely after training the model during the backtest. Is this occurring for anyone else?

Derek Melchin INVESTOR

QuantConnect | January 2023

Hi Axist,

The algorithm doesn't hang if we reduce the training parameters. For example:

parameters = {
    'n_estimators': [2],
    'learning_rate': [0.001],
    'max_depth': [2],
    'gamma': [0.001],
    'random_state': [42]
}

Best,
Derek Melchin

