
FinBERT

Introduction

This page explains how to use FinBERT in LEAN trading algorithms. The model repository provides the following description:

FinBERT is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus and thereby fine-tuning it for financial sentiment classification. Financial PhraseBank by Malo et al. (2014) is used for fine-tuning. For more details, please see the paper FinBERT: Financial Sentiment Analysis with Pre-trained Language Models and our related blog post on Medium.

The model will give softmax outputs for three labels: positive, negative or neutral.

Use Cases

The FinBERT model is a sentiment analysis model. The following use cases explain how you might use it in trading algorithms:

  • Analyze the sentiment of the latest news articles for specific companies, then form a long-short portfolio with the assets that have the most positive and most negative news sentiment.
  • Monitor the sentiment of regulatory alerts in a risk management model and liquidate holdings when sentiment is extremely negative.
  • Generate sentences based on information from other datasets and then feed them into the model to determine the sentiment. For example, you could use the US Government Contracts dataset to create the string "The Department of State grants AAPL a contract for the purchase of mobile phones" (see the sketch after this list).
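
The following snippet sketches the third use case. It's a minimal illustration only: the contract record and its fields are hypothetical placeholders, not the actual schema of the US Government Contracts dataset, so substitute the objects your data subscription provides.

# Hypothetical contract record; replace with fields from your dataset subscription.
contract = {
    "agency": "The Department of State",
    "ticker": "AAPL",
    "description": "the purchase of mobile phones"
}
# Build a sentence that FinBERT can score like any other text.
sentence = (
    f"{contract['agency']} grants {contract['ticker']} "
    f"a contract for {contract['description']}"
)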

Load Pre-Trained Model

Follow these steps to load the pre-trained FinBERT model:

  1. Import the model and tokenizer classes.

    from transformers import TFBertForSequenceClassification, BertTokenizer
  2. Define the path where the model is stored. In QuantConnect Cloud, the path is ProsusAI/finbert.

    model_path = "ProsusAI/finbert"
  3. Create a TFBertForSequenceClassification model.

    self._model = TFBertForSequenceClassification.from_pretrained(model_path, local_files_only=True)
  4. Create a BertTokenizer object.

    self._tokenizer = BertTokenizer.from_pretrained(model_path, local_files_only=True)
  5. (Optional) Set the seed to enable reproducibility.

    from transformers import set_seed
    set_seed(1, True)
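
Putting these steps together, a minimal loading routine might look like the following sketch. It assumes it runs inside a QCAlgorithm's initialize method in QuantConnect Cloud, where the model files are cached locally:

from transformers import TFBertForSequenceClassification, BertTokenizer, set_seed

# Optional: set the seed so results are reproducible.
set_seed(1, True)
# The model files are cached locally in QuantConnect Cloud, hence local_files_only=True.
model_path = "ProsusAI/finbert"
self._tokenizer = BertTokenizer.from_pretrained(model_path, local_files_only=True)
self._model = TFBertForSequenceClassification.from_pretrained(model_path, local_files_only=True)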

Fine-Tune Model

The FinBERT model is pre-trained, so you don't need to fine-tune it. Fine-tuning the model just tailors it to your specific use case. Follow these steps to fine-tune it:

  1. Import the Dataset class.

    from datasets import Dataset
  2. Load the pre-trained model.
  3. Compile the model with an optimizer and loss function.

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5), 
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    )

    You only need to compile the model once. TensorFlow offers many other optimizer and loss classes you can use.

  4. Create a DataFrame that contains your training samples. The DataFrame should have two columns, named "text" and "label". The label must be an integer that represents the sentiment class. By default, the FinBERT model has three classes. Class zero represents negative sentiment, class one represents neutral sentiment, and class two represents positive sentiment.

    samples = pd.DataFrame(columns=['text', 'label'])
    # Add rows to the DataFrame...
  5. Convert the samples DataFrame to a Dataset object.

    dataset = Dataset.from_pandas(samples)
  6. Tokenize the text in each training sample.

    dataset = dataset.map(
        lambda sample: self._tokenizer(sample['text'], padding='max_length', truncation=True)
    )
  7. Call the model's prepare_tf_dataset method to convert the Dataset into a tf.data.Dataset that's ready for training.

    dataset = model.prepare_tf_dataset(dataset, shuffle=True, tokenizer=self._tokenizer)
  8. Call the model's fit method.

    model.fit(dataset, epochs=2)
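
The sketch below strings these steps together. The two training rows are hypothetical placeholders for your own labeled samples, and the model and self._tokenizer objects are assumed to come from the loading steps above:

import pandas as pd
import tensorflow as tf
from datasets import Dataset

# Hypothetical training samples; 0=negative, 1=neutral, 2=positive.
samples = pd.DataFrame(
    [
        ["The company reported record quarterly profits.", 2],
        ["The firm missed earnings estimates and cut its guidance.", 0],
    ],
    columns=['text', 'label']
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
)
dataset = Dataset.from_pandas(samples)
dataset = dataset.map(
    lambda sample: self._tokenizer(sample['text'], padding='max_length', truncation=True)
)
# Pass a small batch_size here because this toy set has only two samples;
# with shuffling, incomplete batches may be dropped otherwise.
dataset = model.prepare_tf_dataset(dataset, batch_size=2, shuffle=True, tokenizer=self._tokenizer)
model.fit(dataset, epochs=2)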

Analyze Sentiment

Follow these steps to analyze the sentiment of some text with FinBERT:

  1. Load the model.
  2. (Optional) Fine-tune the model.
  3. Get the text you want the model to analyze. The model can analyze a single sentence or a list of sentences.

    content = "AAPL stock price spikes after record-breaking sales figures."
  4. Tokenize the text(s).

    inputs = self._tokenizer(content, padding=True, truncation=True, return_tensors='tf')

    For more information about how to tokenize, see the PreTrainedTokenizer.__call__ reference on the Hugging Face website.

  5. Perform dictionary unpacking on the preceding result and pass it to the model as input.

    outputs = self._model(**inputs)
  6. Apply softmax to the outputs to get the probability of each sentiment class.

    scores = tf.nn.softmax(outputs.logits, axis=-1).numpy()

    The result of the preceding operation is a two-dimensional numpy array. Each element in the array is a list that contains the probability that the sentiment of the corresponding sentence is negative, neutral, or positive, respectively. For example, you may get the following result if you use a single sentence as input. The result shows that the input is more positive than negative, but is most likely neutral.

    array([[0.21346861, 0.46771246, 0.318819]])
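
If you only need the winning class, a small helper can map a row of probabilities to its label. This is a convenience sketch, not part of the model's API, and it assumes the negative/neutral/positive column order described above:

import numpy as np

# Hypothetical helper: return the most probable sentiment label for one
# sentence, assuming the column order [negative, neutral, positive].
def to_label(probabilities: np.ndarray) -> str:
    return ['negative', 'neutral', 'positive'][int(np.argmax(probabilities))]

to_label(scores[0])  # 'neutral' for the sample output above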

Examples

The following examples demonstrate usage of the FinBERT model.

Example 1: 10-day Sentiment Score

The following algorithm selects a volatile asset at the beginning of each month. It gets the Tiingo News articles that were released for the asset over the previous 10 days and then feeds them into the pre-trained FinBERT model. It then aggregates the sentiment scores of all the news releases. If the aggregated sentiment score is positive, it enters a long position for the month. If it's negative, it enters a short position for the month.

from AlgorithmImports import *

import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer, set_seed

class FinbertBaseModelAlgorithm(QCAlgorithm):

    def initialize(self):
        self.set_start_date(2022, 1, 1)
        self.set_end_date(2022, 2, 1)
        self.set_cash(100_000)

        spy = Symbol.create("SPY", SecurityType.EQUITY, Market.USA)
        self.universe_settings.resolution = Resolution.DAILY
        self.universe_settings.schedule.on(self.date_rules.month_start(spy))
        self._universe = self.add_universe(
            lambda fundamental: [
                # Select the most volatile of the 10 assets with the highest
                # dollar volume, measured by the standard deviation of daily
                # returns over the trailing year.
                self.history(
                    [f.symbol for f in sorted(fundamental, key=lambda f: f.dollar_volume)[-10:]], 
                    timedelta(365), Resolution.DAILY
                )['close'].unstack(0).pct_change().iloc[1:].std().idxmax()
            ]
        )

        set_seed(1, True)
        
        # Load the tokenizer and the model
        model_path = "ProsusAI/finbert"
        self._tokenizer = BertTokenizer.from_pretrained(model_path, local_files_only=True)
        self._model = TFBertForSequenceClassification.from_pretrained(model_path, local_files_only=True)

        self._last_rebalance_time = datetime.min
        self.schedule.on(
            self.date_rules.month_start(spy, 1),
            self.time_rules.midnight,
            self._trade
        )

        self.set_warm_up(timedelta(30))

    def on_warmup_finished(self):
        self._trade()

    def on_securities_changed(self, changes):
        # Remove the news feed of assets that leave the universe and
        # subscribe to Tiingo News for assets that enter it.
        for security in changes.removed_securities:
            self.remove_security(security.dataset_symbol)
        for security in changes.added_securities:
            security.dataset_symbol = self.add_data(TiingoNews, security.symbol).symbol

    def _trade(self):
        # Only trade after warm-up and at most once every 14 days.
        if self.is_warming_up or self.time - self._last_rebalance_time < timedelta(14):
            return

        # Get the target security.
        security = self.securities[list(self._universe.selected)[0]]

        # Get the latest news articles.
        articles = self.history[TiingoNews](security.dataset_symbol, 10, Resolution.DAILY)
        article_text = [article.description for article in articles]
        if not article_text:
            return

        # Prepare the input sentences
        inputs = self._tokenizer(article_text, padding=True, truncation=True, return_tensors='tf')

        # Get the model outputs
        outputs = self._model(**inputs)

        # Apply softmax to the outputs to get probabilities
        scores = tf.nn.softmax(outputs.logits, axis=-1).numpy()
        self.log(f"{str(scores)}")
        scores = self._aggregate_sentiment_scores(scores)
        
        self.plot("Sentiment Probability", "Negative", scores[0])
        self.plot("Sentiment Probability", "Neutral", scores[1])
        self.plot("Sentiment Probability", "Positive", scores[2])

        # Rebalance
        weight = 1 if scores[2] > scores[0] else -0.25
        self.set_holdings(security.symbol, weight, True)
        self._last_rebalance_time = self.time

    def _aggregate_sentiment_scores(self, sentiment_scores):
        n = sentiment_scores.shape[0]
        
        # Generate exponentially increasing weights
        weights = np.exp(np.linspace(0, 1, n))
        
        # Normalize weights to sum to 1
        weights /= weights.sum()
        
        # Apply weights to sentiment scores
        weighted_scores = sentiment_scores * weights[:, np.newaxis]
        
        # Aggregate weighted scores by summing them
        aggregated_scores = weighted_scores.sum(axis=0)
        
        return aggregated_scores
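
To illustrate the weighting scheme, with three articles the normalized exponential weights are roughly [0.19, 0.31, 0.51], so the most recent article (the last row of the history) contributes about half of the aggregate score:

import numpy as np

# Normalized exponential weights for n=3 articles, oldest to newest.
weights = np.exp(np.linspace(0, 1, 3))
weights /= weights.sum()
print(weights)  # approximately [0.186, 0.307, 0.506]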

Example 2: 30-day Sentiment Score With Fine Tuning

The following algorithm selects a volatile asset at the beginning of each month. It gets the Tiingo News articles that were released for the asset over the previous 30 days to generate the training set. The label is the market return that occurs from the current news release to the next news release. The algorithm then fine-tunes the model, calculates the sentiment, and rebalances the portfolio.

from AlgorithmImports import *

import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer, set_seed
from datasets import Dataset
import pytz
import torch  # Required to load the PyTorch checkpoint when from_pt=True

class FinbertBaseModelAlgorithm(QCAlgorithm):

    def initialize(self):
        self.set_start_date(2022, 1, 1)
        self.set_end_date(2022, 2, 1)
        self.set_cash(100_000)

        spy = Symbol.create("SPY", SecurityType.EQUITY, Market.USA)
        self.universe_settings.resolution = Resolution.DAILY
        self.universe_settings.schedule.on(self.date_rules.month_start(spy))
        self._universe = self.add_universe(
            lambda fundamental: [
                # Select the most volatile of the 10 assets with the highest
                # dollar volume, measured by the standard deviation of daily
                # returns over the trailing year.
                self.history(
                    [f.symbol for f in sorted(fundamental, key=lambda f: f.dollar_volume)[-10:]], 
                    timedelta(365), Resolution.DAILY
                )['close'].unstack(0).pct_change().iloc[1:].std().idxmax()
            ]
        )

        set_seed(1, True)
        
        self._last_rebalance_time = datetime.min
        self.schedule.on(
            self.date_rules.month_start(spy, 1),
            self.time_rules.midnight,
            self._trade
        )

        self.set_warm_up(timedelta(30))

        self._model_name = "ProsusAI/finbert"
        self._tokenizer = BertTokenizer.from_pretrained(self._model_name) 

    def on_warmup_finished(self):
        self._trade()

    def on_securities_changed(self, changes):
        # Remove the news feed of assets that leave the universe and
        # subscribe to Tiingo News for assets that enter it.
        for security in changes.removed_securities:
            self.remove_security(security.dataset_symbol)
        for security in changes.added_securities:
            security.dataset_symbol = self.add_data(
                TiingoNews, security.symbol
            ).symbol

    def _trade(self):
        # Only trade after warm-up and at most once every 14 days.
        if (self.is_warming_up or 
            self.time - self._last_rebalance_time < timedelta(14)):
            return

        # Get the target security.
        security = self.securities[list(self._universe.selected)[0]]

        # Get samples to fine-tune the model
        samples = pd.DataFrame(columns=['text', 'label'])
        news_history = self.history(security.dataset_symbol, 30, Resolution.DAILY)
        if news_history.empty:
            return
        news_history = news_history.loc[security.dataset_symbol]['description']
        asset_history = self.history(
            security.symbol, timedelta(30), Resolution.SECOND
        ).loc[security.symbol]['close']
        for i in range(len(news_history.index)-1):
            # Get factor (article description).
            factor = news_history.iloc[i]
            if not factor:
                continue

            # Get the label (the market reaction to the news, for now).
            release_time = self._convert_to_eastern(news_history.index[i])
            next_release_time = self._convert_to_eastern(news_history.index[i+1])
            reaction_period = asset_history[
                (asset_history.index > release_time) &
                (asset_history.index < next_release_time + timedelta(seconds=1))
            ]
            if reaction_period.empty:
                continue
            label = (
                (reaction_period.iloc[-1] - reaction_period.iloc[0]) 
                / reaction_period.iloc[0]
            )
            
            # Save the training sample.
            samples.loc[len(samples), :] = [factor, label]

        samples = samples.iloc[-100:]
        
        if samples.shape[0] < 10:
            self.liquidate()
            return
        
        # Classify the market reaction into positive/negative/neutral:
        # 75% of the most negative labels => class 0 (negative)
        # 75% of the most positive labels => class 2 (positive)
        # Remaining labels                => class 1 (neutral)
        sorted_samples = samples.sort_values(by='label', ascending=False).reset_index(drop=True)
        percent_signed = 0.75
        positive_cutoff = (
            int(percent_signed 
            * len(sorted_samples[sorted_samples.label > 0]))
        )
        negative_cutoff = (
            len(sorted_samples) 
            - int(percent_signed * len(sorted_samples[sorted_samples.label < 0]))
        )
        sorted_samples.loc[list(range(negative_cutoff, len(sorted_samples))), 'label'] = 0
        sorted_samples.loc[list(range(positive_cutoff, negative_cutoff)), 'label'] = 1
        sorted_samples.loc[list(range(0, positive_cutoff)), 'label'] = 2       

        # Load the pre-trained model.
        model = TFBertForSequenceClassification.from_pretrained(
            self._model_name, num_labels=3, from_pt=True
        )
        # Compile the model.
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5), 
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        )
        # Create the training dataset.
        dataset = Dataset.from_pandas(sorted_samples)
        dataset = dataset.map(
            lambda sample: self._tokenizer(
                sample['text'], padding='max_length', truncation=True
            )
        )
        dataset = model.prepare_tf_dataset(
            dataset, shuffle=True, tokenizer=self._tokenizer
        )
        # Train the model.
        model.fit(dataset, epochs=2)
        # Prepare the input sentences.
        inputs = self._tokenizer(
            list(samples['text'].values), padding=True, truncation=True, 
            return_tensors='tf'
        )

        # Get the model outputs.
        outputs = model(**inputs) 

        # Apply softmax to the outputs to get probabilities.
        scores = tf.nn.softmax(outputs.logits, axis=-1).numpy()
        scores = self._aggregate_sentiment_scores(scores)
        
        self.plot("Sentiment Probability", "Negative", scores[0])
        self.plot("Sentiment Probability", "Neutral", scores[1])
        self.plot("Sentiment Probability", "Positive", scores[2])

        # Rebalance.
        weight = 1 if scores[2] > scores[0] else -0.25
        self.set_holdings(security.symbol, weight, True)
        self._last_rebalance_time = self.time

    def _convert_to_eastern(self, dt):
        return dt.astimezone(pytz.timezone('US/Eastern')).replace(tzinfo=None)

    def _aggregate_sentiment_scores(self, sentiment_scores):
        n = sentiment_scores.shape[0]
        
        # Generate exponentially increasing weights
        weights = np.exp(np.linspace(0, 1, n))
        
        # Normalize weights to sum to 1
        weights /= weights.sum()
        
        # Apply weights to sentiment scores
        weighted_scores = sentiment_scores * weights[:, np.newaxis]
        
        # Aggregate weighted scores by summing them
        aggregated_scores = weighted_scores.sum(axis=0)
        
        return aggregated_scores
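
To make the labeling cutoffs in _trade concrete: with 10 samples sorted by descending return, 6 of them positive and 4 negative, positive_cutoff is int(0.75 * 6) = 4 and negative_cutoff is 10 - int(0.75 * 4) = 7, so the four highest returns become class 2 (positive), the bottom three become class 0 (negative), and the middle three become class 1 (neutral).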

These algorithms require a GPU node.
