QC Top 5 Python Natural Language Processing Libraries

Natural language processing (NLP) techniques automate the analysis of text. They apply to many tasks, including search, translation, question answering, part-of-speech tagging, and document parsing. QuantConnect supports several Python libraries for NLP tasks; the top five are Natural Language Toolkit (NLTK), spaCy, Gensim, scikit-learn, and Beautiful Soup. You can find all of the supported Python libraries here.

5 Python Libraries for NLP in QuantConnect

Natural Language Toolkit (NLTK): The tools in this library assist with classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities. See an example using QC's API.
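As a rough illustration of the tokenization and stemming tools mentioned above, here is a minimal sketch using NLTK's RegexpTokenizer and PorterStemmer, chosen because neither needs an extra corpus download. The sample sentence is made up, and this is not taken from the QC example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Made-up sample sentence, used for illustration only.
text = "Analysts are running new models on the markets"

# Split the lowercased text into alphabetic tokens.
tokenizer = RegexpTokenizer(r"[A-Za-z]+")
tokens = tokenizer.tokenize(text.lower())

# Reduce each token to its Porter stem, e.g. "running" becomes "run".
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```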

spaCy: The tools in this library are on par with those from NLTK, but an added level of abstraction makes them more user-friendly outside of academia. Popular tools in the library include part-of-speech tagging and named-entity recognition.

Gensim: The tools in this library are used for unsupervised topic modeling and NLP on large text collections. Unsupervised topic modeling represents documents as vectors and groups text by subject based on the document as a whole. Gensim tools are useful for implementing algorithms such as online Latent Semantic Analysis, Latent Dirichlet Allocation, and word2vec word embeddings.

scikit-learn: This library is primarily used for machine learning. When used with text, it can help create a model that trains on existing text classifications (e.g., is this paragraph positive or negative in sentiment?) and test the model's precision and accuracy at classifying new text. Packages from this library can be used to assign positive and negative sentiment scores to text and to perform the vectorization needed to categorize new text using machine learning.
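To make the vectorize-then-classify idea concrete, here is a minimal sketch that pairs CountVectorizer with a MultinomialNB classifier. The training sentences and labels are invented for illustration, and Naive Bayes is just one reasonable choice of model, not something the article prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up labeled corpus (illustration only).
train_texts = [
    "profits beat expectations and revenue grew",
    "strong growth and record earnings",
    "company misses estimates and cuts guidance",
    "losses widen as sales fall",
]
train_labels = ["positive", "positive", "negative", "negative"]

# CountVectorizer turns text into word-count vectors;
# MultinomialNB learns which words go with which label.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Classify text the model has not seen before.
new_texts = ["record revenue growth", "sales fall and losses widen"]
predictions = model.predict(new_texts)
print(list(zip(new_texts, predictions)))
```

With a realistic dataset you would hold out part of the labeled text and measure precision and accuracy on it before trusting the predictions.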

Beautiful Soup: A parser for HTML and XML text, this library transforms a document into a tree of four main kinds of Python objects: BeautifulSoup, Tag, NavigableString, and Comment. You can search for these objects in the parsed tree, allowing you to find and work on parts of a large document. BeautifulSoup represents the document as a whole, a Tag corresponds to a tag in the original HTML or XML, a NavigableString is the text within a tag, and a Comment is a special type of NavigableString.
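The four object types can be seen in a short sketch like the one below. The HTML snippet is made up for illustration, and the built-in "html.parser" backend is assumed so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up HTML snippet (illustration only).
html_doc = (
    "<html><body><!-- filing header -->"
    "<p class='summary'>Revenue rose <b>12%</b> this quarter.</p>"
    "</body></html>"
)

# BeautifulSoup object: the document as a whole.
soup = BeautifulSoup(html_doc, "html.parser")

# Tag object: found by searching the parsed tree.
paragraph = soup.find("p", class_="summary")

# NavigableString object: the text inside the <b> tag.
bold_text = paragraph.b.string

# Comment objects: a special kind of NavigableString.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Plain text of the paragraph with the markup removed.
plain_text = paragraph.get_text()
print(plain_text)
```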

Let's focus on NLP applications that analyze documents to forecast stock movements and make use of QuantConnect's alternative data sources.

Processing Data with the NLP Pipeline

Natural language processing starts with raw text, which must be processed before it can be analyzed. The processing pipeline can include the following steps.

Raw Text - text from a source that comes in any format

Tokenization - splitting the text of a document into a list of strings, or tokens (typically words)

Stemming - reducing words to their stem (e.g., running → run)

Lemmatization - reducing words to their dictionary form, or lemma, using vocabulary and the word's part of speech in context (e.g., better → good)

Stop Words - words that are extremely common and hold little meaning (e.g., "the", "a")

Document - cleaned text that is parsed into a list of words
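The steps above can be sketched in plain Python without any NLP library. The stop-word set and the suffix-stripping rule below are deliberately tiny, hypothetical stand-ins for what a library such as NLTK or spaCy provides, and lemmatization is left out because it needs a vocabulary.

```python
import re

# Hypothetical, very small stop-word list (real libraries ship longer ones).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}

# Naive suffixes to strip; a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "s")


def tokenize(raw_text):
    """Lowercase the raw text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", raw_text.lower())


def naive_stem(token):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_document(raw_text):
    """Raw text -> tokens -> stop words removed -> stemmed word list."""
    tokens = tokenize(raw_text)
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [naive_stem(token) for token in kept]


document = build_document("The company is reporting higher profits in the quarter.")
print(document)  # ['company', 'report', 'higher', 'profit', 'quarter']
```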

Using Processed Documents for Tasks

The resulting document of parsed text can be used for many tasks, such as sentiment analysis, which assigns a positive or negative polarity score to words in a document. We can map security objects to the scores assigned to text about each security. The weight of the scores can indicate whether public sentiment towards a company is positive or negative. (You can find a simple example of sentiment analysis that doesn't use an NLP library in this QC Boot Camp lesson.) The benefit of using a library is that you can do much more text processing. Depending on the text you start with, you may need more preprocessing tools.
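A minimal sketch of that mapping, assuming a hand-made word list and plain string tickers in place of QuantConnect security objects, might look like the following. The word sets, the tickers "AAA" and "BBB", and the scoring rule are all hypothetical.

```python
# Hypothetical polarity word lists (illustration only).
POSITIVE_WORDS = {"beat", "growth", "profit", "upgrade"}
NEGATIVE_WORDS = {"loss", "miss", "risk", "downgrade"}


def polarity_score(tokens):
    """Return (positive hits - negative hits) / token count, in [-1, 1]."""
    if not tokens:
        return 0.0
    positive = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return (positive - negative) / len(tokens)


# Processed documents keyed by made-up ticker symbols.
documents_by_symbol = {
    "AAA": ["profit", "beat", "estimate", "growth"],
    "BBB": ["loss", "miss", "guidance", "risk"],
}

# Map each symbol to the score of the text about it.
scores = {
    symbol: polarity_score(tokens)
    for symbol, tokens in documents_by_symbol.items()
}
print(scores)  # {'AAA': 0.75, 'BBB': -0.75}
```

Dividing by the token count keeps long and short documents comparable; a raw hit count would favor longer text.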

Using NLP Techniques on QC Alternative Data

Below are two alternative data sources supported by QuantConnect that can be used for text analysis tasks.

Data Source: Tiingo News

Tiingo crawls the web for financial news articles and delivers each article in the form of a TiingoNews object. Tiingo's data library contains approximately 8,000-12,000 articles per day, going back to January 1, 2014. Tiingo news events are published at the time Tiingo crawls the data. Once a new source is added, the delay between published time and crawl time typically ranges from a few minutes to one hour. The text is delivered as a headline and a one-sentence description of the article, which you would need to tokenize before using it for analysis. You can find a demonstration lesson using Tiingo data here.

Data Source: Edgar SEC Filings

The SEC's 8-K reports are notices to investors issued outside of earnings, with a more qualitative approach and fewer hard figures than a traditional 10-Q report would contain. Sometimes SEC data contains binary data (such as a PDF file embedded as a text blob), whereas other times it's an HTML page or just text.

One use case is to search for <type>EX-99.1; the content below that marker is most likely what you're looking for, although it must be cleaned up before you analyze it for sentiment. The raw text is provided as-is, which makes it a great data source for trying out the processing tools in an NLP library. You can find a demonstration algorithm here.
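Here is a minimal sketch of that search-and-clean step using only the standard library. The helper name extract_exhibit_text and the sample filing layout are assumptions made for illustration; real filings may be laid out differently and may contain binary blobs that this sketch does not handle.

```python
import html
import re


def extract_exhibit_text(raw_filing, marker="<TYPE>EX-99.1"):
    """Return cleaned text found below the marker, or "" if it is absent."""
    # Case-insensitive search, since the marker may appear as <type> or <TYPE>.
    match = re.search(re.escape(marker), raw_filing, flags=re.IGNORECASE)
    if match is None:
        return ""
    below = raw_filing[match.end():]
    # Replace markup tags with spaces, decode entities, collapse whitespace.
    no_tags = re.sub(r"<[^>]+>", " ", below)
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip()


# Made-up filing text (illustration only).
sample_filing = (
    "<DOCUMENT>\n<TYPE>8-K\n<TEXT>cover page</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-99.1\n<TEXT>\n"
    "<p>Acme &amp; Co. reports <b>record</b> sales.</p>\n"
    "</TEXT>\n</DOCUMENT>"
)

exhibit_text = extract_exhibit_text(sample_filing)
print(exhibit_text)  # Acme & Co. reports record sales.
```

The cleaned string can then go through tokenization and the other pipeline steps before sentiment scoring.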

Ultimately, there is so much that can be done with NLP libraries, and this is just a start. We'd love to see your ideas, so please share them with us and the rest of the QC Community in the forum. Happy coding!
