Accurately predicting stock performance involves acquiring highly coveted data, and a new prototype using Natural Language Processing (NLP) and Deep Learning is showing very promising results.
“For investment firms, predicting likely under-performers may be the most valuable prediction of all, allowing them to avoid losses on investments that will not fare well,” writes Patty Ryan, Principal Data and Applied Scientist at Microsoft.
By partnering with a financial services company “to develop a model to predict the future stock market performance of public companies in categories where they invest,” the team at Microsoft modeled its prototype on just one industry, the biotechnology industry, which had the most abundant within-industry sample.
The project goal was to discern whether the model could outperform the chance accuracy of 33.33%, and the results went well beyond that baseline: the model predicted the under-performing company with 62% accuracy, nearly double chance.
But how did they do it? The answer lies among industry buzzwords and a whole lot of code.
If you are a developer or familiar with these tools and concepts, Patty Ryan did a remarkable job walking you through the entire process, step-by-meticulous-step on the Microsoft Developer blog.
However, I will attempt to summarize and lay out how they did it here.
The stock performance prediction prototype's technical aspects
Natural Language Processing (NLP), pre-processing, and Deep Learning were utilized in order “to prototype a predictive model to render consistent judgments on a company’s future prospects, based on the written textual sections of public earnings releases extracted from 10k releases and actual stock market performance.”
For input, the team gathered “a text corpus of two years of earnings release information for thousands of public companies worldwide.” They “extracted as source the sections 1, 1A, 7 and 7A from each company’s 10k — the business discussion, management overview, and disclosure of risks and market risks.”
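The extraction step can be sketched as a simple heuristic over a plain-text filing. The `extract_items` helper and its `Item N.` heading pattern below are assumptions for illustration only; real 10-K filings (often HTML, with a table of contents that repeats each heading) require far more robust parsing.

```python
import re

# Hypothetical sketch: pull selected items out of a plain-text 10-K
# by matching "Item N." headings at the start of a line.
WANTED = {"1", "1A", "7", "7A"}

def extract_items(filing_text, items=WANTED):
    # Locate every "Item X." heading and record where each begins.
    pattern = re.compile(r"^Item\s+(\d+A?)\.", re.IGNORECASE | re.MULTILINE)
    matches = list(pattern.finditer(filing_text))
    extracted = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        item = m.group(1).upper()
        if item in items:
            # Keep the first occurrence of each wanted item.
            extracted.setdefault(item, filing_text[start:end].strip())
    return extracted
```

A filing parsed this way yields a dictionary keyed by item number, ready to be concatenated into the per-company text corpus.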
Additionally, they “gathered the stock price of each of the companies on the day of the earnings release and the stock price four weeks later,” categorizing the public companies by industry category.
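Pairing the release-day price with the price four weeks later yields a return that can be bucketed into classes. The tercile split below is an assumption, chosen because three balanced classes produce exactly the 33.33% chance baseline the article cites; the project's actual class definitions are not detailed here.

```python
def four_week_return(price_at_release, price_four_weeks_later):
    # Simple fractional return over the four-week window.
    return (price_four_weeks_later - price_at_release) / price_at_release

def tercile_labels(returns):
    # Assumed labeling scheme: split returns into three equal buckets,
    # so random guessing scores ~33.33% on the resulting classes.
    ranked = sorted(returns)
    n = len(ranked)
    lo_cut, hi_cut = ranked[n // 3], ranked[(2 * n) // 3]
    return ["under" if r < lo_cut else "over" if r >= hi_cut else "inline"
            for r in returns]
```

Within-industry labeling (here, biotechnology only) keeps the classes comparable, since baseline volatility differs across industries.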
The tools used included Python with Azure Machine Learning Workbench, Jupyter Notebook, and NLP tools including the Gensim library.
Machine Learning, according to Microsoft General Manager and Executive Producer of the Decoded Show, Dave Mendlen, is something that happens “on the server side” to “bring the best information forward.”
“Let’s take this technology and enable it to learn on its own,” says Mendlen on Machine Learning, adding, “and put that in the back-end for developers to take use of. If you are building an application, you can use that to do amazing things. They tend to be things that I’ll call back-end things or processing things that the user doesn’t necessarily see directly.”
One of the difficulties that arose in the stock performance prediction prototype was that the “pre-trained word vectors” they used as a model had a limited vocabulary of some 400,000 words. Many industries have specific vocabulary that is not used outside their particular niche. However, the “GloVe pre-trained model of all of Wikipedia’s 2014 data” did prove useful in allowing the team to “vectorize” its document set and prepare it for deep learning toolkits.
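A minimal sketch of that vectorization step, assuming GloVe's plain-text format (`word v1 v2 …`) and simple per-document vector averaging: the tiny in-memory sample stands in for the roughly 400,000-word pre-trained file, and out-of-vocabulary tokens (such as niche biotech jargon) are silently dropped, which is precisely the vocabulary limitation described above.

```python
# Tiny stand-in for the GloVe file format; the real pre-trained file
# covers ~400,000 words.
GLOVE_SAMPLE = """\
drug 0.1 0.2 0.3
trial 0.2 0.0 0.4
risk -0.1 0.3 0.1
"""

def load_glove(text):
    # Parse "word v1 v2 ..." lines into a word -> vector dictionary.
    vectors = {}
    for line in text.strip().splitlines():
        word, *vals = line.split()
        vectors[word] = [float(v) for v in vals]
    return vectors

def doc_vector(tokens, vectors, dim=3):
    # Average the vectors of in-vocabulary tokens; out-of-vocabulary
    # tokens contribute nothing, illustrating the coverage problem.
    hits = [vectors[t] for t in tokens if t in vectors]
    if not hits:
        return [0.0] * dim
    return [sum(col) / len(hits) for col in zip(*hits)]
```

In practice the Gensim library can load the full pre-trained vectors, but the averaging logic is the same idea.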
After embedding all the documents and data, they were then “able to take advantage of a convolutional neural network (CNN) model to learn the classifications.”
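The team built their CNN with a deep learning toolkit; the toy function below only illustrates the core mechanism, a one-dimensional convolution over embedded tokens followed by max-pooling, which is how a text CNN turns local word-sequence patterns into features for classification.

```python
def conv1d_maxpool(seq, kernel):
    # Toy 1-D convolution for text: slide a filter of width len(kernel)
    # over a sequence of token embeddings, then max-pool the responses
    # into a single feature. A real model learns many such filters.
    width = len(kernel)
    dim = len(seq[0])
    responses = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Dot product of the window with the kernel, position by position.
        responses.append(sum(window[j][d] * kernel[j][d]
                             for j in range(width) for d in range(dim)))
    return max(responses)
```

Each learned filter responds strongly wherever its pattern appears in the document, regardless of position, which suits long, loosely structured filings.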
More number crunching, embedding, and model training ensued, and in the end, the “prototype model results, while modest, suggest there is a useful signal available on future performance classification in at least the biotechnology industry based on the target text from the 10-K.”
The future looks promising
Ryan says that “while the model needs to be improved with more samples, refinements of domain-specific vocabulary, and text augmentation, it suggests that providing this signal as another decision input for investment analyst would improve the efficiency of the firm’s analysis work.”
“Overall, this prototype validated additional investment by our partner in natural language based deep learning to improve efficiency, consistency, and effectiveness of human reviews of textual reports and information.”
So, stock performance prediction stood at a 33.33% chance baseline before the project began, and the team raised that to 62% accuracy through NLP, Deep Learning, a convolutional neural network, and a host of developer tools.