
5 Painful Mistakes I Made Creating an ML Trading Bot, So You Don’t Need To

valuezone 11 February 2023


FOREWORD

In this series of articles, I describe my road to creating a reliable, fully automated trading bot based on data from the 500 companies on the S&P list, and I share my knowledge and experience along the way.

Here is a short recap of what has been done so far:

  1. I collected and merged into one dataset the companies’ financial data along with some external factors like US bond yields, market volatility, etc. My guess is that financial statements and major market indicators should explain market valuation better than time-series wave analysis.
  2. My latest attempt at stock movement prediction showed promising results in terms of model precision, so I decided to move on and put it live to test it in the real world. You can read more about its creation here: “10 Years of Financial Data from 500 Traded Companies Data Led Me To 295% Profitable ML Model”

Now let’s get into the details!

It’s been a while since my last post on Medium, and most of that time I’ve been on pause because of shifts in my personal life: I moved with my family to another city and changed jobs. Beyond that, I spent a lot of time rebuilding my scripts because of the silly mistakes I made along the way. Some of them are well known, but some are not publicly discussed.

Here are the five lessons I want you to remember so that you don’t repeat my mistakes.

I. QUESTION YOUR INPUT DATA

The perfect example of this point relates to the stock prices themselves.

If you have ever checked historical stock price data online, in the majority of cases you were looking at the adjusted price. The adjusted price is a convenient way to show historical changes because it takes into account every split that has ever happened and scales past prices to today’s basis, so we don’t see big jumps or drops on a chart.


Coca-Cola price adjusted for dividends (orange) and not adjusted for dividends (blue)

What you may not know is that the adjusted price also takes the dividends paid into account. For some companies the difference in the actual price is only slight year over year, but compounded over time it becomes substantial.

In the chart above, you can see a comparison of two price calculations for the Coca-Cola company since 2000. The orange line is the one you would publicly see as the standard adjusted price. The blue one is adjusted for historical share splits but not for the dividends paid. As you can see, the difference in values 20 years ago is about 1.5 times!

As we build our model on historical data, we need the values in our dataset to be as close to the real world as possible, and small details like this can skew your model substantially.
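If you want to see the gap yourself, here is a minimal sketch using the free yfinance package (chosen purely for illustration; it is not necessarily the data source used in this project). With `auto_adjust=False`, Yahoo Finance returns both the split-only adjusted close and the dividend-adjusted close:

```python
# Minimal sketch: compare split-only vs. dividend-adjusted prices for Coca-Cola.
# yfinance is used for illustration only and is not this project's data source.
import yfinance as yf

# auto_adjust=False keeps both "Close" (adjusted for splits only)
# and "Adj Close" (adjusted for splits and dividends).
ko = yf.download("KO", start="2000-01-01", auto_adjust=False)

# Looking at the earliest and latest rows shows how the two series
# drift apart the further back in time you go.
print(ko[["Close", "Adj Close"]].iloc[[0, -1]])
```

Whichever source you use, decide explicitly which of the two series your model should learn from and keep it consistent across the whole dataset.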

II. SELECT THE RELIABLE DATA SOURCES FROM THE VERY BEGINNING

Sounds easy, right? Read my story before skipping this part.

As I said before, I spent a lot of time rebuilding my scripts from scratch, mostly the ETL processes. Anyone who works in this area knows that this is the most labour-intensive and substantial part of the whole discipline.

Why did I do that? Your guess is right: I started having problems with the data sources the very next day after going live!


Photo by Elisa Ventur on Unsplash

A lot of the data I pulled came from web pages. Even though the number of requests wasn’t big, the websites started running bot checks that interrupted my scripts. I could have played a cat-and-mouse game for a while and kept adjusting my logic, but it’s annoying and I definitely did not want to spend time on this.

I was eager to find an API that would work consistently without major headaches. But the other painful discovery was that financial data can cost a pretty penny. For example, data from Zack.com, which I used in my script, can run up to $1,000 per month for the different types of requests. Since this is still a hobby project, and I had just been pulling the same data for free, I wasn’t ready to open my wallet easily.

The search for new sources took me a month, and after comparing all the perks and available information, I found a resource whose API contains everything I need and doesn’t cost a fortune. If anyone is interested, I encourage you to try it. The only drawback is that they only have about 5 years of financial data, though the history keeps growing.

If you want to learn from others’ mistakes, take this seriously from the very beginning and invest your time and some money in good data sources. The quality of your source data should be the last thing you have to worry about.

III. DON’T IGNORE K-FOLD VALIDATION

I used to split my datasets into Train / Test / Validation parts and pick the best model based on the test and validation results. If any of you did the same and skipped K-Fold validation, I can give you a perfect example of why it is essential for your future models.

There are a number of different K-Fold variants to choose from. In my case, I used the Time Series Split since I am working with time-series data, even though my approach doesn’t rely on dedicated time-series models like ARIMA or Prophet.

The idea is that after the project goes live, I would like to retrain my model every 2–3 months as more data is collected. And at exactly this point, I realized that the standard approach with the Train / Test / Validation split gives little guarantee that the model would perform at the same level after retraining.


Time Series Split version of K-Fold validation logic

Without K-Fold validation I would basically be blindly trusting that my model is fine under the hood, knowing that I selected it based on a single run. Wouldn’t it be better to run multiple tests on the training data and simulate the real-world scenario I am about to implement?

That’s how I understood that K-Fold validation is important when you’re planning to launch a model that will retrain incrementally on new data. By the way, it really helped me fine-tune my variables to get good average performance.
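For illustration, here is a minimal sketch of that setup using scikit-learn’s TimeSeriesSplit. The model, the features and the scoring metric are placeholders, not my actual pipeline; the point is that every fold trains on the past and validates on the slice of time that follows it, which mirrors the planned retraining cycle:

```python
# Minimal sketch of time-series cross-validation; model, data and metric
# are illustrative placeholders rather than the bot's real configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Placeholder data: rows must already be ordered chronologically.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 40))           # ~40 engineered fundamental features
y = (rng.random(1000) > 0.8).astype(int)  # 1 = "buy" signal, 0 = no signal

model = RandomForestClassifier(n_estimators=200, random_state=42)

# Each fold trains on the past and validates on the next slice of time,
# simulating the "retrain every 2-3 months" workflow described above.
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv, scoring="precision")

print("Precision per fold:", np.round(scores, 3))
print("Mean:", scores.mean().round(3), "| Std:", scores.std().round(3))
```

Looking at the spread of the fold scores, rather than a single test-set number, is what tells you whether the model is likely to hold up after each retraining.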

IV. THE EDA OF YOUR INPUT VARIABLES IS IMPORTANT

Exploratory Data Analysis is an integral part of any project. Sometimes there is a temptation to skip it because all the input variables seem to be examined and known. That was my case.

If you remember, I had to change my API data provider and redo the variable creation process from the very beginning. All the variables in my model go through a long process of transformation and combination. I spent a lot of time getting everything as clean as possible before it went into the model, and after the exercise was over I believed there was nothing new that EDA could tell me.

But those who don’t want to learn the proper way learn the hard way.


Photo by Kenny Eliason on Unsplash

When I got to the model training step, to my surprise it showed extremely bad performance. That fact is not a problem in itself; the problem is that I knew the model had worked well on the same variables pulled from the previous API before the switch.

In total the dataset contains about 40 columns, and it took me an enormous amount of time to go back through every calculation and test it, trying to find where the mistake had been made. In the end I fixed it, but if I had done my homework and checked / plotted / compared the data, this would never have happened!
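Here is the kind of quick sanity check I should have run the moment I switched providers. The file and column names are hypothetical; the idea is simply to compare the same engineered columns from the old and the new source before anything goes into the model:

```python
# Hedged sketch: compare the same feature columns from two data providers.
# File names, index columns and the 10% threshold are illustrative assumptions.
import pandas as pd

old = pd.read_csv("features_old_provider.csv", index_col=["ticker", "date"])
new = pd.read_csv("features_new_provider.csv", index_col=["ticker", "date"])

common = old.columns.intersection(new.columns)
report = pd.DataFrame({
    "old_mean": old[common].mean(),
    "new_mean": new[common].mean(),
    "old_null_share": old[common].isna().mean(),
    "new_null_share": new[common].isna().mean(),
})

# Flag columns whose means diverge by more than 10% - prime suspects for
# unit mismatches, different split/dividend handling or broken calculations.
report["mean_ratio"] = report["new_mean"] / report["old_mean"]
suspicious = report[(report["mean_ratio"] - 1).abs() > 0.10]
print(suspicious.sort_values("mean_ratio"))
```

Half an hour of plotting and comparing distributions like this would have saved me days of rolling back calculations one by one.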

V. EDA OF THE FINAL RESULTS HAS EQUAL IMPORTANCE

The assumption is that once we have fine-tuned the model and achieved an acceptable level of performance metrics (recall, precision, F1, etc.), we can push the model to production.

But I would recommend checking whether the model complies with all the assumptions you had before creating it, especially if the model's internal computation is more difficult to interpret than a decision tree.

In my case, the assumption was that once my creation went live, I would get signals to buy a specific stock at a fairly steady pace, let’s say once every two weeks. Based on this idea, I imagined it as infrequent trading with up to 10 trades per month.

But after a month of being live, I did not get a single signal. That made me investigate what was going on.

When I dug deeper into the details, I found that the signals were far from evenly distributed over time. Sometimes there might be one signal per year, sometimes a run of signals for two months straight, and sometimes signals from several companies on a single day.

That means the real-world values do not support the financial strategy I had in mind. From a financial perspective, it means that I can’t simply reinvest, say, $10K every month and watch it grow; instead, I need a much bigger pile of free cash and have to wait until the moment comes.
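A simple post-hoc check like the sketch below would have surfaced this much earlier. The column names are placeholders; the point is just to count buy signals per month and look at the gaps between them:

```python
# Hedged sketch: how are the model's buy signals distributed over time?
# The CSV layout (columns "date", "ticker", "signal") is a placeholder.
import pandas as pd

signals = pd.read_csv("model_signals.csv", parse_dates=["date"])
buys = signals[signals["signal"] == 1]

# Buy signals per calendar month: anything but the expected steady
# "up to 10 trades per month" - some months are empty, others crowded.
per_month = buys.set_index("date").resample("M")["ticker"].count()
print(per_month.describe())

# Gaps in days between consecutive signals across all companies.
gaps = buys.sort_values("date")["date"].diff().dt.days.dropna()
print("Median gap:", gaps.median(), "days | Max gap:", gaps.max(), "days")
```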


ABC Stock Price and Trading signals

In the picture above you can see the stock price (green), the buy signals (blue), and the stock price shifted by 30 days (orange), which represents the value at which the stock is theoretically sold.
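For completeness, here is a small sketch of the evaluation behind such a chart: shift the price series about 30 bars forward as a proxy for the theoretical sell value and measure the outcome of each signal. File and column names are again placeholders:

```python
# Hedged sketch: theoretical outcome of each buy signal, assuming the position
# is sold at the price ~30 trading days later. Paths and columns are placeholders.
import pandas as pd

prices = pd.read_csv("abc_prices.csv", parse_dates=["date"], index_col="date")["close"]
signal_dates = pd.read_csv("abc_signals.csv", parse_dates=["date"])["date"]

# Price roughly 30 trading days after each bar - a proxy for the
# "shifted by 30 days" series plotted in orange above.
sell_price = prices.shift(-30)

# Buy at the signal-day close, sell at the shifted price.
trade_returns = (sell_price / prices - 1).reindex(signal_dates.values)
print(trade_returns.describe())
```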