This is not going to make me rich

I finished the DAT203.1 Data Science Essentials course on edX and went through the whole process, from uploading data to publishing a predictive model as a web service. Pretty cool.

But the textbook examples always work, right?

Ever had a crazy idea? What if you could use this technology to predict the stock market?

So let’s see, what do I want?

I want to be able to sleep at night!

Because really, do you want to risk even $10K on a black-box algorithm that you cooked up one late evening after taking an introductory course in data science? Not quite.

So, I want to be in the market only when the market is open. Ideally, I’d like to be able to predict whether a stock is going to finish higher or lower than its open.

And let’s not get too carried away. A prediction of up/down will do for now. I’ll be happy with a classifier that tells me whether the stock is going to finish in the black, or in the red, at the end of the day.

Ooooh, I could even use 4X margin! So even if the stock only moves 0.05%, on a margin account that would still translate to a 0.2% profit. Let’s assume there are roughly 250 trading days in one year… Compounded daily, I’m betting on a minimum return of roughly 65%. On my 10K principal, that’s 6.5K to take the kids to Disney at least twice, I figure.
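For the record, here is the back-of-the-envelope arithmetic behind that number, as a small Python sketch. It takes the figures straight from the paragraph above and quietly assumes the classifier is right every single day, with no commissions and no margin interest, which is of course the whole fantasy.

```python
# Back-of-the-envelope arithmetic for the margin daydream above.
# Assumes a correct call every day and ignores all trading costs.
daily_move = 0.0005    # 0.05% intraday move
leverage = 4           # 4X margin
trading_days = 250     # rough number of trading days in a year
principal = 10_000     # the 10K stake

daily_return = daily_move * leverage                       # 0.2% per day
simple_annual = daily_return * trading_days                # profits not reinvested
compound_annual = (1 + daily_return) ** trading_days - 1   # profits reinvested daily

print(f"daily return on margin: {daily_return:.2%}")
print(f"annual, simple:         {simple_annual:.1%}")
print(f"annual, compounded:     {compound_annual:.1%}")
print(f"profit on principal:    {principal * compound_annual:,.0f}")
```

Without reinvesting the daily gains the figure drops to 50%, so the roughly 65% only holds if every day's profit goes back into the next day's trade.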

So much for the hypotheticals. And boy was I wrong about those hypotheticals.

But let’s go through the experiment anyway.

I started by getting the historical daily data for SPY, the S&P 500 ETF tracker. From there, I computed the intraday return for each day: (close – open) / open.

Because I’m only interested in a binary up-or-down prediction, I added another column, JWhite, that’s true when the intraday return was up and false when it was down.
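In code, those two columns boil down to something like the sketch below. The prices are made up for illustration (they are not real SPY quotes), and treating a perfectly flat day as "not up" is my own assumption, since the definition above only covers up and down.

```python
def intraday_return(open_price, close_price):
    """Return (close - open) / open for one session."""
    return (close_price - open_price) / open_price


def label_sessions(bars):
    """Add 'intraday_return' and 'JWhite' to each bar (a dict with 'open' and 'close')."""
    labeled = []
    for bar in bars:
        r = intraday_return(bar["open"], bar["close"])
        # Assumption: a flat day (r == 0) is labeled False, i.e. "not up".
        labeled.append({**bar, "intraday_return": r, "JWhite": r > 0})
    return labeled


# Made-up prices for illustration only.
bars = [
    {"date": "day 1", "open": 100.0, "close": 101.0},
    {"date": "day 2", "open": 101.0, "close": 100.5},
    {"date": "day 3", "open": 100.5, "close": 100.5},
]
labeled = label_sessions(bars)
for row in labeled:
    print(row["date"], f"{row['intraday_return']:+.4f}", row["JWhite"])
```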

Add a couple of technical indicators computed over the days leading up to each session, and we’re good to go.

Oh, wait, about those indicators… I remember Bollinger writing something about it being better to rank indicator values rather than work with their absolute values. So I converted all absolute technical indicators to their 50-day ranked values.
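There is more than one way to "rank" a value over a window, so take the following as one plausible reading rather than the exact formula: each value is replaced by the fraction of the trailing window (itself included) that is less than or equal to it. The function name, the indicator readings, and the tiny window in the demo are all hypothetical.

```python
def rolling_rank(values, window=50):
    """Percentile rank of each value within its trailing window (current value included).

    Returns None until a full window is available. The rank is the fraction of
    values in the window that are <= the current value, so it lies in (0, 1].
    """
    ranks = []
    for i, current in enumerate(values):
        if i + 1 < window:
            ranks.append(None)  # not enough history yet
            continue
        recent = values[i + 1 - window : i + 1]
        ranks.append(sum(1 for v in recent if v <= current) / window)
    return ranks


# Hypothetical indicator readings; window=3 just keeps the demo short.
indicator = [10, 12, 11, 15, 9, 14]
ranks = rolling_rank(indicator, window=3)
print(ranks)
```

The point of ranking is that the model sees "high or low compared to the last 50 days" instead of a raw level that may drift over the years.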

Eventually, my dataset looks like this:

So now I can set up my experiment to train and evaluate a model that predicts JWhite, and what do you expect to see?

Ehm, yeah, right… I got a perfect prediction from the first attempt?

Can anybody tell me what I did wrong?

The problem is that I have a perfect predictor of JWhite within my dataset: I included both the intraday return and the JWhite variable, and JWhite is nothing more than the sign of that return.

So let’s add a feature selector to the experiment that gets rid of the intraday return:
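A quick sanity check would have caught this before any model was trained: if a one-line rule on a single column reproduces the target perfectly, that column is leaking the answer. A minimal sketch, with made-up rows and a hypothetical indicator column called rsi_rank:

```python
def leak_accuracy(rows, feature, target):
    """Accuracy of the trivial rule 'target is true when feature > 0'."""
    hits = sum(1 for row in rows if (row[feature] > 0) == row[target])
    return hits / len(rows)


# Made-up rows for illustration; 'rsi_rank' is a hypothetical indicator column.
rows = [
    {"intraday_return": 0.004, "rsi_rank": 0.20, "JWhite": True},
    {"intraday_return": -0.002, "rsi_rank": 0.90, "JWhite": False},
    {"intraday_return": 0.001, "rsi_rank": 0.60, "JWhite": True},
    {"intraday_return": -0.006, "rsi_rank": 0.40, "JWhite": False},
]

accuracy = leak_accuracy(rows, "intraday_return", "JWhite")
print(f"sign of intraday_return predicts JWhite with accuracy {accuracy:.0%}")
```

Any learner will find that rule in seconds, which is exactly what a perfect score on the first attempt looks like.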

What do we get now?

Hmm, a diagonal this time. Looks like I’m about as good as a coin toss. Pretty bad actually. This is truly about the worst outcome one can have, since a curve below the diagonal could at least be reversed to get some results.
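To make that last remark concrete, assuming the plot in question is a ROC curve: the area under it (AUC) is the chance that a randomly picked up day gets a higher score than a randomly picked down day. The sketch below uses made-up labels and scores, not my actual results, to show that a consistently wrong model becomes a good one once you flip its scores, while an uninformative one stays stuck at 0.5.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data for illustration only.
labels = [True, False, True, False, True, False]
wrong_scores = [0.2, 0.8, 0.3, 0.7, 0.4, 0.9]   # consistently backwards
flipped_scores = [1 - s for s in wrong_scores]  # same model, prediction reversed
coin_scores = [0.5] * len(labels)               # carries no information at all

print("backwards model:", auc(labels, wrong_scores))
print("after flipping: ", auc(labels, flipped_scores))
print("coin toss:      ", auc(labels, coin_scores))
```

Flipping a model with AUC a gives 1 - a, so 0.5 is the one value that no amount of flipping can improve.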

Well, it looks like this technology isn’t going to make me rich, but the exercise taught me a few more things about data science at least.

So did you ever try anything like this? What are your most hilarious failures in data science? Let’s get a conversation started in the comments below!