This is not going to make me rich

I finished the DAT203.1 Data Science Essentials course on edX and went through the whole process, from uploading data to publishing a predictive model via a web service. Pretty cool.

But the textbook examples always work, right?

Ever had a crazy idea? What if you could use this technology to predict the stock market?

So let’s see, what do I want?

I want to be able to sleep at night!

Because really, do you want to risk even $10K on a black-box algorithm that you cooked up one late evening after taking an introductory course in data science? I didn’t think so.

So, I want to be in the market only when the market is open. Ideally, I’d like to be able to predict whether a stock is going to finish higher or lower than its open.

And let’s not get too carried away. A prediction of up/down will do for now. I’ll be happy with a classifier that tells me whether the stock is going to finish in the black or in the red at the end of the day.

Ooooh, I could even use 4X margin! So even if the stock only moves 0.05%, on a margin account that would still translate to a 0.2% profit. Let’s assume there are roughly 250 trading days in one year… I’m betting on a minimum return of roughly 65%. On my $10K principal, that’s $6.5K, enough to take the kids to Disney at least twice, I figure.
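As a back-of-envelope sanity check on that arithmetic (a sketch, not investment advice), compounding a 0.2% daily gain over 250 sessions does land near the 65% figure:

```python
# 0.05% daily move x 4x margin = 0.2% per day,
# compounded over roughly 250 trading sessions.
daily = 0.0005 * 4
annual = (1 + daily) ** 250 - 1
# annual comes out to roughly 0.65, i.e. ~65% per year
```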

So much for the hypotheticals. And boy, was I wrong about them.

But let’s go through the experiment anyway.

I started by getting the historical daily prices of the SPY S&P 500 ETF tracker. From there, I computed the daily intraday return: (close – open) / open.

Because I’m merely interested in a binary up-or-down prediction, I added another column, JWhite, that’s true when the intraday return was positive and false when it was negative.
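The two derived columns can be sketched in a few lines of pandas (the column names and toy values are my own, standing in for the real SPY download):

```python
import pandas as pd

# Toy OHLC data standing in for the historical SPY prices.
df = pd.DataFrame({
    "open":  [100.0, 101.0, 102.0],
    "close": [101.0, 100.5, 102.5],
})

# Intraday return as defined in the text: (close - open) / open.
df["intraday_return"] = (df["close"] - df["open"]) / df["open"]

# Binary target: True when the session closed above its open.
df["JWhite"] = df["intraday_return"] > 0
```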

Add a couple of technical indicators from the previous days leading up to the session, and we’re good to go.

Oh, wait, about those indicators… I remember Bollinger writing something about it being better to rank indicator values rather than work with their absolute values. So I converted all absolute technical indicators to their 50-day ranked values.
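One way to express that ranking (a minimal sketch; the indicator series and window size of 50 are the only things taken from the text) is a trailing percentile rank over a rolling window:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for one absolute technical indicator series.
indicator = pd.Series(rng.normal(size=200))

def pct_rank(window):
    # Fraction of the trailing window that lies below today's value:
    # 0.0 = lowest value in the window, close to 1.0 = highest.
    return (window < window.iloc[-1]).mean()

# 50-day ranked version of the indicator.
ranked = indicator.rolling(50).apply(pct_rank, raw=False)
```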

Eventually, my dataset looks like this:

So now I can set up my experiment to build and evaluate a model that predicts JWhite. What do you expect to see?

Ehm, yeah, right… I got a perfect prediction on the first attempt?

Can anybody tell me what I did wrong?

The problem is that my dataset contains a perfect predictor of JWhite: I included both the intraday return and the JWhite variable, and JWhite is derived directly from the intraday return.

So let’s add a feature selector to the experiment that gets rid of the daily returns:
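Outside Azure ML Studio, the same leak-removal step boils down to dropping the offending column before training (a sketch with hypothetical column names, not the actual Studio module):

```python
import pandas as pd

# Hypothetical feature table; intraday_return is the leaky column,
# because the JWhite target is computed directly from it.
df = pd.DataFrame({
    "rsi_rank":        [0.2, 0.8, 0.5],
    "intraday_return": [0.01, -0.004, 0.002],
    "JWhite":          [True, False, True],
})

# Anything derived from the same session's close has to go.
leaky = ["intraday_return"]
X = df.drop(columns=leaky + ["JWhite"])
y = df["JWhite"]
```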

What do we get now?

Hmm, a diagonal ROC curve this time. Looks like I’m about as good as a coin toss. Pretty bad, actually. This is truly about the worst outcome one can have: a curve below the diagonal could at least be inverted to get some predictive power.

Well, it looks like this technology isn’t going to make me rich, but the exercise taught me a few more things about data science at least.

So did you ever try anything like this? What are your most hilarious failures in data science? Let’s get a conversation started in the comments below!

A first project

The edX track that I’m currently following is sponsored by Microsoft. There’s no particular reason for that: there are many distance-learning options out there, and the combination of MIT, Harvard, and Microsoft seemed like reasonable credentials.

But courses are only as good as your ability to apply them. So let’s see if I’ve actually learned something useful so far.

The start-up I’m involved in, Pathomation, has developed software for digital microscopy (and by extension also pathology and histology). This software is used at the Free Brussels University (VUB).

As part of my job as the VUB’s digital pathology manager, we built an interactive web portal for their diabetes biobank.

Long story short: now that we have (some of the) data online, we want to go further and integrate with the back-end.

The premise is simple: for workflow and sample tracking, the laboratory uses SLims, a biobank information management system by Genohm. It is on this end that Pathomation now also wants to offer slide visualization services.

In order to do that, Genohm has asked for a test set. The test set should contain sample information, in addition to which slides are associated with each sample. The sample information is contained within a Microsoft SQL Server database, and the slides are hosted within Pathomation’s PMA.core.

A few weeks ago, I would have messed around with Microsoft Excel (or Open Office Calc), but I recently learned about the really cool “data merge” capabilities in Azure ML Studio.

So the plan consists of three steps: extract the sample information from SQL Server, retrieve the slide list from PMA.core, and join both datasets in Azure ML Studio.

Here’s the SQL Server query:

Which, after exporting to text, still needs a bit of tweaking (SQL Server apparently didn’t include the column names):

And here’s the PHP script that retrieves the hosted slides from PMA.core:

Next, I uploaded both outputs as new datasets into Azure ML Studio (make sure that you use the “,” (comma) as the separator character; Azure ML only operates on CSV files that are, true to their name, comma-separated):

And then I created an experiment that joins both datasets:

Based on a common Sample identifier:

The final step is to extract the new combined dataset:

Note that the final field is now a delimited list in its own right (but those were the customer requirements: a single CSV file with one row representing one sample, and a single field containing all of that sample’s (possibly many) slide references).
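The join-plus-collapse that the Studio experiment performs can be sketched in pandas as a groupby aggregation followed by a merge on the common sample identifier (all column names and values here are illustrative, not the real biobank data):

```python
import pandas as pd

# Sample metadata, as exported from SQL Server (illustrative).
samples = pd.DataFrame({
    "sample_id": ["S1", "S2"],
    "diagnosis": ["diabetes type 1", "control"],
})

# Slide list, as returned by the PMA.core script (illustrative).
slides = pd.DataFrame({
    "sample_id": ["S1", "S1", "S2"],
    "slide":     ["S1_a.mrxs", "S1_b.mrxs", "S2_a.mrxs"],
})

# Collapse all slides per sample into one ";"-delimited field,
# then join on the common sample identifier.
slide_lists = slides.groupby("sample_id")["slide"].agg(";".join).reset_index()
merged = samples.merge(slide_lists, on="sample_id", how="left")
```

The `how="left"` keeps samples that have no slides yet, which is usually what you want in a test set like this.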

And there you have it. Admittedly, this is not the fanciest data science project ever done, but it still shows how you can very elegantly solve everyday problems. Problems that only a few weeks ago would have been much more complicated for me to solve (what scripting language to use? a database or a spreadsheet? etc.).

By the way, it is not my intent to tout Microsoft technology in this blog. I could have looked for another track. I could have gone for Datacamp as an alternative. But my problem with Datacamp is that it seems to have its starting point somewhat wrong: it’s very programming-language focused. You either decide you want to do data science with R or with Python, and that’s it. And once you’re in that track, you’re stuck with the libraries they’re offering you on that platform in that language. Rather myopic, if you ask me. Plus, I’ve never much liked R, and I don’t know much about Python besides the few scripts that I had the developers at Pathomation write for me. I know one thing: throughout my career I’ve written code in any number of programming languages, and I find it best to pick the language based on the problem you’re trying to solve, not the other way around. But I digress…

Regardless, I’m really excited about Azure ML Studio. It’s not only a way to do machine learning; it also appears to be a versatile toolkit for data handling.

So, do you agree with my approach? Or did I overcomplicate things? Would you have done this differently? Leave a comment below.