A first project

The edX track that I’m currently following is sponsored by Microsoft. No particular reason. There are many distance learning options out there, and the combination of MIT, Harvard, and Microsoft seemed like reasonable credentials.

But a course is only as good as the extent to which you can apply it. So let’s see if I’ve actually learned something useful so far.

The start-up which I’m involved in, Pathomation, has developed software for digital microscopy (and by extension also pathology and histology). This software is used at the Free Brussels University (VUB).

As part of my job as the VUB’s digital pathology manager, we built an interactive web portal for their diabetes biobank.

Long story short: now that we have (some of the) data online, we want to go further and integrate with the back-end.

The premise is simple: for workflow and sample tracking, the laboratory uses SLims (a Biobank Information Management System) by Genohm. It is on this end that Pathomation now also wants to offer slide visualization services.

In order to do that, Genohm has asked for a test-set. The test-set should contain sample information, in addition to what slides are associated with each sample. The sample information is contained within a Microsoft SQLServer database, and the slides are hosted within Pathomation’s PMA.core.

A few weeks ago, I would have messed around with Microsoft Excel (or Open Office Calc), but I recently learned about the really cool “data merge” capabilities in Azure ML Studio.

So the plan consists of three steps:

Here’s the SQLServer query:

Which, after exporting to text, still needs a bit of tweaking (SQLServer apparently didn’t include the column names):

And here’s the PHP script that retrieves the hosted slides from PMA.core:

Next, I uploaded both outputs as new datasets into Azure ML Studio (make sure that you use the “,” (comma) as a separator character; Azure ML only operates on CSV files that are (true to their name) COMMA separated):
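For readers who would rather script the two clean-up chores (adding the missing column names and making sure the file really is comma separated), here is a minimal sketch in Python. It is not what I did for this post. The tab delimiter and the column names `sample_id` and `diagnosis` are assumptions made purely for illustration.

```python
import csv
import io


def to_comma_csv(raw_text, header, source_delimiter="\t"):
    """Re-write delimited text as comma-separated CSV with a header row.

    `header` and `source_delimiter` are illustrative assumptions; adapt
    them to whatever the real export looks like.
    """
    reader = csv.reader(io.StringIO(raw_text), delimiter=source_delimiter)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)  # the column names the export left out
    for row in reader:
        if row:  # skip blank lines
            writer.writerow(row)  # values containing commas get quoted
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical two-row export without a header line.
    exported = "S001\tType 1\nS002\tType 2, late onset\n"
    print(to_comma_csv(exported, ["sample_id", "diagnosis"]))
```

Going through the `csv` module instead of a plain search-and-replace means that a value which itself contains a comma is quoted rather than silently split into two columns.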

And then I created an experiment that joins both datasets:

Based on a common Sample identifier:

The final step is to extract the new combined dataset:

Note that the final field is now a delimited list in its own right (but those were the customer requirements: a single CSV file with one row representing one sample, and a single field in which all (many) slide references were contained).
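For comparison, the same kind of result (one row per sample, with all of its slides collapsed into a single delimited field) can be sketched outside Azure ML Studio in a few lines of Python. This is an illustration, not the experiment used in this post. The column names `sample_id` and `slide`, the `;` list separator, and the choice of an inner join are all assumptions.

```python
import csv
import io
from collections import defaultdict


def merge_samples_with_slides(samples_csv, slides_csv, key="sample_id",
                              slide_field="slide", list_separator=";"):
    """Join sample rows with their slides; emit one CSV row per sample.

    All column names and the separator are hypothetical placeholders.
    """
    # Group slide references by sample identifier.
    slides_by_sample = defaultdict(list)
    for row in csv.DictReader(io.StringIO(slides_csv)):
        slides_by_sample[row[key]].append(row[slide_field])

    sample_reader = csv.DictReader(io.StringIO(samples_csv))
    fieldnames = list(sample_reader.fieldnames) + ["slides"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in sample_reader:
        slides = slides_by_sample.get(row[key])
        if not slides:
            continue  # inner join: samples without slides are dropped
        row["slides"] = list_separator.join(slides)
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    samples = "sample_id,diagnosis\nS001,Type 1\nS002,Type 2\nS003,Control\n"
    slides = "sample_id,slide\nS001,slideA\nS001,slideB\nS002,slideC\n"
    print(merge_samples_with_slides(samples, slides))
```

Using a list separator other than the comma keeps the slide list readable as a single field. A comma would also work, since the `csv` writer would then quote the field.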

And there you have it. Admittedly this is not the fanciest data science project ever done, but it still shows how you can very elegantly solve everyday problems. Problems that only a few weeks ago would have been much more complicated for me to solve (What scripting language to use? Use a database or spreadsheet? Etc.)

By the way, it is not my intent to tout Microsoft technology in this blog. I could have looked for another track. I could have gone for Datacamp as an alternative. But my problem with Datacamp is that it seems to have its starting point somewhat wrong: it’s very programming language focused. You either decide you want to do data science with R, or with Python, and that’s it. And once you’re in that track, you’re stuck with the libraries they’re offering you on that platform in that language. Rather myopic, if you ask me. Plus I’ve never liked R, and don’t know much about Python besides the few scripts that I had the developers at Pathomation write for me. I know one thing: throughout my career I’ve written code in any number of programming languages, and I find it best to pick the language based on the problem you’re trying to solve, not the other way around. But I digress…

Regardless, I’m really excited about Azure ML Studio. It’s not only a way to do machine learning, but also appears to be a versatile toolkit for data handling.

So, do you agree with my approach? Or did I overcomplicate things? Would you have done this differently? Leave a comment below.

 

Welcome to a new blog

How do you start a blog? By introducing yourself. So: My name is Yves Sucaet, and I’m the Chief Technology Officer of Pathomation. As the top tech guy, one enjoys certain liberties. One of those is using corporate resources to host my own blog.

Why a blog? Why now? I’ve thought about it many times before certainly. Now seems a particularly good time however, as I find myself about to embark on a new journey: data science.

Why data science? Because it is hot? Because it is cool? Because it is hype? All of the above, probably. Yes, data science is undoubtedly a hot field, and as a software tech company’s CTO, it’s only my duty to go where the next big opportunities lie. Is data science cool? I’m not the one to say. Depends on your definition of cool, I guess. Is it hype? Most definitely. But that need not be a bad thing. It merely means that there’s somewhat more fog than usual to cut through before you get to the core of a topic and understand it.

I like to think that I come into this unbiased. Also keep in mind that I cannot absorb all knowledge about data science at once; what you’ll read here will eventually be somewhat biased, as it will be influenced by the tools I pick up along the way (more about that in a subsequent post).

I had tried picking up the subject before, but honestly failed miserably. Because of the hype. Because of the clutter. Because of the learning curve of some of the tools out there. Because it’s not fun trying to wrestle yourself through a 300-or-so page book, only to find a list of 3000 other references you should REALLY read to get more background.

What’s changed? I think I’ve finally come across a platform that can help me on my way. I’ve typically been skeptical about online learning (University of Phoenix, Trump University…), but edx.org seemed to be one worth checking out. It’s backed by some major names in education, including Harvard and MIT. Combine that with a specific “track” (and not just random incoherent tutorials about the next great framework) to get you on your way, and I decided to give it a try.

One of the first things discussed in the course is reaching out: talking about what you do and how you do it. So this blog is my attempt at contributing to the community, and at taking you with me on my path so perhaps we can learn together.

So what else is there to know besides the fact that I’m looking into data science to expand our company’s reach and renew my own personal skillset? I originally trained as a bioinformatician. I completed my graduate work at Iowa State University in 2010. As part of that curriculum, I took a machine learning course in 2007. That was fun, but hard, and tedious. I went through the statistics. I played with the Weka toolkit. As a project for the course, I remember working on k-mer pattern clustering in amino acid sequence strings. The algorithm worked, but one had to do such an extraordinary amount of work that I didn’t find the subject interesting enough to continue with. I understand that the people who did continue in this field eventually put their heads together and built the easily configurable back-ends that we have today. Throw in some HCI people and you get user-friendly interfaces to control those back-ends, too.

Machine learning anno 2007.

Machine learning anno 2017.

Here’s the bottom line: I don’t come into this blindly. I don’t come into this naively either, I believe. I’ve certainly got baggage. Good baggage, I think. And I’m able and willing to travel. And I’m looking forward to meeting new people on my journeys. So leave a comment if you feel so inclined.