How to test and benchmark a tile server?

As mentioned before, I’m the Chief Technology Officer of Pathomation. Pathomation offers a platform of software components for digital pathology. We have a YouTube video that explains the whole thing.

You can try a local, desktop-bound (some say “chained”) version of our software. People tell us our performance is pretty good, which is always nice to hear. The problem is: can we objectively “prove” that we’re fast, too?

The core component of our suite is aptly called “PMA.core” (we’re developers, not creative namegivers, obviously). Conceptually, PMA.core is a slide tile server. Simply put, a tile server serves up data in regularised, square-shaped portions called tiles. In the case of PMA.core, tiles are extracted in real time on an as-needed basis from selected whole slide images.

So how do you then test tile extraction performance?

At present, I can see three different ways:

  1. On a systematic basis, going through all hypothetical tiles one by one, averaging the time it takes to render each one.
  2. On a random basis
  3. Based on a historical trail of already heavily viewed images.

Each of these methods has its pros and cons, and which one to pick depends on what kind of property of the tile server you want to test in the first place.

Systematic testing

The pseudo-code for this one is straightforward:

For x in (0..max_number_of_horizontal_tiles):
  For y in (0..max_number_of_vertical_tiles):
    Extract tile at position (x, y)

However, we’re talking about whole slide image files here, which have more than just horizontal and vertical dimensions. Images are organized as a hierarchical, pyramid-structured stack, and can also contain z-levels, fluorescent layers, or even timelapse data. So the complete loop for systematic testing goes more like this:

For t in (0..max_timeframes):
  For z in (0..max_z_stacks):
    For l in (0..max_zoomlevels):
      For c in (0..max_channels):
        For x in (0..max_number_of_horizontal_tiles):
          For y in (0..max_number_of_vertical_tiles):
            Extract tile at timeframe t, z-stack z, zoomlevel l, channel c, position (x, y)

But that’s just nested looping; nothing fancy about this, really. We’ve been using this method of testing for pretty much as long as we can remember, and even wrapped our own internal tool around it, (again very aptly) called the profiler.
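For the curious, here is roughly what that nested loop could look like in Python. This is a minimal sketch and not our actual profiler: `extract_tile` is a hypothetical callable standing in for a real tile request, and I'm assuming that zoomlevel 0 is a single tile and that each deeper level doubles the grid in both directions.

```python
import itertools
import time
from statistics import mean


def systematic_benchmark(extract_tile, timeframes, z_stacks, zoomlevels, channels):
    """Time every tile once; return the list of per-tile durations in seconds."""
    durations = []
    for t, z, level, c in itertools.product(
        range(timeframes), range(z_stacks), range(zoomlevels), range(channels)
    ):
        # Assumption: 2**level x 2**level tiles at each zoomlevel.
        side = 2 ** level
        for x, y in itertools.product(range(side), range(side)):
            start = time.perf_counter()
            extract_tile(t, z, level, c, x, y)
            durations.append(time.perf_counter() - start)
    return durations


def fake_extract_tile(t, z, level, c, x, y):
    """Hypothetical stand-in for a real tile request."""
    return bytes(16)


timings = systematic_benchmark(
    fake_extract_tile, timeframes=1, z_stacks=2, zoomlevels=3, channels=1
)
print(f"{len(timings)} tiles, mean {mean(timings) * 1000:.4f} ms per tile")
```

Swapping `fake_extract_tile` for a function that performs a real request is all it takes to point this at an actual server.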

What’s good about this systematic tile test extraction method?

  • Easy to understand
  • Complete coverage; gives an accurate impression of what effort is needed to reconstruct the entire slide
  • Comparison between file formats (as long as they have similar zoomlevels, z-stacks, channels etc.) allows for benchmarking

What’s bad about this extraction method?

  • It’s unrealistic. Users never navigate through a slide tile by tile.
  • Considering the ratio of the data being extracted from different dimensions that can occur in a slide, you end up over-sampling some dimensions, while under-sampling others. Again this results in a number that, while accurate, is purely hypothetical, and doesn’t do a good job at illustrating the end-user’s experience.
  • In reality, end-users are only presented with a small percentage of the complete “universe” of tiles present in a slide. Ironically, the least interesting tiles will take the smallest amount of effort to send back (especially in terms of bandwidth, like “blank” tiles containing mostly whitespace on a slide or lumens within a specimen etc.)
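To get a feel for the over-sampling problem, a quick back-of-the-envelope calculation helps. The sketch below assumes a single tile at zoomlevel 0 and four times more tiles at each deeper level; real slides will deviate from this idealized pyramid.

```python
def tiles_per_level(zoomlevels):
    # Assumption: one tile at level 0, four times more at each deeper level.
    return [4 ** level for level in range(zoomlevels)]


counts = tiles_per_level(8)
total = sum(counts)
for level, count in enumerate(counts):
    print(f"level {level}: {count:6d} tiles ({count / total:6.2%} of all tiles)")
```

Under these assumptions, about three quarters of all tiles sit in the deepest zoomlevel, so a systematic run is dominated by tiles from that one level.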

Random testing

In random testing, we extract a predetermined number of tiles (either a fixed number or a percentage of the total number of available tiles). The pseudo-code is as follows:

Let n = predetermined number of (random) tiles that we want to extract
For i in (0..n):
  Let t = random (0..max_timeframes)
  Let z = random (0..max_z_stacks)
  Let l = random (0..max_zoomlevels)
  Let c = random (0..max_channels)
  Let x = random (0..max_number_of_horizontal_tiles)
  Let y = random (0..max_number_of_vertical_tiles)
  Extract tile at timeframe t, z-stack z, zoomlevel l, channel c, position (x, y)

The same statistics can be reported back as with systematic testing, in addition to some coverage parameters (based on what percentage of the total tiles was retrieved).
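A minimal Python sketch of the random approach, including a simple coverage figure, could look like this. As before, the pyramid layout (2**level by 2**level tiles per zoomlevel) is my assumption, and the function names are made up for illustration.

```python
import random


def random_tile_sample(n, timeframes, z_stacks, zoomlevels, channels, seed=None):
    """Draw n random tile coordinates, following the pseudo-code above."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        t = rng.randrange(timeframes)
        z = rng.randrange(z_stacks)
        level = rng.randrange(zoomlevels)
        c = rng.randrange(channels)
        side = 2 ** level  # assumption: 2**level x 2**level tiles per level
        sample.append((t, z, level, c, rng.randrange(side), rng.randrange(side)))
    return sample


def coverage(sample, timeframes, z_stacks, zoomlevels, channels):
    """Fraction of all available tiles that was hit at least once."""
    per_plane = sum(4 ** level for level in range(zoomlevels))
    total = timeframes * z_stacks * channels * per_plane
    return len(set(sample)) / total


sample = random_tile_sample(500, 1, 1, 6, 3, seed=42)
print(f"coverage: {coverage(sample, 1, 1, 6, 3):.1%}")
```

Note that the zoomlevel is drawn first, so every level gets roughly the same number of requests regardless of how many tiles it holds.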

Let’s look at some of the pros and cons of this one.

Here are the pros:

  • Faster than systematic sampling (see also the “one in ten rule” commonly used in statistics).
  • For deeper zoomlevels that have sufficient data, a more homogeneous sampling can be performed (whereas systematic sampling can oversample the deeper zoomlevels, as each deeper zoomlevel contains 4 times more tiles).
  • Certain features in the underlying file format (such as storing neighboring tiles close together) that may unjustly boost the results in systematic sampling are less likely to affect results here.

What about the cons?

  • Smaller coverage may require bootstrapping to get satisfying aggregate results.
  • Random sampling is still unrealistic. Neighboring tiles have less chance of being selected in sequence, while in reality of course any field of view presented to an end-user is composed of neighboring tiles.
  • Less reliable to compare one file format to the next, as this may again require bootstrapping.
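Since bootstrapping comes up twice in that list, here is a small sketch of what I mean by it: resample the measured timings with replacement many times and look at the spread of the resulting means. The timings below are made up, and the percentile interval is just one simple variant of the technique.

```python
import random
from statistics import mean


def bootstrap_mean_ci(timings, resamples=2000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for the mean tile time."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(timings, k=len(timings))) for _ in range(resamples)
    )
    lower = means[int((alpha / 2) * resamples)]
    upper = means[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper


# Made-up timings in milliseconds, purely for illustration.
timings_ms = [12.1, 9.8, 15.3, 11.0, 40.2, 10.4, 13.7, 9.9, 12.8, 11.5]
low, high = bootstrap_mean_ci(timings_ms, seed=1)
print(f"mean {mean(timings_ms):.1f} ms, 95% CI [{low:.1f}, {high:.1f}] ms")
```

If the intervals for two file formats overlap heavily, the random sample was probably too small to tell them apart.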

Historic re-sampling

A third method can be devised based on historic trace information for one particular file. A file that’s included in a teaching collection and that’s been online for a while has been viewed by hundreds or even thousands of users. We found in some of our longer-running projects that students under such conditions are typically presented with the same tiles over and over again. This means that for a given slide that is explored fairly often, we can reconstruct the order in which the tiles for that particular slide are served to the end-user, and that trace can be replicated in a testing scenario.

In terms of replication, this is then the most accurate way of testing. Apart from that, other advantages exist:

  • This is the best way to measure performance differences across different types of storage media. If for some reason a particular storage medium introduces a performance penalty because of its properties, this is the only reliable way to determine whether that penalty actually matters for whole slide image viewing.
  • For large enough numbers (entries in the historical tracing logs), a “natural” mixture of different tiles in different zoomlevels, channels, and z-stacks will be present. This sequence of tiles presented in the trace history automatically reflects how real users navigate a slide.

However, this method, too, has its flaws:

  • This type of testing and measuring cannot be used until a slide has actually been online for a certain time period and browsed by a large number of end-users.
  • Test results may be affected by the type of user that navigates the slide: we shouldn’t compare historical information about a slide browsed by seasoned pathologists with how novice med school students navigate a different slide. Apples and oranges, you know.
  • Because each slide has its own trace, it becomes really hard to compare performance between different file formats.
  • Setting up this type of test requires, of course, historical trace information. This means that this test is the most time-consuming to set up: IIS logfiles have to be parsed, tile requests have to be singled out, matched to the right whole slide image etc.
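To give an idea of the log-parsing step, here is a rough sketch that pulls tile requests out of a W3C-style logfile by reading the `#Fields:` header. Everything about the tile URL is invented for the example: the `/tile` path and the `slide`, `level`, `x` and `y` parameters are assumptions, not PMA.core's actual API, and the log lines are made up.

```python
from urllib.parse import parse_qs


def tile_requests(log_lines, tile_stem="/tile"):
    """Yield (slide, level, x, y) for every tile request in a W3C-style log."""
    fields = []
    for line in log_lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for the lines that follow
            continue
        if line.startswith("#") or not line.strip() or not fields:
            continue
        entry = dict(zip(fields, line.split()))
        if entry.get("cs-uri-stem") != tile_stem:
            continue
        query = parse_qs(entry.get("cs-uri-query", ""))
        try:
            yield (query["slide"][0], int(query["level"][0]),
                   int(query["x"][0]), int(query["y"][0]))
        except (KeyError, ValueError):
            continue  # not a well-formed tile request


# Made-up log excerpt with a hypothetical tile URL scheme.
log = """#Fields: date time cs-method cs-uri-stem cs-uri-query sc-status
2018-01-05 10:00:01 GET /tile slide=demo.svs&level=3&x=4&y=2 200
2018-01-05 10:00:01 GET /index.html - 200
2018-01-05 10:00:02 GET /tile slide=demo.svs&level=3&x=5&y=2 200
""".splitlines()

trace = list(tile_requests(log))
print(trace)
```

The resulting list keeps the original request order, so it can be replayed against a test server one tile at a time.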

Preliminary conclusions

This section came out of discussing the various strategies with Angelos Pappas, one of our software engineers.

The current profiler that we use was built to do the following:

  1. Compare the performance impact of code modifications in PMA.core. For example, by changing a parser class, or by modifying the flow in the core rendering system. We needed a way to compare versions relative to one another.
  2. Compare the performance when rendering different slide formats. To do this, you need similar slides (dimensions, encoding method and of course pixel contents), stored in different formats. The “CMU-{N}” slides from OpenSlide are a good case, as well as the ones we bring back ourselves from various digital pathology events. This again, allows us to do relative comparisons that will give us hints about why a format is slower than another. Is it our parser that needs improvement? Is it the nature of the format? etc.
  3. Compare the performance of different storage sources, like local storage versus SMB.

The profiler does all of the above nicely and it’s the only way we have to do such measurements. And even though the profiler supports a “random” mode, we hardly ever use it. Pathomation test engineers usually let the profiler run up to a specific percentage or for a specific period and compare the results.

Eventually, what you want to accomplish with all this is to get objective measurements of the user experience. The profiler wasn’t really meant to measure how good the user experience will be. This is a much more complicated matter, as it involves patterns that are very hard to emulate, network issues, etc. For example, if a user zooms into a region, the browser fires simultaneous requests for neighboring tiles. If you ever want to do this kind of measurement, perhaps your best bet would be to do it by commanding a browser. Again though, your measurements would give you a relative comparison.
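Short of commanding a real browser, the simultaneous requests for one field of view can at least be approximated with a thread pool. This is only a sketch: `fake_fetch_tile` is a hypothetical stand-in that sleeps instead of calling a server, and the limit of six workers is my assumption for how many requests are in flight at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_field_of_view(fetch_tile, level, x0, y0, columns, rows, max_workers=6):
    """Request all tiles of one field of view concurrently.

    Returns the elapsed wall-clock time in seconds and the fetched tiles.
    """
    coords = [
        (level, x, y)
        for y in range(y0, y0 + rows)
        for x in range(x0, x0 + columns)
    ]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tiles = list(pool.map(lambda c: fetch_tile(*c), coords))
    return time.perf_counter() - start, tiles


def fake_fetch_tile(level, x, y):
    time.sleep(0.01)  # pretend the server needs 10 ms per tile
    return (level, x, y)


elapsed, tiles = fetch_field_of_view(
    fake_fetch_tile, level=5, x0=10, y0=7, columns=4, rows=3
)
print(f"{len(tiles)} tiles in {elapsed * 1000:.0f} ms")
```

What gets measured here is the time until the whole field of view is complete, which is closer to what a user perceives than the average time per tile.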

Backup and reboot

A new year starts with good intentions. For me, the new year 2018 in part started with the realization that it’s been forever since I’ve written anything on the RealData blog.

I like writing. I like passing on knowledge. So the most pertinent question to ask then perhaps is: why did I stop?

Let’s start with the obvious one: lack of time. Getting content out in the right format is hard. It’s one thing to jot down some notes in a diary (a Jupyter notebook perhaps); it’s quite another to deliver a publishable, readable blog entry.

And after all: what’s the point, right? Who reads this anyway? You? Why? Do I even have anything to tell you?

I think my journey is still worth sharing. So let’s look at some of the things that went wrong and contributed to my preliminary failure:

I slacked off on the online courses I was planning on taking back in April 2017. Perhaps even worse: I wasn’t passing anymore. At least not as easily anymore as I did in the beginning. Why was that?

I’m now convinced that taking online courses is an art in itself. It’s really easy to go to Coursera, or Datacamp, or any other platform with the intention of “Let’s take every data science course there is and become a data scientist”. They all look so interesting, right?

I remember passing the Datacamp course Introduction to Python for data science (aka “Python for beginners”) in two days, with a 99% final score.

Woot! I know Python (or so I thought)

I started struggling really badly when taking the advanced Programming with Python for data science course. Again: why? Because anyone who’s been programming for over 25 years can probably pass any beginning programming course in any language. I was forgetting that I’m pretty proficient in C# today because I have been using it ever since the very first European .Net conference in Copenhagen, Denmark in 2001.

In order to become a proficient Python programmer, I have to start using the language myself in daily tasks. An obvious one, I’m afraid, but saying it is easier than doing it. There’s a level of humility attached to this: it’s hard for me to spend half an hour trying to figure out how to process something as trivial as a text file in Python, when I *know* I can write the same script in PHP in 5 minutes. Hey, I can probably do it with a console application in C#, too!

A few years ago, I was mentoring someone into software development. The person was highly educated, had some scripting experience with Excel VBA, and seemed ready for the next level. At one point I was looking something up for her on StackOverflow when she commented “oh, so programming is just copy/paste, right?”. Well, yes and no. Mostly no, of course. But websites like StackOverflow can make it look that way. And sometimes we fool ourselves into thinking that it really has become that easy. It hasn’t.

At around the same time that I was flunking my advanced Python curriculum, software testing was rapidly becoming a top priority at Pathomation. Huzzah, edX also appeared to have a “micromasters” track in software testing. This track consisted of only 3 courses, which seemed more manageable than the entire data science track.

I took the first course and passed. I should point out that these are academically organized courses, considered to be graduate level, and passing them is actually somewhat tough: you have to get an 80% in order to get the certificate.

Unfortunately, the testing curriculum went the same way as the data science curriculum I signed up for. I passed the first course, but slacked off halfway through the second. There’s only so much coursework your brain can process in any given time period, too.

Half full or half empty? I passed the first course, but never got around to taking the other two

They say that good intentions have a higher chance of being realized when you write them down. So, here’s my intention for 2018: I’ve backed up a little and considered why I failed in my attempts last year.

So let’s take a reboot now and set out to complete the following courses and education tracks in 2018:

Curious to see if I’ll make it this time? Keep following this blog, then!


This is not going to make me rich

I finished the DAT203.1 data science essentials course at edX and went through the whole process from uploading data to publishing a predictive model via a web-service. Pretty cool.

But the textbook examples always work, right?

Ever had a crazy idea? What if you could use this technology to predict the stock market?

So let’s see, what do I want?

I want to be able to sleep at night!

Because really, do you want to risk even 10K $ on a black-box algorithm that you cooked up one late evening after taking an introductory course in data science? Not quite.

So, I want to be in the market only when the market is open. Ideally, I’d like to be able to predict whether a stock is going to finish higher or lower than its open.

And let’s not get too carried away. A prediction of up/down will do for now. I’ll be happy with a classifier that tells me whether the stock is going to finish in the black, or in the red, at the end of the day.

Ooooh, I could even use 4X margin! So even if the stock only moves 0.05%, on a margin account that would still translate to a 0.2% profit. Let’s assume there are roughly 250 trading days in one year… I’m betting on a minimum return of roughly 65%. On my 10K principal, that’s 6.5K to take the kids to Disney at least twice, I figure.
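In case you wonder where the 65% comes from: it's what you get when the 0.2% daily gain is reinvested every single trading day. A quick sketch of the arithmetic follows, with the heroic assumptions that every day is a winner and that margin interest and trading costs don't exist.

```python
daily_move = 0.0005      # the stock moves 0.05% intraday
margin = 4               # 4X margin
trading_days = 250
principal = 10_000

daily_return = daily_move * margin                    # 0.2% per day
simple = daily_return * trading_days                  # no reinvestment
compounded = (1 + daily_return) ** trading_days - 1   # gains reinvested daily

print(f"simple: {simple:.1%}, compounded: {compounded:.1%}")
print(f"compounded profit on principal: {principal * compounded:,.0f}")
```

Without reinvesting, the same numbers only add up to 50%, so the compounding does a fair share of the work.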

So far the hypotheticals. And boy was I wrong about those hypotheticals.

But let’s go through the experiment anyway.

I started by getting the historical daily returns of the SPY S&P500 ETF tracker. From there, I computed the daily intraday return (close – open) / open.

Because I’m merely interested in a binary up or down prediction though, another column JWhite was added that’s true when the intraday return was up, and false when the intraday return was down.
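In plain Python, those two derived columns boil down to something like the sketch below. The prices are made up, and treating a perfectly flat day as "not up" is my own choice for the example.

```python
def add_intraday_columns(rows):
    """Add the intraday return and the JWhite up/down label to each daily bar."""
    for row in rows:
        row["intraday_return"] = (row["close"] - row["open"]) / row["open"]
        # Assumption: a flat day (return of exactly 0) counts as "not up".
        row["JWhite"] = row["intraday_return"] > 0
    return rows


# Made-up prices, for illustration only.
bars = [
    {"date": "day 1", "open": 100.0, "close": 101.0},
    {"date": "day 2", "open": 101.5, "close": 100.5},
]
for bar in add_intraday_columns(bars):
    print(bar["date"], f"{bar['intraday_return']:+.2%}", bar["JWhite"])
```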

Add a couple of technical indicators of the previous days leading up to the session, and we’re good to go.

Oh, wait, about those indicators… I remember Bollinger writing something about it being better to rank indicator values rather than work with their absolute values. So I converted all absolute technical indicators to their 50-day ranked values.
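There are several ways to turn an indicator into a ranked value. The sketch below shows one plausible reading, which is an assumption on my part: each value is ranked within its own trailing window (50 days by default) and scaled to the 0..1 range. The indicator values are made up, and a window of 3 is used to keep the example short.

```python
def rolling_rank(values, window=50):
    """Rank each value within its trailing window, scaled to 0..1.

    Returns None for positions where a full window is not yet available.
    """
    ranks = []
    for i, current in enumerate(values):
        if i + 1 < window:
            ranks.append(None)
            continue
        recent = values[i + 1 - window : i + 1]
        below = sum(1 for v in recent if v < current)
        ranks.append(below / (window - 1))
    return ranks


indicator = [3.0, 1.0, 4.0, 1.5, 5.0, 2.0]  # made-up indicator values
print(rolling_rank(indicator, window=3))
```

A value of 1.0 means "highest of the window", 0.0 means "lowest", which makes indicators with very different scales comparable.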

Eventually, my dataset looks like this:

So now I can build my experiment to build and evaluate a model to predict JWhite, and what do you expect to see?

Ehm, yeah, right… I got a perfect prediction from the first attempt?

Can anybody tell me what I did wrong?

The problem is actually that I have the perfect predictor to JWhite within my dataset. I included both the intraday return, as well as the JWhite variable.
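The leak is easy to reproduce outside of Azure ML Studio. In this sketch (made-up rows, with `rsi_rank` standing in for any ranked technical indicator), a one-line "model" that only looks at the intraday return gets every label right, which is exactly why that column has to go.

```python
def leaky_predictor(row):
    # "Predicts" JWhite straight from the intraday return it was derived from.
    return row["intraday_return"] > 0


def drop_columns(rows, leaked=("intraday_return",)):
    """Return copies of the rows without the columns that leak the label."""
    return [{k: v for k, v in row.items() if k not in leaked} for row in rows]


# Made-up rows; "rsi_rank" is a hypothetical feature name.
rows = [
    {"rsi_rank": 0.8, "intraday_return": 0.004, "JWhite": True},
    {"rsi_rank": 0.3, "intraday_return": -0.002, "JWhite": False},
    {"rsi_rank": 0.6, "intraday_return": -0.001, "JWhite": False},
]
accuracy = sum(leaky_predictor(r) == r["JWhite"] for r in rows) / len(rows)
print(f"accuracy with the leaked column: {accuracy:.0%}")
print(drop_columns(rows)[0])
```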

So let’s add a feature selector to the experiment that gets rid of the daily returns:

What do we get now?

Hmm, a diagonal this time. Looks like I’m about as good as a coin toss. Pretty bad actually. This is truly about the worst outcome one can have, as a curve below the diagonal at least could be reversed to get some results.
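The remark about reversing a curve below the diagonal can be checked with a few lines of Python. The sketch computes the area under the ROC curve (AUC) as the probability that a random positive outscores a random negative. The data is synthetic: one set of scores is unrelated to the labels, the other is deliberately wrong all the time.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count for half. Equivalent to the area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


rng = random.Random(0)
labels = [rng.random() < 0.5 for _ in range(400)]
coin_toss = [rng.random() for _ in labels]  # scores unrelated to the labels
# A consistently wrong model: low scores for "up" days, high for "down" days.
bad_model = [
    0.2 + 0.1 * rng.random() if y else 0.7 + 0.1 * rng.random() for y in labels
]

print(f"coin toss AUC: {auc(labels, coin_toss):.2f}")
print(f"bad model AUC: {auc(labels, bad_model):.2f}")
print(f"bad model, scores reversed: {auc(labels, [1 - s for s in bad_model]):.2f}")
```

A model that is reliably wrong becomes reliably right once you flip its scores; a coin toss stays a coin toss no matter what you do to it.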

Well, it looks like this technology isn’t going to make me rich, but the exercise taught me a few more things about data science at least.

So did you ever try anything like this? What are your most hilarious failures in data science? Let’s get a conversation started in the comments below!

A first project

The edX track that I’m currently following is sponsored by Microsoft. No particular reason. There are many distance learning options out there, and the combination of MIT, Harvard, and Microsoft seemed like reasonable credentials.

But courses are only as good as the extent to which you can apply them. So let’s see if I actually learned something useful so far.

The start-up which I’m involved in, Pathomation, has developed software for digital microscopy (and by extension also pathology and histology). This software is used at the Free Brussels University (VUB).

As part of my job as the VUB’s digital pathology manager, we built an interactive web portal for their diabetes biobank.

Long story short: now that we have (some of the) data online, we want to go further and integrate with the back-end.

The thesis is simple: for workflow and sample tracking, the laboratory uses SLims (a Biobank Information Management System) by Genohm. It is on this end that Pathomation now also wants to offer slide visualization services.

In order to do that, Genohm has asked for a test-set. The test-set should contain sample information, in addition to what slides are associated with each sample. The sample information is contained within a Microsoft SQLServer database, and the slides are hosted within Pathomation’s PMA.core.

A few weeks ago, I would have messed around with Microsoft Excel (or Open Office Calc), but I recently learned about the really cool “data merge” capabilities in Azure ML Studio.

So the plan consists of three steps:

Here’s the SQLServer query:

Which after exporting to text still needs a bit of tweaking (SQLServer apparently didn’t include the column names):

And here’s the PHP script that retrieves the hosted slides from PMA.core:

Next, I uploaded both outputs as new datasets into Azure ML Studio (make sure that you use the “,” (comma) as a separator character; Azure ML only operates on CSV files that are (true to their name) COMMA separated):

And then I created an experiment that joins both datasets:

Based on a common Sample identifier:

The final step is to extract the new combined dataset:

Note that the final field is now a delimited list in its own right (but those were the customer requirements: a single CSV file with one row representing one sample, and a single field in which all (many) slide references were contained).
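For comparison, the same join can be sketched in a few lines of plain Python with the `csv` module. This is not what I actually ran: the column names (`SampleID`, `Tissue`, `Slide`), the semicolon used inside the slide field, and the data itself are all made up for the example, and samples without slides simply get an empty field.

```python
import csv
import io
from collections import defaultdict


def merge_samples_with_slides(samples_csv, slides_csv, key="SampleID",
                              list_separator=";"):
    """Join samples with their slides; all slide references end up in one field."""
    slides_by_sample = defaultdict(list)
    for row in csv.DictReader(io.StringIO(slides_csv)):
        slides_by_sample[row[key]].append(row["Slide"])

    out = io.StringIO()
    reader = csv.DictReader(io.StringIO(samples_csv))
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames + ["Slides"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # One row per sample, with every slide reference packed into one field.
        row["Slides"] = list_separator.join(slides_by_sample.get(row[key], []))
        writer.writerow(row)
    return out.getvalue()


# Made-up data; the column names are assumptions, not the real schema.
samples = "SampleID,Tissue\nS1,pancreas\nS2,pancreas\n"
slides = "SampleID,Slide\nS1,a.svs\nS1,b.svs\nS2,c.svs\n"
merged = merge_samples_with_slides(samples, slides)
print(merged)
```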

And there you have it. Admittedly this is not the fanciest data science project ever done, but it still shows how you can very elegantly solve everyday problems. Problems that only a few weeks ago would have been much more complicated for me to solve (What scripting language to use? Use a database or spreadsheet? Etc.)

By the way, it is not my intent to tout Microsoft technology in this blog. I could have looked for another track. I could have gone for Datacamp as an alternative. But my problem with Datacamp is that it seems to have its starting point somewhat wrong: it’s very programming language focused. You either decide you want to do data science with R, or with Python, and that’s it. And once you’re in that track, you’re stuck with the libraries they’re offering you on that platform in that language. Rather myopic, if you ask me. Plus I’ve never liked R, and don’t know much about Python besides the few scripts that I had the developers at Pathomation write for me. I know one thing: throughout my career I’ve written code in any number of programming languages, and I find it best to pick the language based on the problem you’re trying to solve, not the other way around. But I digress…

Regardless; I’m really excited about Azure ML Studio. It’s not only a way to do machine learning, but appears to be a versatile toolkit for data handling.

So, do you agree with my approach? Or did I overcomplicate things? Would you have done this differently? Leave a comment below.


Welcome to a new blog

How do you start a blog? By introducing yourself. So: My name is Yves Sucaet, and I’m the Chief Technology Officer of Pathomation. As the top tech guy, one enjoys certain liberties. One of those, is using corporate resources to host my own blog.

Why a blog? Why now? I’ve thought about it many times before certainly. Now seems a particularly good time however, as I find myself about to embark on a new journey: data science.

Why data science? Because it is hot? Because it is cool? Because it is hype? All of the above, probably. Yes, data science is undoubtedly a hot field, and as a software tech company’s CTO, it’s only my duty to go where the next big opportunities lie. Is data science cool? I’m not the one to say. Depends on your definition of cool, I guess. Is it a hype? Most definitely. But that need not be a bad thing necessarily. It merely means that there’s somewhat more fog out there than usual to cut through before you get to the core of a topic and understand it.

I like to think that I come into this unbiased. Also keep in mind that I cannot absorb all knowledge about data science at once; what you’ll read here will eventually be somewhat biased, as it will be influenced by the tools I pick up along the way (more about that in a subsequent post).

I had tried picking up the subject before, but honestly failed miserably. Because of the hype. Because of the clutter. Because of the learning curve of some of the tools out there. Because it’s not fun trying to wrestle yourself through a 300-or-so page book, only to find a list of 3000 other references you should REALLY read to get more background.

What’s changed? I think I’ve finally come across a platform that can help me on my way. I’ve typically been skeptical about online learning (University of Phoenix, Trump University…), but edX seemed to be one worth checking out. It’s backed by some major names in education, including Harvard and MIT. Combine that with a specific “track” (and not just random incoherent tutorials about the next great framework) to get you on your way, and I decided to give it a try.

One of the first things discussed in the course is reaching out. To talk about what you do and how you do it. So this blog is my attempt at contributing to the community, and at taking you with me on my path so perhaps we can learn together.

So what else is there to know besides the fact that I’m looking into data science to expand our company’s reach and renew my own personal skillset? I originally trained as a bioinformatician. I completed my graduate work at Iowa State University in 2010. As part of that curriculum, I took a machine learning course in 2007. That was fun, but hard, and tedious. I went through the statistics. I played with the Weka toolkit. As a project for the course, I remember working on k-mer pattern clustering in amino acid sequence strings. The algorithm worked, but one had to do such an extraordinary amount of work that I didn’t find the subject interesting enough to continue with. I understand that the people who did continue into this field eventually put their heads together and built the easily configurable back-ends that we have today. Throw in some HCI-people and you get user-friendly interfaces to control those back-ends, too.

Machine learning anno 2007.

Machine learning anno 2017.

Here’s the bottom line: I don’t come into this blindly. I don’t come into this naively either, I believe. I’ve certainly got baggage. Good baggage, I think. And I’m able and willing to travel. And I’m looking forward to meeting new people on my journeys. So leave a comment if you feel so.