The best kind of criticism

Nothing like honest feedback. I was so happy we finally got something cooking in Python that I sent it out to a couple of people, asking (hoping) for immediate feedback. I failed to provide the full context of PMA.python, though. Therefore, here’s one response that I got back:

So there’s obviously a couple of assumptions (leading to misconceptions) about PMA.python that are really easy to make, specifically:

1) It’s a universal library for whole slide imaging; moreover it’s a self-contained library that doesn’t depend on anything else besides Python itself
2) PMA.start is a dependency, which is hard to install and set up.
3) The most important PMA.python dependency, PMA.start, runs on Windows as well as Linux (or Mac)

Lack of information leads to assumptions. Assumptions lead to misconceptions. Misconceptions lead to disappointment. Disappointment leads to anger. Anger leads to hate. Hate leads to suffering.

Aaargh, we single-handedly just tipped the balance of power in the Universe!

But on a more serious note:

I hope we successfully prevented people from bumping into 1) by providing additional verbiage in our package description on PyPI:

As for 2) we don’t think this is the case: Downloading a .exe-file and running it is about as easy as we can make it for you. Due to the nature of the software however, the user that installs the software will need administrative access to his or her own computer.

Finally, there’s 3). We know and acknowledge the existence of the Linux community. Unfortunately, porting PMA.start to Linux is not a trivial thing to do. And so, for now, we don’t. We don’t have the resources for that, and we’d rather offer stable software for a single OS than a mediocre product that nevertheless runs on both. If you’re tied to Linux and require WSI processing with Python, you’ll have to stick with OpenSlide for now (which also requires a separate library to connect to; there simply isn’t a native WSI library for Python).

By the way: the argument actually only partially applies. We offer a commercial version of PMA.core that is not limited to just listening to incoming requests on localhost. PMA.python can then be installed in a Linux datacenter and be used to communicate with the PMA.core instance that hosts the digital slide / tile data. In fact, this is exactly the approach that is being used at CellCarta, Pathomation’s parent company.

In closing

That being said, we really appreciated the honesty of the poster. The point is: while it’s great to receive positive praise (PMA.start now has about 150 users per month; these can’t all be wrong), it’s the negative criticism that helps you become better and prevents you from becoming complacent.

May 12, 2018 / June 13, 2018

Update on testing

A performance conundrum

In a previous blog post, we showed how our own software compares to the legacy OpenSlide library. We explained our testing strategy, and did a statistical analysis of the file formats. We analyzed the obtained data for two out of five common WSI file formats.

We subsequently ran the tests for three more formats, but ended up with somewhat of a conundrum. Below are the box and whisker plots for the Leica SCN file format, as well as the 3DHistech MRXS file format. The chart suggests OpenSlide is far superior.

The differences are clear, but at the same time so pronounced that they were unlikely to be due to the implementation alone, especially in light of our earlier results. Case in point: the SCN file format differs little from the SVS format (both are adapted and renamed variants of the generic TIFF format).

So we started looking at a higher level at the SCN file that we were using for our analysis, showing the slide’s meta-data in PMA.core 1, PMA.core 1 + OpenSlide, and PMA.core 2. As PMA.core 1 and PMA.core 2 show similar performance, we only include PMA.core 2 (on the right) in the screenshot below.

What happens is that PMA.core 2 only identifies the tissue on the glass, while OpenSlide not only identifies the tissue, but also builds a virtual whitespace area around the slide. The representation of the slide in OpenSlide is therefore much larger than the representation of the slide in PMA.core (even though we’re talking about the same slide). The pixel space with OpenSlide is almost 6 times bigger (5.76 times to be exact), but all of that extra space consists of empty tiles.

Now when we’re doing random sampling of tiles, we select from the entire pixel space. So, by chance, when selecting random tiles from the OpenSlide rendering engine, we’re much more likely to end up with an empty tile than with PMA.core. Tiles get sent back to the client as JPEG files, and blank tiles can be compressed and reduced in size much more than actual tissue tiles. We looked at the histogram and cumulative frequency distribution of pixel values in our latest post.
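To make the sampling bias concrete, here is a small idealised simulation. It is our own sketch and not part of the original test suite: it assumes a hypothetical block of 100 x 100 tissue tiles, once on its own and once embedded in a 240 x 240 grid, which is exactly 5.76 times larger. The function name and the grid sizes are illustrative assumptions.

```python
import random

def tissue_hit_rate(grid_w, grid_h, tissue_box, trials=20000, seed=0):
    """Estimate how often a uniformly sampled tile falls inside the tissue area."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    x0, y0, x1, y1 = tissue_box        # tissue spans [x0, x1) x [y0, y1) in tile units
    hits = 0
    for _ in range(trials):
        x = rng.randrange(grid_w)      # uniform over the whole pixel space
        y = rng.randrange(grid_h)
        if x0 <= x < x1 and y0 <= y < y1:
            hits += 1
    return hits / trials

# Grid that covers only the tissue (an idealised "tight" representation)
tight = tissue_hit_rate(100, 100, (0, 0, 100, 100))
# Same tissue block surrounded by virtual whitespace: 240 * 240 = 5.76 * 100 * 100
padded = tissue_hit_rate(240, 240, (70, 70, 170, 170))

print("tight hit rate :", tight)
print("padded hit rate:", round(padded, 3), "theory:", round(1 / 5.76, 3))
```

In this toy setup only about one in six random requests lands on tissue once the whitespace border is added. Real slides also contain blank tiles inside the tight representation, so actual hit rates will be lower on both sides.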

The same logic applies to the MRXS format: A side by side comparison also reveals that the pixel space in the OpenSlide build is much larger than in PMA.core.

The increased chance of ending up with a blank tile instead of real tissue doesn’t just explain the difference in mean retrieval times, it also explains the difference in variance that can be seen in the box and whisker plots.

Is_tissue() to the rescue

But wait, we wrote that is_tissue() function earlier, which lets us detect whether an obtained tile is blank or has tissue in it, right? Let’s use that one to see if a tile is whitespace or tissue. If it is tissue, we count the result in our statistics; if it’s not, we leave it out, and continue until we do hit a tissue tile that is worthy of inclusion.

To recap, here’s the code to our is_tissue() function (which we can’t condone for general use, but it’s good enough to get the job done here):

def is_tissue(tile):
    pixels = np.array(tile).flatten()        # np refers to numpy
    mean_threshold = np.mean(pixels) < 192   # 192 is 75% of the 0-255 range
    std_threshold = np.std(pixels) < 75
    return mean_threshold and std_threshold

And our code for tile retrieval becomes like this:

import datetime
from random import randint

# `pma` is the pma_python module as imported by the surrounding test script
def get_tiles(server, user, pwd, slide, num_trials = 100):
    session_id = pma.connect(server, user, pwd)
    print("Obtained Session ID: ", session_id)
    info = pma.get_slide_info(slide, session_id)
    timepoints = []
    zl = pma.get_max_zoomlevel(slide, session_id)
    print("\t+getting random tiles from zoomlevel ", zl)
    (xtiles, ytiles, _) = pma.get_zoomlevels_dict(slide, session_id)[zl]
    total_tiles_requested = 0
    # keep going until we have exactly num_trials tissue tiles
    while len(timepoints) < num_trials:
        # randint() includes both endpoints, so stay within 0 .. tiles - 1
        xtile = randint(0, xtiles - 1)
        ytile = randint(0, ytiles - 1)
        print("\t+getting tile ", len(timepoints), ":", xtile, ytile)
        s_time = datetime.datetime.now()    # start_time
        tile = pma.get_tile(slide, xtile, ytile, zl, session_id)
        total_tiles_requested = total_tiles_requested + 1
        if is_tissue(tile):                 # blank tiles are requested but not counted
            timepoints.append((datetime.datetime.now() - s_time).total_seconds())
    pma.disconnect(session_id)
    print(total_tiles_requested, " number of total tiles requested")
    return timepoints

Note that at the end of our test function, we print out how many tiles we actually had to retrieve from each slide in order to reach a sample size that matches the num_trials parameter.

Running this on the MRXS sample both in our OpenSlide build and PMA.core 2 confirms our hypothesis about the increase in pixel space (and a bias for empty tiles). In our OpenSlide build, we have to weed through 4200 tiles to get a sample of 200 tissue tiles; in PMA.core 2 we “only” require 978 iterations.

Plotting the results as a box and whisker chart, we now get:

We turn to statistics once again to support our hypothesis:

We can do the same for SCN now. First, let’s see if we can indeed use the is_tissue() function on the slide as a filter:

This looks good, and we run the same script as earlier. We notice the same discrepancy of overrepresented blank tiles in the SCN file when using OpenSlide. In our OpenSlide build, we have to retrieve 3111 tiles. Using our own implementation, only 563 tiles are needed.
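As a quick sanity check on these counts, the snippet below divides the number of tiles requested through the OpenSlide build by the number requested through PMA.core 2. It only uses the figures quoted above; the dictionary layout and function name are our own illustrative choices.

```python
# Tiles requested to collect the same number of tissue tiles (figures quoted above)
requests = {
    "MRXS": {"openslide": 4200, "pma_core_2": 978},
    "SCN": {"openslide": 3111, "pma_core_2": 563},
}

def request_ratio(counts):
    """How many times more tiles the OpenSlide build had to fetch than PMA.core 2."""
    return counts["openslide"] / counts["pma_core_2"]

ratios = {fmt: round(request_ratio(c), 2) for fmt, c in requests.items()}
print(ratios)   # {'MRXS': 4.29, 'SCN': 5.53}
```

For SCN, the ratio of roughly 5.5 is in the same ballpark as the 5.76 times larger pixel space we observed for that slide. An exact match is not to be expected, since the tiles are drawn at random and the PMA.core representation contains some blank tiles of its own.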

When we do a box and whisker plot, we confirm the superior performance of PMA.core compared to OpenSlide:

The statistics further confirm the observations of the box and whisker plot (p-value of 0.004 when examining the difference in means):

What does it all mean?

We generated all of the above and previous statistics and test suites to confirm that our technology is indeed faster than the legacy OpenSlide library. We did this on three different versions of PMA.core: version 1.2, a custom build of version 1.2 using OpenSlide where possible, and PMA.core 2.0 (currently under development). We can summarize our findings like this:

Note that we had to modify our testing strategy slightly for MRXS and SCN. The table doesn’t really suggest that it takes twice as long to read an SCN file compared to an SVSlide file. The discrepancies between the different file formats are the result of selecting either only tissue tiles (MRXS, SCN), or a random mixture of both tissue and blank tiles.

When requesting tiles in a sequential manner (we’ll come back to parallel data retrieval in the future), the big takeaway number is that PMA.core is on average 11% faster than OpenSlide.

May 9, 2018 / July 27, 2018

Dude, where’s my tissue?

Whitespace and tissue

Let’s say that you have a great algorithm to distinguish between PDL1- and CD8-stained cells.

Your first challenge when running this on real data is to find the actual tiles that contain tissue, rather than just whitespace. This is because your algorithm might be computationally expensive, and therefore you want to run it only on parts of the slide where it makes sense to do the analysis. In other words: you want to eliminate the whitespace.

At our PMA.start website, you can find OpenCV code that automatically outlines an area of tissue on a slide, but in this post we want to tackle things at a somewhat more fundamental level.

Let’s see if we can figure out a mechanism ourselves to distinguish tissue from whitespace.

First, we need to do some empirical exploration. So let’s start by downloading a reference slide from OpenSlide and request its thumbnail through PMA.start:

from pma_python import core
slide = "C:/my_slides/CMU-1.svs"
core.get_thumbnail_image(slide).show()

As we can see, there’s plenty of whitespace to go around. Let’s get a whitespace tile at position (0,0), and a tissue tile at position (max_x / 4, max_y / 2). Intuitively you may want to go for (max_x/2, max_y/2), but that wouldn’t get us anywhere here. So pulling up the thumbnail first is important for this exercise, should you do this on a slide of your own (and your coordinates may vary from ours).

Displaying these two example tiles side by side goes like this:

from pma_python import core
slide = "C:/my_slides/CMU-1.svs"
max_zl = core.get_max_zoomlevel(slide)
dims = core.get_zoomlevels_dict(slide)[max_zl]

tile_0_0 = core.get_tile(slide, x=0, y=0, zoomlevel=max_zl)
tile_0_0.show()

tile_x_y = core.get_tile(slide, x=int(dims[0] / 4), y=int(dims[1] / 2), zoomlevel=max_zl)
tile_x_y.show()

And looks like this:

Now that we have the two distinctive tiles, let’s convert both to grayscale and add a histogram to our script for each:

from pma_python import core
import matplotlib.pyplot as plt
import numpy as np

slide = "C:/my_slides/CMU-1.svs"
max_zl = core.get_max_zoomlevel(slide)
dims = core.get_zoomlevels_dict(slide)[max_zl]

# obtain whitespace tile
tile_0_0 = core.get_tile(slide, x=0, y=0, zoomlevel=max_zl)
# Flatten the whitespace into 1 dimension: pixels
pixels_0_0 = np.array(tile_0_0).flatten()

# obtain tissue tile
tile_x_y = core.get_tile(slide, x=int(dims[0] / 4), y=int(dims[1] / 2), zoomlevel=max_zl)
# Flatten the tissue into 1 dimension: pixels
pixels_x_y = np.array(tile_x_y).flatten()

# Display whitespace image
plt.subplot(2,2,1)
plt.title('Whitespace image')
plt.axis('off')
plt.imshow(tile_0_0, cmap="gray")

# Display a histogram of whitespace pixels
plt.subplot(2,2,2)
plt.xlim((0,255))
plt.title('Whitespace histogram')
plt.hist(pixels_0_0, bins=64, range=(0,256), color="red", alpha=0.4, density=True)  # density replaces the removed normed argument

# Display tissue image
plt.subplot(2,2,3)
plt.title('Tissue image')
plt.axis('off')
plt.imshow(tile_x_y, cmap="gray")

# Display a histogram of tissue pixels
plt.subplot(2,2,4)
plt.xlim((0,255))
plt.title('Tissue histogram')
plt.hist(pixels_x_y, bins=64, range=(0,256), color="red", alpha=0.4, density=True)

# Display the plot
plt.tight_layout()
plt.show()

As you can see, we make use of PyPlot subplots here. The output looks as follows, and shows how distinctive the histogram for whitespace is compared to actual tissue areas.

We can add a cumulative distribution function to further illustrate the difference:

# Display whitespace image
plt.subplot(2,3,1)
plt.title('Whitespace image')
plt.axis('off')
plt.imshow(tile_0_0, cmap="gray")

# Display a histogram of whitespace pixels
plt.subplot(2,3,2)
plt.xlim((0,255))
plt.title('Whitespace histogram')
plt.hist(pixels_0_0, bins=64, range=(0,256), color="red", alpha=0.4, density=True)

# Display a cumulative histogram of the whitespace
plt.subplot(2,3,3)
plt.hist(pixels_0_0, bins=64, range=(0,256), density=True, cumulative=True, color='blue')
plt.xlim((0,255))
plt.grid(False)
plt.title('Whitespace CDF')

# Display tissue image
plt.subplot(2,3,4)
plt.title('Tissue image')
plt.axis('off')
plt.imshow(tile_x_y, cmap="gray")

# Display a histogram of tissue pixels
plt.subplot(2,3,5)
plt.xlim((0,255))
plt.title('Tissue histogram')
plt.hist(pixels_x_y, bins=64, range=(0,256), color="red", alpha=0.4, density=True)

# Display a cumulative histogram of the tissue
plt.subplot(2,3,6)
plt.hist(pixels_x_y, bins=64, range=(0,256), density=True, cumulative=True, color='blue')
plt.xlim((0,255))
plt.grid(False)
plt.title('Tissue CDF')

Note that we’re now using a 2 x 3 grid to display the histogram and the cumulative distribution function (CDF) next to each other, which ends up looking like this:

Back to basic (statistics)

We could now write a complicated method that manually scans through the pixel arrays and assesses how many bins there are, how they are distributed, etc. However, it is worth noting that the histogram buckets that do NOT show any values are not pixel values of 0; the histogram shows frequencies, so an empty space in the histogram simply means that NO pixels are available at that intensity.
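The difference between a pixel value of 0 and a bucket count of 0 is easy to see on a toy example. The following plain-Python sketch uses a hypothetical list of five pixel values and four equally wide buckets; the helper name and the data are illustrative assumptions, not part of the scripts in this post.

```python
def bucket_counts(pixels, n_buckets=4, max_value=256):
    """Count how many pixel values fall into each equally wide intensity bucket."""
    width = max_value // n_buckets
    counts = [0] * n_buckets
    for p in pixels:
        counts[p // width] += 1    # bucket index is determined by the intensity
    return counts

# Hypothetical flattened tile: two dark pixels and three bright ones
pixels = [10, 20, 200, 210, 220]
counts = bucket_counts(pixels)

for i, n in enumerate(counts):
    print("intensity", i * 64, "to", i * 64 + 63, ":", n, "pixel(s)")
```

The two middle buckets come out at a count of 0 because no pixel has an intensity between 64 and 191, not because the tile contains black pixels.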

So after we visualized the pixel distribution through MatPlotLib’s PyPlot, it’s worth looking at some basic statistics for the flattened pixel arrays. NumPy’s basic .std() and .mean() functions are good enough to do the job for now. The script becomes:

from pma_python import core
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

slide = "C:/my_slides/CMU-1.svs"
max_zl = core.get_max_zoomlevel(slide)
dims = core.get_zoomlevels_dict(slide)[max_zl]

# obtain whitespace tile
tile_0_0 = core.get_tile(slide, x=0, y=0, zoomlevel=max_zl)
# Flatten the whitespace into 1 dimension: pixels
pixels_0_0 = np.array(tile_0_0).flatten()
# print whitespace statistics
print("Whitespace Mean = ", np.mean(pixels_0_0))
print("Whitespace Std  = ", np.std(pixels_0_0))
print("Whitespace kurtosis = ", scipy.stats.kurtosis(pixels_0_0))
print("Whitespace skewness = ", scipy.stats.skew(pixels_0_0))

# obtain tissue tile
tile_x_y = core.get_tile(slide, x=int(dims[0] / 4), y=int(dims[1] / 2), zoomlevel=max_zl)
# Flatten the tissue into 1 dimension: pixels
pixels_x_y = np.array(tile_x_y).flatten()
# print tissue statistics
print("Tissue Mean = ", np.mean(pixels_x_y))
print("Tissue Std. = ", np.std(pixels_x_y))
print("Tissue kurtosis = ", scipy.stats.kurtosis(pixels_x_y))
print("Tissue skewness = ", scipy.stats.skew(pixels_x_y))

The output (for the same tiles as we visualized using PyPlot earlier) is as follows:

The numbers suggest that the mean (grayscale) value of a whitespace tile is a lot higher than that of a tissue tile. More importantly: the standard deviation (the square root of the variance) could be an even more important indicator. Naturally, we would expect the standard deviation of (homogeneous) whitespace to be much less than that of tissue pictures containing a range of different features.
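The relationship between the two indicators can be checked with a few lines of standard-library Python. The pixel values below are made up purely for illustration; they are not taken from the CMU-1 slide.

```python
import math
import statistics

# Hypothetical grayscale values, chosen only for illustration
whitespace = [250, 252, 251, 253, 250, 252]   # nearly uniform background
tissue = [90, 180, 130, 210, 60, 150]         # wide spread of intensities

for name, pixels in (("whitespace", whitespace), ("tissue", tissue)):
    var = statistics.pvariance(pixels)   # population variance, like np.var
    std = statistics.pstdev(pixels)      # population standard deviation, like np.std
    print(name, "mean =", round(statistics.mean(pixels), 1), "std =", round(std, 2))
    # the standard deviation is the square root of the variance
    assert math.isclose(std, math.sqrt(var))
```

On this toy data the whitespace sample has the higher mean and the much smaller standard deviation, which is the pattern we would expect from a homogeneous background.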

Kurtosis and skewness are harder to interpret and place, but let’s keep them in there for our further investigation below.

Another histogram

We started this post by analyzing an individual tile’s grayscale histogram. We now summarized that same histogram in a limited number of statistical measurements. In order to determine whether those statistical calculations are an accurate substitute for a visual inspection, we compute those parameters for each tile at a zoomlevel, and then plot those data as a heatmap.

The script for this is as follows. Depending on how fast your machine is, you can play with the zoomlevel parameter at the top of the script, or set it to the maximum zoomlevel (at which point you’ll be evaluating every pixel in the slide).

from pma_python import core
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import sys

slide = "C:/my_slides/CMU-1.svs"
max_zl = 6 # or set to core.get_max_zoomlevel(slide)
dims = core.get_zoomlevels_dict(slide)[max_zl]

means = []
stds = []
kurts = []
skews = []

for x in range(0, dims[0]):
    for y in range(0, dims[1]):
        tile = core.get_tile(slide, x=x, y=y, zoomlevel=max_zl)
        pixels = np.array(tile).flatten()
        means.append(np.mean(pixels))
        stds.append(np.std(pixels))
        kurts.append(scipy.stats.kurtosis(pixels))
        skews.append(scipy.stats.skew(pixels))
    print(".", end="")
    sys.stdout.flush()
print()

means_hm = np.array(means).reshape(dims[0],dims[1])
stds_hm = np.array(stds).reshape(dims[0],dims[1])
kurts_hm = np.array(kurts).reshape(dims[0],dims[1])
skews_hm = np.array(skews).reshape(dims[0],dims[1])

plt.subplot(2,4,1)
plt.title('Mean histogram')
plt.hist(means, bins=100, color="red", density=True)
plt.subplot(2,4,5)
plt.title('Mean heatmap')
plt.pcolor(means_hm)

plt.subplot(2,4,2)
plt.title('Std histogram')
plt.hist(stds, bins=100, color="red", density=True)
plt.subplot(2,4,6)
plt.title("Std heatmap")
plt.pcolor(stds_hm)

plt.subplot(2,4,3)
plt.title('Kurtosis histogram')
plt.hist(kurts, bins=100, color="red", density=True)
plt.subplot(2,4,7)
plt.title("Kurtosis heatmap")
plt.pcolor(kurts_hm)

plt.subplot(2,4,4)
plt.title('Skewness histogram')
plt.hist(skews, bins=100, color="red", density=True)
plt.subplot(2,4,8)
plt.title("Skewness heatmap")
plt.pcolor(skews_hm)

plt.tight_layout()
plt.show()

The result is a 2 row x 4 columns PyPlot grid, shown below for a selected zoomlevel of 6:

The standard deviation as well as the mean grayscale value of a tile seem accurate indicators to determine the position of where tissue can be found on the slide; skewness and kurtosis not so much.

Both the mean histogram and the standard deviation histogram show a bimodal distribution, and we can confirm their behavior by looking at other slides. It even works on slides that are only lightly stained (we will return to these in a later post and see how we can improve the detection for them):

A binary map

We can now write a function that determines whether a tile is (mostly) tissue, based on the mean and standard deviation of the grayscale pixel values, like this:

def is_tissue(tile):
    pixels = np.array(tile).flatten()
    mean_threshold = np.mean(pixels) < 192   # 192 is 75% of the 0-255 range
    std_threshold = np.std(pixels) < 75
    return mean_threshold and std_threshold

Writing a similar function that determines whether a tile is (mostly) whitespace, is possible, too, of course:

def is_whitespace(tile):
    return not is_tissue(tile)

Similarly, we can even use this function to now construct a black and white map of where the tissue is located on a slide. The default color map of PyPlot on our system nevertheless rendered the map blue and yellow:
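For readers who want to reproduce such a map, here is a minimal sketch. It assumes a hypothetical tile source passed in as a callable get_tile(x, y), so that it runs without a slide server; the names tissue_map and fake_tile and the synthetic tiles are our own illustrative choices. The thresholds are the same as in is_tissue() above.

```python
import numpy as np

def is_tissue(tile):
    # same thresholds as the is_tissue() function above
    pixels = np.array(tile).flatten()
    return bool(np.mean(pixels) < 192 and np.std(pixels) < 75)

def tissue_map(get_tile, n_x, n_y):
    """Return an (n_y, n_x) boolean array that is True where a tile looks like tissue."""
    grid = np.zeros((n_y, n_x), dtype=bool)
    for y in range(n_y):
        for x in range(n_x):
            grid[y, x] = is_tissue(get_tile(x, y))
    return grid

def fake_tile(x, y):
    """Stand-in tile source: two 'tissue' tiles in the middle row, whitespace elsewhere."""
    if 1 <= x <= 2 and y == 1:
        return np.linspace(100, 180, 64).reshape(8, 8)   # mid-gray gradient
    return np.full((8, 8), 255.0)                        # pure white

demo = tissue_map(fake_tile, n_x=4, n_y=3)
print(demo.astype(int))
```

With PMA.start running, the stand-in could be replaced by something like lambda x, y: core.get_tile(slide, x=x, y=y, zoomlevel=zl), and passing cmap="gray" to plt.pcolor(demo.astype(int), cmap="gray") should give a black and white rendering instead of the default colors.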

In closing

This post started with “Let’s say that you have a great algorithm to distinguish between PDL1- and CD8-stained cells.” We didn’t quite solve that challenge yet. But we did show you some important initial steps you’ll have to take to get to that part eventually.

In the next couple of blogs we want to elaborate on the topic of tissue detection some more. We want to discuss different types of input material, as well as see if we can use a machine learning classifier to replace the deterministic mapping procedure developed here.

But first, we’ll get back to testing, and we’ll tell you why it was important to have the is_tissue() function from the last paragraph developed first.

May 7, 2018 / September 24, 2018

To OpenSlide or not to OpenSlide

The legacy

When you say digital pathology or whole slide imaging, sooner or later you end up with OpenSlide. This is essentially a programming library. Back in 2012, a pathologist alerted me to it (there’s some irony to the fact that he found it before me, the bioinformatician). He didn’t know what to do with it, but it looked interesting to him.

OpenSlide is how Pathomation got started. We actually contributed to the project.

We also submitted sample files to add to their public repository (and today, we use this repository frequently ourselves, as do many others, I’m sure).

Our names are mentioned in the Acknowledgements section of the OpenSlide paper.

But today, the project has rather slowed down. This is easy to track through their GitHub activity page:

Kudos to the team; they do still make an effort to fix bugs that crop up, but today’s activities seem limited to maintenance. Of course there’s a possibility that nobody gets the code through GitHub anymore, but rather through one of the affiliate projects that it is embedded into.

Consider this though: OpenSlide discussions about support for (immuno)fluorescence and z-stacking date back to 2012, but never resulted in anything. Similarly, there are probably about a dozen file formats out there that they don’t support (or flavors that they don’t support, like Sakura’s SVSlide format, which was recently redesigned). We made a table of the formats that we support, and they don’t, at http://free.pathomation.com/pma_start_omero_openslide/

Free at last

Our own software, we’re proud to say, is OpenSlide-free. PMA.start no longer ships with the OpenSlide binaries. The evolution our software has gone through then is as follows (the table doesn’t list all the formats we support; for a full list see http://free.pathomation.com/formats):

At one time, we realized there was really only one file format left that we still ran through OpenSlide, and with the move to cloud storage (more on that in a later post), we decided we might as well make one final effort to re-implement a parser for 3DHistech MRXS slides ourselves.

Profiling

Of course all of this moving away from OpenSlide is useless if we don’t measure up in terms of performance.

So here’s what we did: we downloaded a number of reference slides from the OpenSlide website. Then, we took the latest GAMP5-validated version of our PMA.core 1.2 software, and rerouted its slide parsing routines to OpenSlide. This resulted in a PMA.core 1.2 build that, instead of just 2 (Sakura SVSlide and 3DHistech MRXS), now reads 5 different WSI file formats through OpenSlide: Sakura SVSlide, 3DHistech MRXS, Leica SCN, Aperio SVS, and Hamamatsu NDPI.

Our test methodology consists of the following steps:

For each slide:
    Determine the deepest zoomlevel
    From this zoomlevel select 200 random tiles
    Retrieve each tile sequentially
    Keep track of how long it takes to retrieve each tile
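The steps above can be sketched as a small timing harness. This is not the script we used for the measurements; it is a minimal illustration in which fetch_tile is a hypothetical callable standing in for whatever retrieves a tile from the server, and the tile counts are made up.

```python
import random
import time

def time_random_tiles(fetch_tile, tiles_x, tiles_y, n_samples=200, seed=None):
    """Fetch n_samples random tiles one after another and return seconds per fetch."""
    rng = random.Random(seed)
    timings = []
    for _ in range(n_samples):
        x = rng.randrange(tiles_x)     # valid tile columns are 0 .. tiles_x - 1
        y = rng.randrange(tiles_y)     # valid tile rows are 0 .. tiles_y - 1
        start = time.perf_counter()
        fetch_tile(x, y)               # sequential: wait for each tile before the next
        timings.append(time.perf_counter() - start)
    return timings

def dummy_fetch(x, y):
    """Stand-in fetcher so the sketch runs without a slide server."""
    return bytes(256)

timings = time_random_tiles(dummy_fetch, tiles_x=50, tiles_y=40, n_samples=200, seed=1)
print(len(timings), "timings, mean", sum(timings) / len(timings), "seconds")
```

Passing the fetcher in as a parameter keeps the sampling and timing logic identical across the different PMA.core instances; only the callable changes.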

We run this scenario on 3 instances of PMA.core:

PMA.core 1.2 without OpenSlide (except for SVSlide and MRXS)
PMA.core 1.2 custom-build with OpenSlide (for all 5 file formats)
PMA.core 2 beta without OpenSlide

The random tiles selected from the respective slides are different each time. The tiles extracted from slide.ext on PMA.core instance 1 are different from the ones retrieved through PMA.core instance 2.

Results

As simple as the above script sounds, we discovered some oddities while running it for each file format.

Let’s start with the two easiest ones to explain: SVS and SVSlide.

A histogram of the recorded times for SVS yields the following chart:

We can see visually that PMA.core 1.2 without OpenSlide is a little faster than the PMA.core 1.2 custom-build with OpenSlide, and that we were able to improve this performance yet again in PMA.core 2.

The p-values confirm this. First we do an F-test to determine which T-test we need (p-values of 1.66303E-78 and 2.03369E-05 suggest unequal variances).

Next, we find a p-value of 0.859 (n=200) for the difference in mean tile retrieval time between PMA.core 1.2 with and without OpenSlide, and a p-value of 2.363E-06 for the difference in means between PMA.core 1.2 with OpenSlide and PMA.core 2 without OpenSlide.
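For readers unfamiliar with this two-step procedure, the sketch below computes the variance ratio (the F statistic) and Welch’s t statistic for unequal variances using only the standard library. The timings are synthetic values generated for illustration, NOT our measurements, and the function names are our own.

```python
import math
import random
import statistics

def variance_ratio(a, b):
    """F statistic: ratio of the two sample variances."""
    return statistics.variance(a) / statistics.variance(b)

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Synthetic tile retrieval times in seconds, NOT the measurements from this post
rng = random.Random(42)
fast = [rng.gauss(0.080, 0.010) for _ in range(200)]
slow = [rng.gauss(0.095, 0.030) for _ in range(200)]

f_stat = variance_ratio(slow, fast)
t_stat, df = welch_t(fast, slow)
print("F =", round(f_stat, 2), " t =", round(t_stat, 2), " df =", round(df, 1))
```

A variance ratio far from 1 is what steers the choice towards the unequal-variance test. To obtain the p-value itself, scipy.stats.ttest_ind(fast, slow, equal_var=False) performs the same Welch test.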

We see this as conclusive evidence that PMA.core 2 (as well as PMA.start, which contains a limited version of the PMA.core 2 code in the form of PMA.core.lite) can render tiles faster than OpenSlide can.

What about SVSlide?

Again, let’s start by looking at the histogram:

This is the same trend as we saw for SVS, so let’s see if we can confirm this with statistics. The F-statistic between PMA.core 1.2 with and without OpenSlide yields a p-value of 3.064E-22; between PMA.core 1.2 without OpenSlide and PMA.core 2 we get a p-value of 3.068E-13.

Subsequent T-tests (assuming unequal variance) between PMA.core 1.2 with and without OpenSlide show a p-value of 8.031E-19; between PMA.core 1.2 with OpenSlide and PMA.core 2 we get a p-value of 4.10521E-17 (one-tailed).

Again, we conclude that both PMA.core 1.2 and PMA.core 2 (as well as PMA.start) are faster to render Sakura SVSlide slides than OpenSlide.

What about the others?

We’re still working on getting data for the other file formats. Stay tuned!

April 10, 2018 / December 7, 2020

Working with digital microscopy imaging data using our Python SDK

Representation

PMA.start is a free desktop viewer for whole slide images. In our previous post, we introduced you to pma_python, a novel package that serves as a wrapper-library and helps interface with PMA.start’s back-end API.

The images PMA.start typically deals with are called whole slide images, so how about we show some pixels? As it turns out, this is really easy. Just invoke the show_slide() call. Assuming you have a slide at c:\my_slides\alk_stain.mrxs, we get:

from pma_python import core
slide =  "C:/my_slides/alk_stain.mrxs"
core.show_slide (slide)

The result depends on whether you’re using PMA.start or a full version of PMA.core. If you’re using PMA.start, you’re taken to the desktop viewer:

If you’re using PMA.core, you’re presented with an interface with fewer frills: the web browser is still involved, but nothing more than scaffolding code around a PMA.UI.View.Viewport is offered (which actually allows for more powerful applications):

Associated images

But there’s more to these images; if you only wanted to view a slide, you wouldn’t bother with Python in the first place. So let’s see what else we can get out of them.

Assuming you have a slide at c:\my_slides\alk_stain.mrxs, you can execute the following code to obtain a thumbnail image representing the whole slide:

from pma_python import core
slide =  "C:/my_slides/alk_stain.mrxs"
thumb = core.get_thumbnail_image(slide)
thumb.show()

But this thumbnail presentation alone doesn’t give you the whole picture. You should know that a physical glass slide usually consists of two parts: the biggest part of the slide contains the specimen of interest and is represented by the thumbnail image. However, near the end, a label is usually pasted on with information about the slide: the stain used, the tissue type, perhaps even the name of the physician. More recently, oftentimes the label has a barcode printed on it, for easy and automated identification of a slide. The label is therefore sometimes also referred to as “barcode”. Because the two terms are used so interchangeably, we decided to support them in both forms, too. This makes it easier to write code that not only syntactically makes sense, but also applies semantically in your work-environment.

A systematic representation of a physical glass slide can then be given as follows:

The pma_python library then has three methods to obtain slide representations, two of which are aliases of one another:

core.get_thumbnail_image() returns the thumbnail image

core.get_label_image() returns the label image

core.get_barcode_image() is an alias for get_label_image

All of the above methods return PIL Image objects. It actually took some discussion to figure out how to package the data. Since the SDK wraps around an HTTP-based API, we settled on representing pixels through Pillow. Pillow is the successor to the Python Imaging Library (PIL). The package should have been installed for you automatically when you obtained pma_python.

The following code shows all three representations of a slide side by side:

from pma_python import core
slide =  "C:/my_slides/alk_stain.mrxs"
thumb = core.get_thumbnail_image(slide)
thumb.show()
label = core.get_label_image(slide)
label.show()
barcode = core.get_barcode_image(slide)
barcode.show()

The output is as follows:

Note that not all WSI files have label / barcode information in them. In order to determine what kind of associated images there are, you can inspect a SlideInfo dictionary first to see what’s available:

info = core.get_slide_info(slide)
print(info["AssociatedImageTypes"])

AssociatedImageTypes may refer to more than thumbnail or barcode images, depending on the underlying file format. The most common use of this function is to determine whether a barcode is included or not.

You could write your own function to determine whether your slide has a barcode image:

def slide_has_barcode(slide):
    info = core.get_slide_info(slide)
    return "Barcode" in info["AssociatedImageTypes"]

Tiles in PMA.start

We can access individual tiles within the tiled stack using PMA.start, but before we do that we should first look some more at a slide’s metadata.

We can start by making a table of all zoomlevels and the number of tiles per zoomlevel, along with the magnification represented at each zoomlevel:

from pma_python import core
import pandas as pd

level_infos = []
slide = "C:/my_slides/alk_stain.mrxs"
levels = core.get_zoomlevels_list(slide)
for lvl in levels:
    res_x, res_y = core.get_pixel_dimensions(slide, zoomlevel = lvl)
    tiles_xyz = core.get_number_of_tiles(slide, zoomlevel = lvl)
    # one row per zoomlevel; avoid shadowing the built-in dict
    level_info = {
        "res_x": round(res_x),
        "res_y": round(res_y),
        "tiles_x": tiles_xyz[0],
        "tiles_y": tiles_xyz[1],
        "approx_mag": core.get_magnification(slide, exact = False, zoomlevel = lvl),
        "exact_mag": core.get_magnification(slide, exact = True, zoomlevel = lvl)
    }
    level_infos.append(level_info)

df_levels = pd.DataFrame(level_infos, columns=["res_x", "res_y", "tiles_x", "tiles_y", "approx_mag", "exact_mag"])
print(slide)
print(df_levels)

The result for our alk_stain.mrxs slide looks as follows:

Now that we have an idea of the number of zoomlevels to expect and how many tiles there are at each zoomlevel, we can request an individual tile easily. Let’s say that we wanted to request the middle tile at the middle zoomlevel:

slide = "C:/my_slides/alk_stain.mrxs"
levels = core.get_zoomlevels_list(slide)
lvl = levels[round(len(levels) / 2)]
tiles_xyz = core.get_number_of_tiles(slide, zoomlevel = lvl)         
x = round(tiles_xyz[0] / 2)
y = round(tiles_xyz[1] / 2)
tile = core.get_tile(slide, x = x, y = y, zoomlevel = lvl)
tile.show()

This should pop up a single tile:

OK, perhaps not that impressive.

In practice, you’ll typically want to loop over all tiles in a particular zoomlevel. The following code will show all tiles at zoomlevel 1 (increase to max_zoomlevel at your own peril):

tile_sz = core.get_number_of_tiles(slide, zoomlevel = 1) # zoomlevel 1
for xTile in range(0, tile_sz[0]):
    for yTile in range(0, tile_sz[1]):
        tile = core.get_tile(slide, x = xTile, y = yTile, zoomlevel = 1)
        tile.show()

The advantage of this approach is that you have control over the direction in which tiles are processed. You can also process row by row and perhaps print a status update after each row is processed.
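As a rough sketch of that row-by-row variant: the helper below takes the tile fetcher as a parameter, so the looping logic can be tried without PMA.start running. The names process_by_row, fetch_tile and handle_tile are placeholders for illustration and are not part of pma_python. With PMA.start available, the fetcher could be something like lambda x, y: core.get_tile(slide, x = x, y = y, zoomlevel = 1).

```python
def process_by_row(tiles_x, tiles_y, fetch_tile, handle_tile):
    """Visit tiles row by row (top to bottom), left to right within a row.

    fetch_tile(x, y) and handle_tile(tile) are caller-supplied callables;
    both names are placeholders for illustration.
    """
    processed = 0
    for y in range(tiles_y):          # outer loop: one row at a time
        for x in range(tiles_x):      # inner loop: columns within the row
            handle_tile(fetch_tile(x, y))
            processed += 1
        print(f"row {y + 1} of {tiles_y} done ({processed} tiles so far)")
    return processed


# Stand-in fetcher so the sketch runs without a tile server.
visited = []
total = process_by_row(3, 2, lambda x, y: (x, y), visited.append)
```

Swapping the two loops gives column-by-column processing instead, which is the order used in the snippet above.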

However, if all you care about is to process all rows left to right, top to bottom, you can opt for a more condensed approach:

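import numpy
tile_sz = core.get_number_of_tiles(slide, zoomlevel = 4) # tile counts differ per zoomlevel
for tile in core.get_tiles(slide, toX = tile_sz[0], toY = tile_sz[1], zoomlevel = 4):
    data = numpy.array(tile)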

The body of the for-loop now processes all tiles at zoomlevel 4 one by one and converts each of them into a numpy array, ready for image processing to occur, e.g. through OpenCV. But that will have to wait for another post.

April 7, 2018 (updated June 13, 2018)

We did it! SDK update for Python.

SDK update

Now that I have a basic understanding of Python (including the popular packages pandas, matplotlib.pyplot, and (to a slightly lesser extent) numpy), I’m moving ahead and am putting out Java, R, Eiffel, Haskell, and F# API wrapper libraries as part of our SDK!

Ok, perhaps not.

To prevent ending up with a plethora of half-finished sourcecode files across a variety of languages, we thought it more prudent this time to work on a comprehensive library in a single programming language first, and then port it to other environments.

For Python, this means writing one or more modules and publishing them as a package. Others have done this before us, so how hard could it be, right?

So we set off to write an initial module, with a number of procedures to do basic tasks such as obtaining lists of slides, navigating a hierarchical folder structure, and, of course, extracting tiles. The code was deposited in GitHub.

After GitHub typically comes support for the Python Package Index (PyPI). We recruited somebody through UpWork to help us with this process and here we are: getting an interface to Pathomation software in Python is now as easy as issuing the command “python3.exe -m pip install pma-python”.

Oh, and we already tested this in other environments, too. The following screenshot was taken on a Mac (thank you, Pieter-Jan Van Dam):

Getting started

What can you do with our Python SDK today? If you have PMA.start installed on your system, you can go right ahead and try out the following code:

from pma_python import pma

if pma.is_lite():
    print("Congratulations; PMA.start is running on your system")
    print("You’re running PMA.core.lite version " + pma.get_version_info())
else:
    print("PMA.start not found. Either you don’t have it installed, or you don’t have the server-component running currently")
    raise Exception("PMA.start not detected")

You can use the same is_lite() method by the way to ask your end-user to make sure PMA.start IS running before continuing the script:

from pma_python import pma
from sys import exit

if (not pma.is_lite()):
    print("PMA.core.lite is NOT running.")
    print("Make sure PMA.core.lite is running and press <enter> to continue")
    input()

if (not pma.is_lite()):
    print("PMA.core.lite is NOT running.")
    exit(1)

Slides

Now that you know how to establish the availability of PMA.start as a back-end for whole slide imaging (WSI) data, you can start looking for slides:

from pma_python import pma

if not pma.is_lite():
    raise Exception("PMA.start not detected")

# assume that you have slides in C:\my_slides (note the capital C)
for slide in pma.get_slides("C:/my_slides"):
    print(slide)

But you knew already that you had slides in that folder, of course. By the way, if NO data shows up, check the specified path. It’s case sensitive, and drive letters have to be capitalized. Also make sure to use a forward slash instead of the more traditional (on Windows at least) backslash.

Now what you probably didn’t know yet are the dimensions of the slide, both in pixels and in micrometers.

print("Pixel dimensions of slide:")
xdim_pix, ydim_pix = pma.get_pixel_dimensions(slide)
print(str(xdim_pix) + " x " + str(ydim_pix))

print("Slide surface area represented by image:")
xdim_phys, ydim_phys = pma.get_physical_dimensions(slide)
print(str(xdim_phys) + "µm x " + str(ydim_phys) + "µm = ", end="")
print(str(xdim_phys * ydim_phys / 1E6) + " mm2")

Below is the output on our computer, having three 3DHistech MRXS slides in the C:\my_slides folder. You can use this type of output as a sanity check, too.

While the numbers in µm seem huge, they start to make more sense once translated to the surface area captured. As a reminder: 1 mm² = 1,000,000 µm², which explains why we divide by 1E6 to get the area in mm². 1020 mm² still not saying much? Then keep in mind that 100 mm² equals 1 cm², and that 10 cm² can very well constitute a 2 cm x 5 cm piece of tissue. A standard physical slide measures roughly 7.5 cm x 2.5 cm. Phew, glad the data matches reality!
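The unit conversions are easy to get wrong by a factor of a thousand, so here is the arithmetic as a small sketch. The helper names and the slide dimensions are made up; the dimensions were chosen only because they give the 1020 mm² mentioned above.

```python
def area_mm2(width_um, height_um):
    """Surface area in mm² for dimensions given in micrometers."""
    return width_um * height_um / 1e6   # 1 mm² = 1,000,000 µm²


def area_cm2(width_um, height_um):
    """Surface area in cm² (100 mm² = 1 cm²)."""
    return area_mm2(width_um, height_um) / 100


# Made-up dimensions in µm, for illustration only.
width_um, height_um = 51_000, 20_000
print(area_mm2(width_um, height_um), "mm²")
print(area_cm2(width_um, height_um), "cm²")
```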

Determining a slide’s magnification

We can also determine the magnification at which an image was registered. The get_magnification function has a Boolean exact= parameter that works as follows: when set to False, get_magnification will round to the nearest “whole number” magnification that’s typically mentioned on a microscope’s objective lens. This could be 5X, 20X, 40X… But bear in mind that when a microscopist looks through his device, he can fine-focus on a sample, thereby slightly modifying the actual magnification used, perhaps from 40X to 38X (even though the label on the lens still says 40X of course). Scanners work in the same manner; because of auto-focusing, the end-result of a scan may be in 38X instead of 40X, or 21X instead of 20X. And this is the number that is returned when the exact= parameter is set to True.
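To make the difference between a measured and a nominal magnification concrete, here is a small sketch that snaps a measured value to the nearest objective-lens value. The list of nominal values and the nearest_nominal helper are assumptions for illustration; this is not necessarily how pma_python computes its rounded value internally.

```python
# Assumed list of common objective magnifications (illustrative, not exhaustive).
NOMINAL_MAGNIFICATIONS = [1.25, 2.5, 5, 10, 20, 40, 63, 100]


def nearest_nominal(measured):
    """Return the nominal magnification closest to the measured one."""
    return min(NOMINAL_MAGNIFICATIONS, key=lambda nominal: abs(nominal - measured))


for measured in (38, 21, 4.8):
    print(f"measured {measured}X, nominal {nearest_nominal(measured)}X")
```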

Of course, when building a pandas DataFrame, you might as well include columns for both measurements (perhaps using the rounded measurement later for a classification task):

from pma_python import pma
import pandas as pd

if not pma.is_lite():
    raise Exception("PMA.start not detected")

# create blank list (to be converted into a pandas DataFrame later)
slide_infos = []

# assume that you have slides in C:\my_slides (note the capital C)
for slide in pma.get_slides("C:/my_slides"):
    # one row per slide (not named "dict", which would shadow the builtin)
    slide_info = {
        "slide": pma.get_slide_file_name(slide),
        "approx_mag": pma.get_magnification(slide, exact=False),
        "exact_mag": pma.get_magnification(slide, exact=True),
        "is_fluo": pma.is_fluorescent(slide),
        "is_zstack": pma.is_z_stack(slide)
    }
    slide_infos.append(slide_info)

df_slides = pd.DataFrame(slide_infos, columns=["slide","approx_mag","exact_mag", "is_fluo", "is_zstack"])
print(df_slides)

The output of this script on our computer is as follows:

Note that for one slide, both the exact and the approximate magnification are 0. This is because that particular slide is a .jpg-file, which doesn’t contain any useful (pixels per micron) metadata to use to determine the magnification.

Almost there

In our next post, we’ll show how you can retrieve various types of image data from your digitized slides.

April 6, 2018

E-learning update

Remember my training goals for 2018? I think I’m doing okay. I completed the Excel XSeries at edX. When completing the XSeries, you get a comprehensive, dedicated certificate for completing the full track; individual certificates are still available on a per-course basis, too.

Now that I’m done with this, I can definitely recommend an XSeries track at edX, because across the different courses, various aspects of a subject are really approached from different angles. The repetition you get over time is hugely beneficial to get a solid grasp on that subject.

Moving on then.

Since all ML-courses at one point or another converge to Python (including Ng’s new Deep Learning curriculum at Coursera), I temporarily switched to Datacamp’s Python track. This morning, I finished the Python Programmer track (consisting of 10 individual courses).

In the next two blog posts, I’ll talk more about how I’m putting my newly learned Python skills to work. Stay tuned.

March 29, 2018

About APIs and SDKs

API vs. SDK

Early on in the life of Pathomation, it became clear that in order to tackle the variety of use cases out there for digital pathology, we needed to build a piece of veritable digital chameleon software.

Luckily there is a way to do exactly that in engineering, and that is through the establishment of an Application Programming Interface (or API for short). It all starts with an API.

Hugo Bowne-Anderson of Datacamp explains what that means: An API is a set of protocols and routines for building and interacting with software applications.

Why are APIs important? An API is a bunch of code that allows two software programs to communicate with each other. You can connect to an API, pull data from it, and subsequently parse that data.

Using APIs has become the standard way to interact with applications ranging from Wikipedia to Twitter. PMA.core (and its little brother PMA.start) has an API, and our own product suite makes heavy use of it.

So while PMA.core isn’t our main product, it is definitely where everything starts, and today I see how many of our success stories can be attributed to the API that PMA.core offers.

Version 2

Currently we’re in the process of wrapping up version 2 of PMA.core. As it should be, we learned a lot in the last couple of years about how (not) to use our own interfaces.

Some good things:

Under the motto “eat your own dogfood”, we’ve successfully employed the API to our own benefit in separate projects that were built on top of PMA.core. The Willy Gepts collection exploits not only our slide visualization interface, but also our metadata engine. Pathotrainer is a great example of an end-user product built on top of PMA.core in the pharmaceutical space.
With a limited number of calls (about two dozen), we’ve been able to facilitate a very broad span of downstream consumers. Examples include Blackboard at the University of Antwerp, a completely custom HTML website for med school students in Brussels, and recently a completely new type of courseware for clinical (research) stakeholders.

Some not so good things:

We limited ourselves in terms of granularity. Here’s an example: PMA.core offers the possibility to store slide meta-data in an audit-trailed, 21 CFR Part 11-compliant manner. However, the present interface only allows you to get data for one slide, one form at a time. So when you work on courseware projects like PathoTrainer, you need to put in additional progress bars, while the underlying data is retrieved in an atomic fashion. In contrast, of course, you’d much rather be able to retrieve the n metadata sets for m slides in a single call. In version 2 we will be able to do just that, and more.
We need much more documentation and more sample code. Pointing people to your WSDL manifest is NOT sufficient.

SDK

We’ve always had the idea to bring out a complete Software Development Kit (or SDK for short). The idea for an SDK is simple enough: take everything your API can do and build content around every call and every combination of parameters imaginable.

But what exactly do you put in it? What languages do you support? Looking around our own environment, we chose Microsoft.Net, Java and PHP as prime candidates. We know there are more of course, but you have to pick your battles.

We thought we were doing pretty well, having Java and Microsoft.Net desktop application sample code, until one partner told us they were using Delphi. What’s next? WinDev (https://fr.wikipedia.org/wiki/WinDev)?

Two problems then exist with the SDK the way we currently have it:

When we make a matrix with documented features and supported languages, we end up with a rather sparse matrix. This means that some things are documented in one language, but not in another. There are a couple of essential tasks that we have in any environment of course, like establishing a connection, but it’s inconsistent at best. We thought actually that we could do this demand-driven, meaning that when someone asks us how to do task t in language l, we just fill up cell (t, l) in the aforesaid matrix. This works, to some extent; you can respond to the client, and be confident that the code you contribute is actually something application developers want. But the end-result is messy and doesn’t look very professional.
The code examples that we’ve been adding were mostly wrappers around API calls. In Microsoft.Net, we’d make a WebRequest call and interpret the returned JSON or XML stream. In PHP, we’d do a file_get_contents of whatever API call we needed to get the job done, and again interpret the results. This got the job done, but as a result much of the code that we delivered was more a tutorial in how to read web content and interpret the returned structured data. Ideally however, the examples should focus more on what can actually be done with the software (instead of how to do something).

Wrapper libraries

For the current version 2 then, I want to be more up-front with our SDK offering. I want to be better prepared. I don’t think we should try to convince people anymore to “just” go and use our API. It’s too abstract, too inconsistent even at places (sometimes for historical reasons; who thought you could grow legacy so quickly?), and frankly too steep a learning curve probably for a great many people.

Pathomation offers a platform for the development of digital pathology software. I want to make sure people like to stand on our platform to begin with. It’s not necessarily what I want to do with it… it’s what others would want to do with it.

We already have code in Python that fetches images from PMA.core and does “something” with them. But there’s nothing fancy about your sample Python script; you download a /region URL, then you process it like any other image. This is at API level.

A wrapper library can now add more functionality; repetitive basic tasks. What do developers not like to do? Write plumbing code. We can encapsulate all of that under the hood of a PyPI package or Java namespace. We’re already doing that for the front-end handling and representation of whole slide imaging content with our Javascript PMA.UI framework.

Here’s another idea: I can’t do a nuclear cell count on a (reduced-resolution) overview image; I need at least 20x resolution for that. But I can do a cell count on an individual tile at high resolution, and then put a dot on the overview image to at least indicate where the cell was found. What I would want to do instead is create an overview image of e.g. 1000 x 2000 pixels, then loop through x * y tiles at a zoomlevel that in reality represents 5000 x 10000 pixels (but which is too bulky to process in one go via /region; we’re not VIPS); process the individual tiles, then scale and imprint the result from each tile back onto the 1000 x 2000 pixel overview image.
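The coordinate bookkeeping behind that idea can be sketched without any imaging library. The functions below only map a detection inside a tile to a pixel on the overview; the names, the 500-pixel tile size and the detection list are assumptions for illustration, and a real pipeline would get its detections from a cell detector run on each tile.

```python
def to_overview(tile_x, tile_y, px, py, tile_size, scale):
    """Map pixel (px, py) inside tile (tile_x, tile_y) to overview coordinates.

    scale is the ratio between the zoomlevel's size and the overview's size,
    e.g. 5000 / 1000 = 5 for the numbers used in the text.
    """
    full_x = tile_x * tile_size + px      # position within the full zoomlevel
    full_y = tile_y * tile_size + py
    return int(full_x // scale), int(full_y // scale)


def imprint(detections, tile_size, scale, overview_w, overview_h):
    """Return the set of overview pixels to mark for a list of detections.

    Each detection is a (tile_x, tile_y, px, py) tuple.
    """
    dots = set()
    for tile_x, tile_y, px, py in detections:
        ox, oy = to_overview(tile_x, tile_y, px, py, tile_size, scale)
        if 0 <= ox < overview_w and 0 <= oy < overview_h:
            dots.add((ox, oy))
    return dots


# 5000 x 10000 pixel zoomlevel, 1000 x 2000 overview, assumed 500-pixel tiles.
detections = [(0, 0, 10, 10), (3, 7, 250, 499), (9, 19, 499, 499)]
dots = imprint(detections, tile_size=500, scale=5, overview_w=1000, overview_h=2000)
print(sorted(dots))
```

Because only one tile needs to be in memory at a time, the full 5000 x 10000 pixel image never has to be loaded in one piece.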

I can imagine a class representing a slide that has indexer logic that retrieves tiles in real time (sort of like a programmer’s server-side version of PMA.UI), but then destroys them again once the object is not needed anymore (so the memory usage doesn’t explode).

from pathomation import pma

dir = pma.get_first_non_empty_directory()
slide = pma.get_slides(dir)[0]

max_zoomlevel = pma.get_max_zoomlevel(slide)
print ("Max. Zoomlevel: " + str(max_zoomlevel))
print ("Size in pixels: " + str(pma.get_pixel_dimensions(slide)))
print ("Resolution (PPM): " + str(pma.get_pixels_per_micrometer(slide)))
print ("Physical resolution (µm x µm): " + str(pma.get_physical_dimensions(slide)))
print ("Number of channels: " + str(pma.get_number_of_channels(slide)))
print ("Slide is fluorescent? " + str(pma.is_fluorescent(slide)))
print ("Number of tiles: " + str(pma.get_number_of_tiles(slide)))

selectedZl = 1 # do the following on zoomlevel 1 for demo purposes
tileSz = pma.get_number_of_tiles(slide, selectedZl) # zoomlevel 1
for tile in pma.get_tiles(slide, toX = tileSz[0], toY = tileSz[1], zoomlevel = selectedZl):
     tile.show()
     # do something with the tile
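The indexer idea mentioned earlier could look roughly like the sketch below. It is an assumption-laden illustration, not an existing Pathomation class: LazySlide and its fetch_tile parameter are hypothetical, and the fetcher is injected so the example runs without a tile server. Tiles are fetched on demand and only a handful are kept around, so memory use stays bounded.

```python
from collections import OrderedDict


class LazySlide:
    """Index tiles as slide[x, y, zoomlevel]; fetch on demand, keep only a few.

    fetch_tile is a caller-supplied callable (hypothetical; with pma_python it
    could wrap a get_tile call). At most max_cached tiles stay in memory.
    """

    def __init__(self, fetch_tile, max_cached=4):
        self._fetch_tile = fetch_tile
        self._max_cached = max_cached
        self._cache = OrderedDict()

    def __getitem__(self, key):
        x, y, zoomlevel = key
        if key in self._cache:
            self._cache.move_to_end(key)      # mark as most recently used
            return self._cache[key]
        tile = self._fetch_tile(x, y, zoomlevel)
        self._cache[key] = tile
        if len(self._cache) > self._max_cached:
            self._cache.popitem(last=False)   # drop the least recently used tile
        return tile

    def cached_tiles(self):
        return len(self._cache)


# Stand-in fetcher that records calls instead of contacting a tile server.
calls = []


def fake_fetch(x, y, zoomlevel):
    calls.append((x, y, zoomlevel))
    return f"tile({x},{y},{zoomlevel})"


lazy = LazySlide(fake_fetch, max_cached=2)
first = lazy[0, 0, 1]
again = lazy[0, 0, 1]         # served from the cache, no second fetch
lazy[1, 0, 1]
lazy[2, 0, 1]                 # evicts (0, 0, 1)
```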

In closing

There’s a tremendous problem today with scientific software: published methods are described at a very high level; and not at all easy to replicate. And when they are detailed enough, or source code is made available, it’s not in an accessible language like Python (or Java for that matter; or it has sooooo many dependencies…). Too many research papers can be concluded with “now good luck finding the one former postdoc that actually knew how to do this…”

Remember “Developers developers developers”? Yes, make fun of Steve Ballmer all you want, but he got it right. You offer people basic services, and then get those people to develop software on top of your infrastructure.

So how do we intend on addressing this? By continuing to make our own software as versatile and kick-ass as possible (duh 😊), but also by going just one step further and reaching out to the legion of developers and researchers out there that currently still just have to make do with what they can get their hands on. We claim to have a better mousetrap for you, and we’ll prove it to you.

March 20, 2018 (updated September 24, 2018)

How to test and benchmark a tile server?

As mentioned before, I’m the Chief Technology Officer of Pathomation. Pathomation offers a platform of software components for digital pathology. We have a YouTube video that explains the whole thing.

You can try a local desktop-bound (some say “chained”) version of our software at http://free.pathomation.com. People tell us our performance is pretty good, which is always nice to hear. The problem is: can we objectively “prove” that we’re fast, too?

The core component of our software suite is aptly called “PMA.core” (we’re developers, not creative namegivers obviously). Conceptually, PMA.core is a slide tile server. Simply put, a tile server serves up data in regularised square-shaped portions called tiles. In the case of PMA.core, tiles are extracted in real time on an as-needed basis from selected whole slide images.

So how do you then test tile extraction performance?

At present, I can see three different ways:

On a systematic basis, going through all hypothetical tiles one by one, averaging the time it takes to render each one.
On a random basis
Based on a historical trail of already heavily viewed images.

Each of these methods has its pros and cons, and which one to use depends on what kind of property of the tile server you want to test in the first place.

Systematic testing

The pseudo-code for this one is straightforward:

For x in (0..max_number_of_horizontal_tiles):
  For y in (0..max_number_of_vertical_tiles):
    Extract tile at position (x, y)

However, we’re talking about whole slide image files here, which have more than just horizontal and vertical dimensions. Images are organized as a hierarchical, pyramid-structured stack, and can also contain z-levels, fluorescent layers, or even timelapse data. So the complete loop for systematic testing goes more like this:

For t in (0..max_timeframes):
  For z in (0..max_z_stacks):
    For l in (0..max_zoomlevels):
      For c in (0..max_channels):
        For x in (0..max_number_of_horizontal_tiles):
          For y in (0..max_number_of_vertical_tiles):
            Extract tile at timeframe t, z-stack z, zoomlevel l, channel c, position (x, y)

But that’s just nested looping; nothing fancy about this, really. We’ve been using this method of testing for as long as we can remember pretty much, and even wrapped our own internal tool around this, (again very aptly) called the profiler.
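A stripped-down version of such a systematic test could be sketched as follows. This is not the profiler’s actual code: systematic_benchmark and the extract_tile callable are hypothetical, and, like the pseudo-code, the sketch uses one tile grid size for all zoomlevels, which a real pyramid would not have.

```python
import itertools
import time


def systematic_benchmark(extract_tile, timeframes, z_stacks, zoomlevels,
                         channels, tiles_x, tiles_y):
    """Time every tile in nested-loop order and return summary statistics.

    extract_tile(t, z, l, c, x, y) is a caller-supplied callable; a real test
    would request the tile from the tile server.
    """
    timings = []
    ranges = (range(timeframes), range(z_stacks), range(zoomlevels),
              range(channels), range(tiles_x), range(tiles_y))
    # itertools.product iterates in the same order as the nested loops above.
    for t, z, l, c, x, y in itertools.product(*ranges):
        start = time.perf_counter()
        extract_tile(t, z, l, c, x, y)
        timings.append(time.perf_counter() - start)
    return {
        "tiles": len(timings),
        "total_s": sum(timings),
        "mean_s": sum(timings) / len(timings) if timings else 0.0,
    }


# Dummy extractor that does nothing, just to show the call.
stats = systematic_benchmark(lambda *coords: None, 1, 2, 3, 1, 4, 4)
print(stats["tiles"], "tiles timed")
```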

What’s good about this systematic tile test extraction method?

Easy to understand
Complete coverage; gives an accurate impression of what effort is needed to reconstruct the entire slide
Comparison between file formats (as long as they have similar zoomlevels, z-stacks, channels etc.) allows for benchmarking

What’s bad about this extraction method?

It’s unrealistic. Users never navigate through a slide tile by tile.
Considering the ratio of the data being extracted from different dimensions that can occur in a slide, you end up over-sampling some dimensions, while under-sampling others. Again this results in a number that, while accurate, is purely hypothetical, and doesn’t do a good job at illustrating the end-user’s experience.
In reality, end-users are only presented with a small percentage of the complete “universe” of tiles present in a slide. Ironically, the least interesting tiles will take the smallest amount of effort to send back (especially in terms of bandwidth, like “blank” tiles containing mostly whitespace on a slide or lumens within a specimen etc.)

Random testing

In random testing, we extract a predetermined number of tiles (either a fixed number or a percentage of the total number of available tiles). The pseudo-code is as follows:

Let n = predetermined number of (random) tiles that we want to extract
For i in (0.. n):
  Let t = random (0..max_timeframes)
  Let z = random (0..max_z_stacks)
  Let l = random (0..max_zoomlevels)
  Let c = random (0..max_channels)
  Let x = random (0..max_number_of_horizontal_tiles)
  Let y = random (0..max_number_of_vertical_tiles)
  Extract tile at timeframe t, z-stack z, zoomlevel l, channel c, position (x, y)

The same statistics can be reported back as with systematic testing, in addition to some coverage parameters (based on what percentage of the total tiles were retrieved).
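In Python, the sampling and the coverage figure could be sketched like this. The function names are made up for illustration, the sampling is with replacement (as in the pseudo-code), and the sketch again assumes one tile grid size for all zoomlevels.

```python
import random


def random_tile_sample(n, dims, seed=None):
    """Draw n random tile coordinates (t, z, l, c, x, y).

    dims holds the size of each dimension in that order. Sampling is with
    replacement, so the same tile can come up more than once.
    """
    rng = random.Random(seed)
    return [tuple(rng.randrange(size) for size in dims) for _ in range(n)]


def coverage(sample, dims):
    """Fraction of all possible tiles that appear at least once in the sample."""
    total = 1
    for size in dims:
        total *= size
    return len(set(sample)) / total


# timeframes, z-stacks, zoomlevels, channels, tiles x, tiles y
dims = (1, 2, 3, 1, 4, 4)
sample = random_tile_sample(20, dims, seed=42)
print(f"coverage: {coverage(sample, dims):.1%}")
```

Passing a fixed seed makes a run repeatable, which helps when comparing two versions of the server on the same set of tiles.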

Let’s look at some of the pros and cons of this one.

Here are the pros:

Faster than systematic sampling (see also the “one in ten rule” commonly used in statistics: https://en.wikipedia.org/wiki/One_in_ten_rule)
For deeper zoomlevels that have sufficient data, a more homogeneous sampling can be performed (whereas systematic sampling can oversample the deeper zoomlevels, as each deeper zoomlevel contains 4 times more tiles).
Certain features in the underlying file format (such as storing neighboring tiles close together) that may unjustly boost the results in systematic sampling are less likely to affect results here.

What about the cons?

Smaller coverage may require bootstrapping to get satisfying aggregate results.
Random sampling is still unrealistic. Neighboring tiles have less chance of being selected in sequence, while in reality of course any field of view presented to an end-user is composed of neighboring tiles.
Less reliable to compare one file format to the next, as this may again require bootstrapping.

Historic re-sampling

A third method can be devised based on historic trace information for one particular file. A file that’s included in a teaching collection and that’s been online for a while, has been viewed by hundreds or even thousands of users. We found in some of our longer running projects (like at http://histology.vub.ac.be or http://pathology.vub.ac.be) that students under such conditions typically are presented with the same tiles over and over again. This means that for a given slide that is fairly often explored, we can reconstruct the order in which the tiles for that particular slide are being served to the end-user, and that trace can be replicated in a testing scenario.

In terms of replication, this is then the most accurate way of testing. Apart from that, other advantages exist:

This is the best way to measure performance differences across different types of storage media. If for some reason a particular storage medium introduces a performance penalty because of its properties, this is the only reliable way to determine whether that penalty actually matters for whole slide image viewing.
For large enough numbers (entries in the historical tracing logs), a “natural” mixture of different tiles in different zoomlevels, channels, and z-stacks will be present. This sequence of tiles presented in the trace history automatically reflects how real users navigate a slide.

However, this method, too, has its flaws:

This type of testing and measuring cannot be used until a slide has actually been online for a certain time period and browsed by a large number of end-users.
Test results may be affected by the type of user that navigates the slide: we shouldn’t compare historical information about a slide browsed by seasoned pathologists with how novice med school students navigate a different slide. Apples and oranges, you know.
Because each slide has its own trace, it becomes really hard to compare performance between different file formats.
Setting up this type of test requires, of course, historical trace information. This means that this test is the most time consuming to set up: IIS logfiles have to be parsed, tile requests have to be singled out, matched to the right whole slide image etc.
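To give an idea of the log-parsing step, here is a minimal sketch that singles out tile requests from a W3C-style log, reading the field order from the “#Fields:” directive. The “/tile” URL stem, the query parameter names (path, zoomlevel, x, y) and the sample log lines are all made up for this example; they do not describe the actual PMA.core request format.

```python
from urllib.parse import parse_qs


def tile_requests(log_lines, tile_stem="/tile"):
    """Yield (path, zoomlevel, x, y) for each tile request in a W3C-style log.

    Field positions are read from the '#Fields:' directive. The '/tile' stem
    and the query parameter names are assumptions made for this sketch.
    """
    fields = []
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        entry = dict(zip(fields, line.split()))
        if entry.get("cs-uri-stem") != tile_stem:
            continue                      # not a tile request
        query = parse_qs(entry.get("cs-uri-query", ""))
        try:
            yield (query["path"][0], int(query["zoomlevel"][0]),
                   int(query["x"][0]), int(query["y"][0]))
        except (KeyError, ValueError):
            continue                      # malformed tile request; skip it


# Invented sample log, for illustration only.
log = [
    "#Fields: date time cs-method cs-uri-stem cs-uri-query sc-status",
    "2018-03-20 10:00:01 GET /tile path=a.mrxs&zoomlevel=3&x=4&y=2 200",
    "2018-03-20 10:00:01 GET /viewer - 200",
    "2018-03-20 10:00:02 GET /tile path=a.mrxs&zoomlevel=3&x=5&y=2 200",
]
trace = list(tile_requests(log))
print(trace)
```

The resulting list preserves the order in which tiles were served, so it can be replayed against a test server request by request.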

Preliminary conclusions

This section came out of discussing the various strategies with Angelos Pappas, one of our software engineers.

The current profiler that we use was built to do the following:

Compare the performance impact of code modifications in PMA.core, for example by changing around a parser class, or by modifying the flow in the core rendering system. We needed a way to compare versions relative to each other.
Compare the performance when rendering different slide formats. To do this, you need similar slides (dimensions, encoding method and of course pixel contents), stored in different formats. The “CMU-{N}” slides from OpenSlide are a good case, as well as the ones we bring back ourselves from various digital pathology events. This again, allows us to do relative comparisons that will give us hints about why a format is slower than another. Is it our parser that needs improvement? Is it the nature of the format? etc.
Compare the performance of different storage sources, like local storage versus SMB.

The profiler does all of the above nicely and it’s the only way we have to do such measurements. And even though the profiler supports a “random” mode, we hardly ever use it. Pathomation test engineers usually let the profiler run up to a specific percentage or for a specific period and compare the results.

Eventually what you want to accomplish with all this is to get an objective measurement of the user experience. The profiler wasn’t really meant to measure how good the user experience will be. This is a much more complicated matter, as it involves patterns that are very hard to emulate, network issues, etc. For example, if a user zooms into a region, the browser fires simultaneous requests for neighboring tiles. If you ever want to do this kind of measurement, perhaps your best bet would be to do this by commanding a browser. Again though, your measurements would give you a relative comparison.

February 19, 2018

Backup and reboot

A new year starts with good intentions. For me, the new year 2018 in part started with the realization that it’s been forever since I’ve written anything on the RealData blog.

I like writing. I like passing on knowledge. So the most pertinent question to ask then perhaps is: why did I stop?

Let’s start with the obvious one: lack of time. Getting content out in the right format is hard. It’s one thing to jot down some notes in a diary (a jupyter notebook perhaps); it’s quite another to deliver a publishable readable blog-entry.

And after all: what’s the point, right? Who reads this anyway? You? Why? Do I even have anything to tell you?

I think my journey is still worth sharing. So let’s look at some of the things that went wrong and contributed to my preliminary failure:

I slacked off on the online courses I was planning on taking back in April 2017. Perhaps even worse: I wasn’t passing anymore. At least not as easily anymore as I did in the beginning. Why was that?

I’m now convinced that taking online courses is an art in itself. It’s really easy to go to edx.org, or Coursera, or Datacamp, or any other platform and have the intention of “Let’s take every data science course there is and become a data scientist”. They all look so interesting, right?

I remember passing the Datacamp course Introduction to Python for data science (aka “Python for beginners”) in two days, with a 99% final score.

I started struggling really badly when taking the advanced Programming with Python for data science course. Again: why? Because anyone who’s been programming for over 25 years can probably pass any beginning programming course in any language. I was forgetting that I’m pretty proficient in C# today because I have been using it ever since the very first European .Net conference in Copenhagen, Denmark in 2001.

In order to become a proficient Python programmer, I have to start using the language myself in daily tasks. An obvious one, I’m afraid, but saying it is easier than doing it. There’s a level of humility attached to this: it’s hard for me to spend half an hour trying to figure out how to process something as trivial as a text file in Python, when I *know* I can write the same script in PHP in 5 minutes. Hey, I can probably do it with a console application in C#, too!

A few years ago, I was mentoring someone into software development. The person was highly educated, had some scripting experience with Excel VBA, and seemed ready for the next level. At one point I was looking something up for her on StackOverflow when she commented “oh, so programming is just copy/paste, right?”. Well, yes and no. Mostly no, of course. But websites like StackOverflow can make it look that way. And sometimes we fool ourselves into thinking that it really has become that easy. It hasn’t.

At around the same time that I was flunking my advanced Python curriculum, software testing was rapidly becoming a top priority at Pathomation. Huzzah, edx.org also appeared to have a “micromasters” track in software testing. This track consisted of only 3 courses, which seemed more manageable than the entire data science track.

I took the first course and passed. I should point out that these are academically organized courses, considered to be graduate level, and passing them is actually somewhat tough: you have to get an 80% in order to get the certificate.

Unfortunately, the testing curriculum went the same way as the data science curriculum I signed up for. I passed the first course, but slacked off halfway through the second course. There’s only so much coursework your brain can process in any given time period, too.

Half full or half empty? I passed the first course, but never got around to taking the other two.

They say that good intentions have a higher success rate of being realized when you write them down. So, here’s my intention for 2018: I’ve backed up a little and given some consideration to why I failed in my attempts last year.

So let’s take a reboot now and set out to complete the following courses and education tracks in 2018:

Microsoft Excel for the data analyst XSeries program (more on that in a next post)
Andrew Ng’s machine learning course
fast.ai deep learning course
Andrew Ng’s deep learning course

Curious to see if I’ll make it this time? Keep following this blog, then!