Three ways to transfer your virtual slides

Uploading and downloading

The Pathomation software platform for digital pathology and virtual microscopy offers powerful slide presentation and interaction capabilities.

Much of the focus (especially with respect to end-users) is on slide visualization. But workflow can be organized through our platform as well.

In this article we focus on upload and download capabilities. We distinguish between three different mechanisms. We also provide background and insights into how and why we decided to provide the functionality in this fashion.

PMA.transfer

For many cases, PMA.transfer is your initial workhorse of choice. It’s a user-friendly end-user facing application with many features. For ad hoc slide transfers from your local hard disk to a PMA.core instance, PMA.transfer is the go-to tool.

There are two ways to obtain PMA.transfer: you can download it individually from its own website, or it can be installed in combination with your latest PMA.start download during the installation procedure.

If you already have PMA.start up and running and don’t want to go through the trouble of downloading the complete setup package again, you can download PMA.transfer individually through its own website at https://www.pathomation.com/pma.transfer. From there, you can also select individual versions of the software.

Regardless of how you obtain PMA.transfer, the software requires PMA.start to be up and running in order to function.

The reason for this is that PMA.transfer relies on PMA.start to tell it which slides reside on the hard disk. PMA.start already has all the logic on board for slide data processing, and we didn’t want to re-implement all this code in PMA.transfer. The result is that PMA.transfer knows what a slide is on your hard disk through PMA.start. You need not concern yourself with figuring out whether it’s a single- or multi-file format that you’re working with. PMA.transfer deals with slides and that’s it. Do read our other blog article to find out why virtual slides are so big and complicated in the first place.

Transferring slides with PMA.transfer

Once you have PMA.transfer open, you see the slides on your own hard disk in the left-hand panel. You can connect to any PMA.core instance that you have the credentials for. If you find yourself re-connecting to the same server again and again, we recommend that you use the site manager. You can also use the site manager for complex scenarios like geo-replicated PMA.core instances.

You can both upload and download slides with PMA.transfer. What it is exactly that you do depends a bit on your point of view. Typically upload goes from your local computer to the PMA.core server; download is the reverse (from PMA.core to your hard disk). But there are ways to rig PMA.transfer so it operates between two server instances of PMA.core instead of interfacing with just your local computer. Essentially, you’re simply transferring slides from one location to another. The same API calls are being used under the hood regardless of whether you’re doing one or the other. More on that API later in this article by the way. Stay tuned.

We tried to make PMA.transfer extremely low-threshold and user-friendly, which means that there’s usually more than one way to get something accomplished. For instance, you can transfer slides from one side to the other either with context menus, or via drag and drop. Selecting multiple slides works the same way as you’re used to from Windows Explorer, with Ctrl+Click and Shift+Click actions.

A full manual of PMA.transfer workings, interfaces, and best practices is available as an online wiki.

PMA.core

As mentioned above, PMA.transfer can be configured for site-to-site transfer. But even in that scenario, it still relies on PMA.start. This isn’t always convenient or even possible. Therefore, PMA.core 2.0 and higher contains its own slide transfer interface. It’s not as feature-rich as PMA.transfer, but if you quickly want to copy a large volume of slides from one server to another, it will get the job done. Plus, you can use this to copy slides from anywhere to anywhere: migrating your old FTP server to updated S3-based cloud storage becomes a breeze. You can do it remotely, and asynchronously: just close your browser after initiating the operation and check back later.

Back-end API features

Occasionally, we get the question from people “ok, but I don’t want drag and drop stuff. I have an [incoming] folder somewhere on my system, and I want to automatically transfer all of those slides to their final destination overnight, when network use is low and I don’t feel like I’m hogging other people’s bandwidth…”. They then ask for the API call to make this happen.

If you’re one of these people: you’re on the right track. But what you’re really asking for is automation. Bear with us here. The API is part of that, but not the whole story.

See, these are the various API calls involved in slide transfer:

PMA.transfer uses these, and so does PMA.core. They work. But we still highly recommend that you do not call these methods yourself. You should rely on the SDKs instead.

The reason for this is related to the underlying structure of the virtual slides themselves. There can be many files, they can be big. Therefore, the API methods are mostly involved with uploading chunks or partial slide data to PMA.core. We understand that there may be instances where it is convenient to have direct interaction with these, to perhaps build progress indicators to monitor a processing workflow. For those uses, we recommend studying the implementation of the Core::upload() methods in PMA.python.

Automation

We illustrate here how slide transfer automation can work for you by means of a Jupyter notebook utilizing the PMA.python SDK.

The Core module contains both upload() and download() methods, which each rely on PMA.start.

When using the upload() function, the SDK assumes you’re transferring slides from PMA.start to PMA.core. Therefore, it’s good to do some preliminary verification and make sure that everything is in place before starting your transfer:


from pma_python import core
sourceSession = core.connect()
targetSession = core.connect("https://test.pathomation.com/pma.core", "username", "scrt")
print(sourceSession, targetSession)

This block of code should result in two meaningful strings. If not, you can already interrupt the flow and send an email to a system administrator, informing him or her that something is wrong.
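A minimal sketch of such a guard follows. It assumes that a failed core.connect() yields None or an empty value; the helper name require_sessions is hypothetical, and the notification step is left as an exception that your own script can catch and turn into an email.

```python
def require_sessions(**sessions):
    """Raise if any named session ID is missing (None or an empty string)."""
    missing = [name for name, sid in sessions.items() if not sid]
    if missing:
        raise RuntimeError("No PMA session for: " + ", ".join(sorted(missing)))
    return sessions

# Usage with the two sessions from the block above (commented out here,
# because it needs a running PMA.start and PMA.core):
# require_sessions(source=sourceSession, target=targetSession)
```

Failing early like this keeps the rest of an overnight script from running against a server that was never reached.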

After confirming the connection, you should also check both your source folder on your local hard disk (this can be the folder where your scanner deposits new WSIs) and the target folder on your PMA.core server.

Since you’re handling two PMA.core instances (PMA.start utilizes a limited version of PMA.core, too), it’s a good idea to explicitly mention the PMA.core sessionID values as optional arguments in your code (we typically don’t do this when only interacting with a single instance; remember: good programmers are not only lazy but also know when to be):

core.get_directories("C:/", sessionID = sourceSession)
core.get_slides("C:/wsi", sessionID = sourceSession)
core.get_directories("_sys_ref/", sessionID = targetSession)
core.get_slides("_sys_ref/experiment", sessionID = targetSession)

Once you’re assured that your source data is available and your target folder is ready (and doesn’t have the data you’re looking for yet), it’s time to do the actual upload:

core.upload("C:/wsi/OS-3.ndpi", "_sys_ref/experiment", targetSession)

This call is spectacularly unimpressive. But behind the scenes, it figures out which files belong to the slide, and transfers them one by one to your target destination. Large files are automatically split into smaller blocks.

And of course this can be made a lot more complex. This is the step where the magic happens: you can build a loop around all the slides found in the source folder, and you can add checks to make sure the target folder is empty to begin with (or at least doesn’t contain the slide to be transferred yet).
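Such a loop could be sketched as follows. The helper names plan_uploads and upload_missing are hypothetical, and matching slides by lower-cased file name is an assumption made for illustration; the actual upload is passed in as a callable so the logic can be tried out without a server.

```python
import posixpath

def plan_uploads(source_slides, target_slides):
    """Return the source slides whose file name is not yet in the target folder."""
    # assumption: two slides count as "the same" when their file names match,
    # ignoring case; PMA paths use forward slashes, hence posixpath
    present = {posixpath.basename(s).lower() for s in target_slides}
    return [s for s in source_slides if posixpath.basename(s).lower() not in present]

def upload_missing(source_slides, target_slides, target_folder, upload):
    """Call upload(slide, target_folder) for each missing slide; return what was sent."""
    todo = plan_uploads(source_slides, target_slides)
    for slide in todo:
        upload(slide, target_folder)
    return todo
```

With the sessions from earlier, the pieces could be wired together as upload_missing(core.get_slides("C:/wsi", sessionID = sourceSession), core.get_slides("_sys_ref/experiment", sessionID = targetSession), "_sys_ref/experiment", lambda s, f: core.upload(s, f, targetSession)).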

Regardless, after the transfer is complete, you want to verify that everything went right.

Your instinct probably tells you to compare the ImageInfo objects on both sides:

However, this is not a good idea, as the ImageInfo will actually differ, as it contains a couple of location-dependent and slide-specific criteria as well. Comparing the information then just becomes confusing.

Rather, we recommend that after transferring a slide, you merely pull up the fingerprint for each slide:

core.get_fingerprint('_sys_ref/experiment/OS-3.ndpi', sessionID = targetSession)
core.get_fingerprint('C:/wsi/OS-3.ndpi', sessionID = sourceSession)

Comparing these two strings is a lot easier, whether through visual (interactive) or automated detection:
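For the automated variant, a small helper along these lines may be enough. The name find_fingerprint_mismatches is hypothetical; the two fingerprint lookups are passed in as callables, so nothing here is tied to a particular server.

```python
def find_fingerprint_mismatches(pairs, source_fingerprint, target_fingerprint):
    """Return the (source_path, target_path) pairs whose fingerprints differ.

    pairs: iterable of (source_path, target_path) tuples
    source_fingerprint / target_fingerprint: callables mapping a path to a string
    """
    mismatches = []
    for source_path, target_path in pairs:
        if source_fingerprint(source_path) != target_fingerprint(target_path):
            mismatches.append((source_path, target_path))
    return mismatches
```

In practice the callables would wrap the calls shown above, for example lambda p: core.get_fingerprint(p, sessionID = sourceSession) for the source side. An empty result means every transferred slide checked out.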

Which way do I take?

Due to their size and file structure, whole slide images are complicated data-beasts. ~~Wrestling~~ Managing them takes some time and practice.

The Pathomation software platform for digital pathology and virtual microscopy recognizes that both simple and complex slide transfer scenarios emerge from daily practice. Therefore, we offer different tools and routes to efficiently transfer slides, depending on which category of user you fall into and the scenario you wish to implement.

June 21, 2021 (updated June 22, 2021)

Annotations et al

On-slide annotations

Do you recognize the above image? It’s a rendering of the mitotic figures and algorithmic classifications as determined by Bertram et al. in their 2019 paper.

Digital pathology and virtual microscopy are concerned with slides (duh), but those are only part of the equation. External data as well as on-slide annotations are an integral part of the package, and any self-respecting software in this space supports various flavors of annotations. The various components in the Pathomation platform are no exception.

First some terminology: we use the term on-slide annotations for geometric shapes and figures presented on top of virtual slide pixels, to distinguish them from (text-based) slide metadata. The latter is data attached to a slide as a whole, not directly associated with a specific region on the slide. We discussed how Pathomation handles various flavors of slide metadata in an earlier blog post.

Creating on-slide annotations in PMA.studio

With the forthcoming release of PMA.studio 2.0, you can create on-slide annotations interactively.

PMA.studio offers a ribbon tab for this purpose. The first group of buttons lets you control what exactly is visible: the annotations themselves can be toggled, and you can choose whether to include the annotation labels or not.

https://realdata.pathomation.com/wp-content/uploads/2021/06/blog-post-42_20.png

The Style group in the ribbon is used to change the presentation of annotations. Annotations can have an edge color and a fill color, which can be set independently of each other. Filled shapes can have a transparency attribute, too.

Each annotation can be associated with a class and a description attribute. An example would be a set of polygon shapes that each indicate necrosis and therefore have the “necrosis” class attribute, while individual shapes would be referred to as “necrosis-region-1”, “necrosis-region-2” etcetera.

Once you’ve made the annotations, you can follow-up on them through the annotations panel. Here, annotations are grouped per class. You can also use this panel to filter annotations per user.

PMA.studio’s ribbon can be customized. This means that custom annotations are possible, too. You can use this feature to implement protocols.

In the example below, we’ve used XML to define pre-sets for pathologists to indicate various types of TLS regions:

https://realdata.pathomation.com/wp-content/uploads/2021/06/blog-post-42_40.png

Behind the scenes sits PMA.core

Let’s spend some time on how it all works behind the scenes.

When you make annotations in PMA.studio, you interact with a PMA.UI viewport. Once you decide to save your annotations, it’s the PMA.UI viewport that sends your annotations to PMA.core, where they are saved in the back-end database.

The format in which annotations are saved is Well-Known Text (WKT).
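To give an idea of what such a WKT string looks like, here is a small sketch that writes an axis-aligned rectangle as a WKT POLYGON. The function name rectangle_to_wkt is hypothetical, and treating the numbers as pixel coordinates is an assumption; the exact strings PMA.core stores may differ in spacing or precision.

```python
def rectangle_to_wkt(x, y, width, height):
    """Return a closed WKT POLYGON for an axis-aligned rectangle."""
    # WKT polygons repeat the first point at the end to close the ring
    corners = [(x, y), (x + width, y), (x + width, y + height), (x, y + height), (x, y)]
    return "POLYGON((" + ", ".join(f"{px} {py}" for px, py in corners) + "))"

print(rectangle_to_wkt(100, 200, 50, 25))
# prints: POLYGON((100 200, 150 200, 150 225, 100 225, 100 200))
```

Because WKT is plain text, annotations in this form are easy to log, diff, and exchange with other geometry-aware tools.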

PMA.core has API calls to work with annotations. PMA.UI makes heavy use of these.

External annotations

Because we totally understand that PMA.studio may not be your first environment of choice to make your on-slide annotations with (though we think it’s a really good one!), PMA.core supports various other types of on-slide annotations, too.

We distinguish between “native” and “third-party” annotations.

Native annotations are shapes and forms included in the original vendor’s file format. Several vendors provide the option of making on-slide annotations in their own native viewers. Examples include 3DHistech and Aperio. If you have a 3DHistech MRXS file that has overlaying annotations created by Case Center or the Pannoramic viewer, PMA.core will render them accordingly. The same goes for Aperio SVS files: just make sure you put the .xml file from ImageScope next to the corresponding .svs file.

Other vendors support annotations, too. If you find yourself with a vendor-specific annotation file format, do tell us, and we’ll add it to our next version of PMA.core.

Third-party annotations differ from “native” annotations. Like PMA.core’s own (WKT-encoded) annotations, they are created in software that does not come from the scanner vendor. Typically we’re talking about image analysis software. The three big names out there are Definiens, Visiopharm, and Indica Labs HALO. Each of these environments is supported.

Nothing technically restricts us to the flavors of image analysis mentioned above. So if you have a different environment from the ones listed, let us know.

Transient annotations through PMA.live

Two products of Pathomation support real-time conferencing: PMA.studio and PMA.control. While in a conference, it is possible for participants to make live annotations that only exist for the duration of the conference.

Transient annotations in PMA.studio and PMA.control (through a third component referred to as PMA.live) can perhaps best be compared to the annotation toolbar that appears while giving a PowerPoint presentation: several tools are made available to temporarily highlight specific features, but these don’t become part of the original presentation.

Bringing it all together

As the “middleware for digital pathology and virtual microscopy” company, we take our job seriously. Therefore, apart from managing slides, we also allow our tile server PMA.core to be used to organize graphical annotations. The Pathomation platform offers many ways to organize

Pathomation can ingest both native and third-party annotations. The difference is that we consider native annotations to be annotations made with the original manufacturer’s (viewer) software, while third-party annotations are annotations created by independent vendors like Indica Labs or Visiopharm.

We also allow people to make annotations on top of slides using only components within the Pathomation platform. You don’t need external software to get started with annotating your slides; you can use our own PMA.studio, or even couple OpenCV output back to PMA.core through our API.

There’s more to say about annotations. In upcoming articles, we plan to show you how you can manage heterogeneous annotation data from many sources, as well as how to use the back-end API directly to feed annotations back to PMA.core.

Want to see us work out a specific use case for your annotation workflows? Let us know.

June 4, 2021

Research publications in digital pathology

At Pathomation, we’re big fans of XKCD. And Randall Munroe seems to have hit a particular sore point with scientists recently, if the follow-up article in the Atlantic is to be believed.

We’re not endorsing the Atlantic here. It’s easy to critique the team if you’re a bystander and not actually playing the game.

All the same, we conducted our own small study and found that nobody had created an adaptation for the field of digital pathology yet. So we made our own.

The following data is therefore completely made up. We post it in the public domain, free for all to share, and hope that it will benefit humanity as a whole:

Read the original version of the cartoon at https://www.xkcd.com/2456.

April 19, 2021 (updated April 21, 2021)

Fingerprinting applications

A naïve way to detect duplicate data

We often copy data for a variety of reasons. While testing a new program, we can copy data any number of times so the software is able to work on a dataset instead of a single datapoint. Temporary data unfortunately is all too often forgotten about, lingers around, and unnecessarily clutters our hard disks.

There are a number of ways to solve this problem. With Python, we can create a script that retrieves all slides, inspects the file size of each slide, and reports which two slides have the same size.

Using our PMA.python SDK, the code looks like this:

from pma_python import core
core.connect()
all_slides = core.get_slides("C:/", recursive = True)
def get_slide_size(slide):
    info = core.get_slide_info(slide)
    return info["PhysicalSize"]

all_sizes = {}
for slide in all_slides:
    fp = get_slide_size(slide)
    if "d" in fp.keys():
        fp = fp["d"]
        # print (slide, fp)
        if not fp in all_sizes.keys():
            all_sizes[fp] = []
        all_sizes[fp].append(slide)
for (k, v) in all_sizes.items():
    if len(v) > 1:
        fn = v[0]
        if not (".png" in fn.lower() or ".jpg" in fn.lower()):
            print(k, len(v))
            print(v)

But wait, what if two different slides are the same size? Whole slide images are typically 100s of megabytes in size. Based on this characteristic, you could assume that it’s unlikely that two slides result in identical file sizes. But macroscopic images are important in pathology, too, and then we’re talking about file sizes of only a couple of megabytes at most. Now think of a scenario where tens of 1000s of cases are treated annually, with multiple macroscopic photos being taken of each resection piece… Suddenly the chance of two files having the same size becomes very real.
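The intuition can be made concrete with the classic birthday-problem calculation. The sketch below assumes, purely for illustration, that file sizes are spread uniformly over a fixed number of possible values; real size distributions are more clustered, which only makes collisions more likely. The function name collision_probability is hypothetical.

```python
import math

def collision_probability(n_files, n_possible_sizes):
    """Chance that at least two of n_files share a size, for uniformly spread sizes."""
    if n_files > n_possible_sizes:
        return 1.0  # pigeonhole principle: a collision is unavoidable
    # P(all distinct) is the product of (1 - i/k) for i = 0..n-1; sum logs for stability
    log_distinct = sum(math.log1p(-i / n_possible_sizes) for i in range(n_files))
    return 1.0 - math.exp(log_distinct)

# Made-up example: 30,000 photos whose sizes fall anywhere in a 3,000,000-byte range
print(collision_probability(30000, 3000000))
```

With these made-up numbers the probability is indistinguishable from 1, even though any single pair of photos is very unlikely to match.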

There’s a second, albeit somewhat more hypothetical, reason why slide size is a poor indicator here. WSI data could be stored in a container format, and the container format can have certain limitations. We observed e.g. that our PMA.start installation package has not changed in size for the last 7 releases or so. But of course our code did change. So, empirically, file size is not a good discriminant for executable files. We therefore feel that we cannot assume it is a good discriminant for image file formats either. Since re-scans are a specific concern with microscopy and WSI data, something better is needed than just the file size.

Introducing fingerprinting

We can think of a way to unambiguously distinguish slides from one another by combining a number of characteristics into a digital fingerprint. These would include:

Filesize (we didn’t say this was a bad one; just an insufficient one)
Pixel size
Pixels per micron
Number of channels
Number of z-stack layers

If we had infinite real-time computing power, we could think of more:

Histogram https://en.wikipedia.org/wiki/Color_histogram
A CRC32 checksum of the slide’s thumbnail https://en.wikipedia.org/wiki/Cyclic_redundancy_check
An SHA-2 checksum on all physical slide content https://www.php.net/manual/en/function.hash.php

For practicality, we define a slide’s fingerprint in the Pathomation platform as a combined hash https://en.wikipedia.org/wiki/Hash_function of physical file size, as well as most of the parameters returned through the GetImageInfo method.
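The general idea of a combined hash can be illustrated with a toy example. This is explicitly not the algorithm used by PMA.core: the choice of SHA-256, the JSON serialization, and the property names in the usage line are all assumptions made for the sake of the sketch.

```python
import hashlib
import json

def toy_fingerprint(file_size, image_info):
    """Hash the file size together with selected image properties (illustration only)."""
    payload = {"file_size": file_size, "info": image_info}
    # sort_keys makes the result independent of the order of the dictionary keys
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical property names, loosely following the list above
print(toy_fingerprint(123456789, {"width": 80000, "height": 60000, "channels": 3, "z_layers": 1}))
```

Two slides only get the same toy fingerprint when every ingredient matches, which is what makes a combined hash a stronger discriminant than file size alone.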

We also consider this fingerprint method to be essential, and so for stability, it is incorporated at the level of PMA.core (PMA.start) itself rather than at SDK level, so it can be transferred across programming boundaries. A fingerprint for slide [foo] requested through PMA.java yields the same results as when requested through PMA.php or PMA.python.

Slide integrity

The fingerprint method is a good way to confirm the integrity of a slide itself. When a file is not what it pretends to be, the fingerprint cannot be calculated, and an error follows.

Note that the above would not be possible if we stuck to conventional CRC-like checks, since those don’t take into account the nature of a slide. Of course, you can do a CRC check on any file regardless of whether it actually is a slide or not.

Applications

We recently introduced PMA.transfer. Have you ever been frustrated by people sending you individual VSI or MRXS slides without anything else? Did you ever feel uneasy about just having transferred a gazillion bytes halfway across the world, without any reassurance of whether it actually worked? Then you should definitely have a look at PMA.transfer. It’s like FileZilla, but for slides. SlideZilla.

PMA.transfer uses fingerprinting to ensure data integrity in between transfers. Whether you’re moving slides from and to PMA.start, PMA.core, or My Pathomation, the same fingerprint calculation algorithm is used to compute a slide’s unique signature. This means that PMA.transfer can obtain the fingerprint of a source and a target instance and simply compare one with another to see that they’re identical.

Another application is found in our upcoming PMA.studio product. Of course, the actual fingerprint of a slide is shown in the slide info panel. But a string like “” is not saying a whole lot and is for purely informative purposes at best.

PMA.studio can also be used to create annotations on slides. When you make an annotation, it is stored in our back end with a reference to the original slide, as well as the slide’s fingerprint.

This serves two purposes:

After moving a slide to a new folder path or even physical location, you can still retrieve its annotations.
When you have two identical copies of slides, each annotated separately, you can use fingerprinting to combine the annotations from both in a single view.

The possibility to combine annotations from identical slides stored in different locations in a single view offers opportunities for blinded studies and validation exercises. Inter- and even intra-observer variability can be measured this way, too.

Retrieving annotations by fingerprint is not available by default; you need to invoke this with an explicit button in the ribbon. It’s a performance thing.

Last but not least, fingerprinted annotations can be used to keep track of annotations during migration processes. As your applications for digital pathology increase, you will occasionally restructure your folder structures, or perhaps move to an entire new storage device altogether.

Finding duplicates

Back to our original question: imagine that you’ve been managing a whole slide repository for a while, and as careful as you’ve been, you suspect that you now have ended up with a number of copies of a variety of slides in different locations. You know: you copy a slide to test something, pinky-promise yourself that you’ll remove the slide again afterwards, that for real you’re really not going to forget this time… and then… you forget about it.

Thanks to the fingerprinting method and a few lines of Python, it is easy to trace duplicates however.

Here’s the basic code to build a dictionary that has all the possible fingerprints as keys. Each entry then contains a list of the locations of all slides that share that particular fingerprint:

# slides is the list of slide paths to inspect, e.g. core.get_slides("C:/", recursive = True)
slides_by_fingerprint = {}
for slide in slides:
    fp = core.get_fingerprint(slide)
    if not fp in slides_by_fingerprint:
        slides_by_fingerprint[fp] = [slide]
    else:
        slides_by_fingerprint[fp].append(slide)

When a slide is unique, the list stored under its fingerprint has only one entry. Otherwise, it has two or more entries. So the duplicated slides are detected and flagged as follows:

for fprint in slides_by_fingerprint:
    if len(slides_by_fingerprint[fprint]) > 1:
        print(slides_by_fingerprint[fprint][0], "is copied", len(slides_by_fingerprint[fprint]), "times")

If you want, you can further automate the pruning and the deletion of these duplicates. Sometimes it’s easy; sometimes it’s not. You need to make sure that you have the original copy in its intended place. And in some cases, you may actually want to keep at least a second copy of a slide around, as one may be transient in a clinical setting, whereas its copy may have just been added to a reference repository to teach students and staff.
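As a cautious first step towards such pruning, the sketch below only lists removal candidates and deletes nothing. The function name removal_candidates and the idea of protecting copies by path prefix are assumptions for illustration; your own rules for what counts as the original may be quite different.

```python
def removal_candidates(slides_by_fingerprint, keep_prefixes):
    """For each duplicated fingerprint, list the copies outside the protected folders."""
    candidates = {}
    for fp, paths in slides_by_fingerprint.items():
        if len(paths) < 2:
            continue  # unique slide, nothing to prune
        keepers = [p for p in paths if p.startswith(tuple(keep_prefixes))]
        if not keepers:
            # no copy sits in a protected folder: keep the first one found
            keepers = paths[:1]
        extra = [p for p in paths if p not in keepers]
        if extra:
            candidates[fp] = extra
    return candidates
```

Reviewing this list by hand before removing anything is advisable, precisely because some second copies are intentional.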

Coming full circle

Fingerprinting serves a triple function:

Detect whether a dataset is a real slide or not
Guard data integrity when transferring from one medium to another
Trace slides and associated content through a complex storage hierarchy

Fingerprinting applies the concept of hash-functions to slides. Like everything in the Pathomation platform, the slide itself is the key unit to interact with. There is only one fingerprint for a slide, whether it consists of a single, or multiple files. Consequently, you can only obtain a fingerprint for a slide. If the file is somehow corrupt or the file format isn’t recognized by PMA.core, you’re not going to get a fingerprint from it. Last but not least, fingerprints are invariant across storage media and instances of PMA.core (PMA.start), making it a useful feature for slide tracking.

March 25, 2021

Random sampling and ground truth annotations

Challenges and opportunities

It’s been well established that whole slide images are big. We wrote a tutorial on this ourselves.

This poses challenges for both computers and analysts alike:

Consider the pathologist that must identify x number of cells of a certain classification. How many should he aim for? How big should his field of view be to select from?…

Automation seems to be a solution, but here too limits crop up. Professional image analysis software is expensive, people need to be trained, and there are only so many pixels any GPU can process in any given time.

The solution comes in treating the image analysis process as a multi-phased project.

In the first phase, select fields of view and regions of interest can be prepared as annotations on top of a whole slide image. This pre-selection can be as simple as an automated algorithm that identifies entropy in a pixel environment, or a pathologist who carefully picks and curates regions of interest.

In other words: statistical (random) sampling is the name of the game. And our very own PMA.studio is a great solution to make these ground-truth annotations in.
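The simplest form of random sampling can be sketched in a few lines. The function name sample_fields_of_view is hypothetical; the sketch assumes square fields of view expressed in full-resolution pixel coordinates and makes no attempt to avoid overlap or empty glass.

```python
import random

def sample_fields_of_view(slide_width, slide_height, fov_size, count, seed=None):
    """Return `count` random (x, y, w, h) squares that lie fully inside the slide."""
    if fov_size > slide_width or fov_size > slide_height:
        raise ValueError("field of view is larger than the slide")
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return [(rng.randint(0, slide_width - fov_size),
             rng.randint(0, slide_height - fov_size),
             fov_size, fov_size) for _ in range(count)]

print(sample_fields_of_view(80000, 60000, 2048, 3, seed=42))
```

Recording the seed alongside the results lets someone else regenerate exactly the same fields of view later.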

Annotations in PMA.core

Whether via scripting or manual curation, annotations end up stored in PMA.core.

Internally, we store the annotations as Well-Known Text (WKT) strings, but they can be converted to several other file formats, too, including Excel CSV, Visiopharm MLD, Leica/Aperio XML, or Halo Annotation XML.

We provide several other resources regarding annotations that can provide more background:

When your annotations are part of a random sampling exercise, chances are that you’re going to want to do more downstream operations with them.

In this article we will therefore:

Use Jupyter and pma_python to interact with PMA.core
Identify geometric (polygon) annotations and examine their properties
Convert annotations to rectangular snapshots at high-resolution
Save these extracted annotations as new separate high-resolution tiled TIFF slides

Core::get_annotations()

The Core module of our SDK contains a get_annotations() function already. Let’s start by examining what we get back when we invoke it on our sample slide:

from pma_python import core
core.connect("https://srv/pma.core/", "usr", "***")
slideref = "/rootdir/slide.mrxs"
annotations = core.get_annotations(slideref)

Now we can print the first element and see what it contains. We use the pprint library to make our output look pretty:

We can immediately see the audit trail, and beyond that the most obvious element is the Geometry. As you may have deduced, the geometry defines all points that make up an annotation. In our case our polygon is merely a rectangle, so we find 5 (x, y) coordinates, with the fifth one being the same as the first. The format can be generalized and written out in a symbolic notation that looks like this:

POLYGON((x1 y1,x2 y2, x3 y3, x4 y4,…, xn yn,…, x1 y1))

If we want to convert these annotations to snapshots, we need to determine the x y coordinates of the points that define a rectangle that contains all points of our original polygon.

In other words, for each of the x y pairs of coordinates given, we find the minimum and maximum x and y values. We can then use these to compute the width and height of the resulting (high resolution) snapshot.

Luckily this is easier to do than finding the largest rectangle within a polygon!

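import sys

def annotation_to_rect(ann):
    # ann is WKT such as "POLYGON((x1 y1, x2 y2, ..., x1 y1))" (simple polygon, no holes);
    # strip the wrapper, if present, so that only the "x y" pairs remain
    if "((" in ann:
        ann = ann[ann.index("((") + 2:ann.rindex("))")]
    points = ann.split(",")
    min_x = sys.maxsize
    max_x = sys.maxsize * -1
    min_y = sys.maxsize
    max_y = sys.maxsize * -1
    for point in points:
        # split() without arguments tolerates spaces after the commas
        (x, y) = point.split()
        x = float(x)
        y = float(y)
        if x > max_x:
            max_x = x
        if x < min_x:
            min_x = x
        if y > max_y:
            max_y = y
        if y < min_y:
            min_y = y
    w = max_x - min_x
    h = max_y - min_y
    return (min_x, min_y, max_x, max_y, w, h)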

And we can use this method to get the coordinates of the first annotation.

Core::get_region()

In earlier tutorials, we mostly stuck with extracting tiles from PMA.core. But if you want to extract arbitrary regions, you can use core::get_region() instead. The call uses the same coordinate system as used to store annotations.

Our next step then is to use these coordinates and parameters as arguments for the get_region() call.

region = core.get_region(slideref, min_x, min_y, w, h)

Without any additional parameters, get_region() automatically retrieves pixels at the deepest zoom level. While this is what you want, it is quite possible that your environment is protected against such (perceived) over-zealous behavior and responds with an error:

DoS (denial-of-service) attacks are a reasonable concern of course.

The solution then is to split up the coordinates in 4 quadrants. Like this:

region11 = core.get_region(slideref, min_x, min_y, w / 2, h / 2)
region12 = core.get_region(slideref, min_x + w/2, min_y, w / 2, h / 2)
region21 = core.get_region(slideref, min_x, min_y + h/2, w / 2, h / 2)
region22 = core.get_region(slideref, min_x + w/2, min_y + h/2, w / 2, h / 2)

Once the four quadrants are loaded, a new PIL image can be constructed, and the 4 quadrants can be pasted into the respective corners.

import math
from PIL import Image

region_combo = Image.new('RGB', (int(math.ceil(w)), int(math.ceil(h))))
region_combo.paste(region11)
region_combo.paste(region12, (int(math.floor(w/2)), 0))
region_combo.paste(region21, (0, int(math.floor(h/2))))
region_combo.paste(region22, (int(math.floor(w/2)), int(math.floor(h/2))))
region_combo.save("region.jpg", "JPEG", quality = 95, optimize = True, progressive = True)

By working with quadrants, you’re effectively creating a de facto 2 x 2 grid. If this still doesn’t work for you, you can create 3 x 3 grids, or go even more refined.
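The quadrant approach generalizes to any grid size. The sketch below (the helper name grid_cells is hypothetical) only computes the coordinates; each cell still has to be fetched with get_region() and pasted into the combined image.

```python
import math

def grid_cells(min_x, min_y, w, h, rows, cols):
    """Split a region into rows x cols cells.

    Yields (x, y, cell_w, cell_h, paste_x, paste_y): the first four values are
    meant for get_region(), the last two are the pixel offsets for Image.paste().
    """
    cell_w, cell_h = w / cols, h / rows
    for r in range(rows):
        for c in range(cols):
            yield (min_x + c * cell_w, min_y + r * cell_h, cell_w, cell_h,
                   int(math.floor(c * cell_w)), int(math.floor(r * cell_h)))

# Possible use with the objects from the snippets above (needs a live server):
# for x, y, cw, ch, px, py in grid_cells(min_x, min_y, w, h, 3, 3):
#     region_combo.paste(core.get_region(slideref, x, y, cw, ch), (px, py))
```

For rows = cols = 2 this yields the same four requests and paste offsets as the quadrant code above, in the same order.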

Pyramidal TIFF

What’s missing? Say that your resulting extracted high-resolution snapshot is 8K x 5K pixels in size. You can work with that kind of image in some programs, but it’s not ideal. And your resulting snapshot can be even larger than that.

The solution is to not save your PIL image in JPEG format. Instead, save it as a pyramidal (tiled) TIFF. Some environments, like ASAP, even require this kind of input format.

After installing the gdal library, you can use the following method to convert any PIL image object into a pyramidal (tiled) TIFF:

import math
import numpy as np
from osgeo import gdal
from tqdm import tqdm

def PILToTiff(pilref, output_file= "pil.tif", target_quality = 80, downscale_factor = 1):
    tileSize = 512    
    tiff_drv = gdal.GetDriverByName("GTiff")
    output_filename =  output_file
    (w, h) = pilref.size
    ds = tiff_drv.Create(
        output_filename,  w,  h,  3,
        options=['BIGTIFF=YES',
            'COMPRESS=JPEG', 'TILED=YES', 'BLOCKXSIZE=' + str(tileSize), 'BLOCKYSIZE=' + str(tileSize),
            'JPEG_QUALITY=' + str(target_quality), 'PHOTOMETRIC=RGB'
        ])

    tilesX = int(math.ceil(w / tileSize))
    tilesY = int(math.ceil(h / tileSize))
    totalTiles = tilesX * tilesY
    pbar = tqdm(total=totalTiles)
    for x in range(tilesX):
        for y in range(tilesY):
            pbar.update()

            # pixel bounds of this tile, clipped at the right and bottom image edges
            x1 = x * tileSize
            y1 = y * tileSize
            x2 = min((x+1)*tileSize, w)
            y2 = min((y+1)*tileSize, h)
            
            tile = pilref.crop((x1, y1, x2, y2))
            arr = np.array(tile, np.uint8)

            # calculate startx starty pixel coordinates based on tile indexes (x,y)
            sx = x * tileSize
            sy = y * tileSize

            ds.GetRasterBand(1).WriteArray(arr[..., 0], sx, sy)
            ds.GetRasterBand(2).WriteArray(arr[..., 1], sx, sy)
            ds.GetRasterBand(3).WriteArray(arr[..., 2], sx, sy)

    pbar.close()
    ds.BuildOverviews('average', [pow(2, l) for l in range(1, 5)])
    ds = None
    print("Done; see result in ", output_filename)

When we now systematically want to convert all annotations from a set of slides into separate high-resolution pyramidal TIFF files, it’s just a matter of putting together the functions we’ve developed in this tutorial:

for slide in core.get_slides("/root_dir/path/…"):
    annotations = core.get_annotations(slide)
    ann_idx = 0
    for annotation in annotations:
        ann_img = AnnotationToPIL(slide, ann_idx)
        tif_file = "c:/output/" + os.path.basename(slide).replace(".", "_") + "_" + str(ann_idx) + ".tif"
        PILToTiff(ann_img, tif_file)
        ann_idx = ann_idx + 1

The result can be seen in PMA.start in the c:\output folder afterwards:

Ground truth

Image Analysis (IA) comes in many shapes: Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL)… What they all need: curated data to train on. Sometimes it’s as simple as feeding them all the tiles contained in a whole slide image one by one, and we’ve done a couple of examples of this already on our blog.

At times however, supervised machine learning is the name of the game. For that, a pathologist may pre-select particularly interesting-looking areas. In other cases, statistical random sampling may help to make an existing algorithm more robust or fine-tune it.

The resulting regions of interest (whether manually annotated through PMA.studio or automated via an environment like OpenCV) can in turn be exported again in individual high-resolution images that represent true subsets of slides.

In this article, we showed how such a complete workflow can be facilitated by our very own PMA.python SDK. The resulting dataset is at the same high resolution as the original, and the quality of the images is just as good as what you started with.

Of course you may not be using Python, or you may be looking for something just a bit different. Need help? Do drop us a note. We love hearing about your use case, and thinking about how we can help solve problems.

October 28, 2020 (updated October 30, 2020)

Customizing the next generation of slide viewer software

It’s coming, it’s coming, it’s coming, it’s coming…

And we’re really excited about it!

We’re in the process of wrapping up PMA.studio. This is going to be our flagship product for the next three years. It’s got everything a microscopy enthusiast needs, ranging from powerful viewport manipulation options, over annotations, to live conferencing. Remember our earlier article about how PMA.core handles (external) slide meta-data? All of that and more is in there, too.

In this blog post, we want to give developers as well as customers in our OEM and reseller program a sneak peek at PMA.studio and specifically focus on the opportunity for custom add-on development and white labeling.

PMA.studio is currently in testing phase for various use cases already, ranging from comparative validation studies to enterprise-wide deployment as a central histological information cockpit. If you are interested in joining the effort and helping us track down last minute bugs, do shoot us an email and we can see if you’re a good fit for our beta-program.

You can find more information about PMA.studio itself in the landing page that we’re building for the software at https://www.pathomation.com/pma.studio.

Administrative console

PMA.studio has a separate interface for administrative tasks.

In the administrative console, you can register any number of PMA.core instances to be used by PMA.studio to retrieve slides from. Once in use, for each registered PMA.core instance, you can see what users have sought access to it.

There’s a general dialog where you can set system-wide settings, like the address to contact in case of trouble, or the company / institute logo to use. This is where white labeling starts.

Like for PMA.view 1.x, the administrative console in PMA.studio can be used to customize the ribbon at the top of the interface. The syntax is XML. We don’t have any formal definition of the format, and typically work with customers to get the result they want.

You can play around with the XML code yourself if you want. The default toolbar provides enough example code to allow simple re-arrangements of buttons and move features between different tabs.

If you run into trouble, there’s always the option to restore the code back to a default toolbar.

Custom annotations

PMA.studio supports annotations. We have a separate tutorial on the subject as part of our beta-program.

There are situations where you don’t want users to go off and make their own annotations at will.

You may want your pathologists to annotate invasive tumor margins or necrotic areas for a study protocol. It’s useful then that people align, and everybody uses the same color scheme. So in addition to providing standard annotation tools, it’s possible to define buttons on your toolbars with pre-set attributes and parameters.

Here’s a great example of a project we participated in recently, for which participants had to indicate six different types of tissues / cells:

The underlying XML for this extra ribbon tab is as follows:


<Tab label="NKI" hint="custom ribbon tab" name="nki_annotations" enabled="true" visible="true">
    <Ribbon label="Total TLS" width="30%">
        <Tool type="buttons" size="large">
            <Command name="preset-peri_mat" icon="an_m_peri_tls.png" hint="Total # mature TLS (peritumoral)" label="Peritumoral, mature"
                pma-classification="peri mature"
                pma-shape="ClosedFreehand"
                pma-color="#ffff33"
            />
            <Command name="preset-intra_mat" icon="an_m_intra_tls.png" hint="Total # mature TLS (intratumoral)" label="Intratumoral, mature"
                pma-classification="Intra mature"
                pma-shape="ClosedFreehand"
                pma-color="#cccc00"
            />
            <Command name="preset-peri_immat" icon="an_im_peri_tls.png" hint="Total # immature TLS (peritumoral)" label="Peritumoral, immature"
                pma-classification="Peri immature"
                pma-shape="ClosedFreehand"
                pma-color="#999900"
            />
            <Command name="preset-intra_immat" icon="an_im_intra_tls.png" hint="Total # immature TLS (intratumoral)" label="Intratumoral, immature"
                pma-classification="Intra immature"
                pma-shape="ClosedFreehand"
                pma-color="#ffff66"
            />
        </Tool>
    </Ribbon>
    <Ribbon label="Pre-sets" width="30%">
        <Tool type="buttons" size="large">
            <Command name="preset-necrosis" icon="an_necrosis.png" hint="Tumor" label="% necrosis"
                pma-classification="Necrosis"
                pma-shape="ClosedFreehand"
                pma-color="#aa0000"/>
            <Command name="preset-fibrosis" icon="an_macro.png" hint="Tissue" label="% fibrosis"
                pma-classification="Fibrosis"
                pma-shape="ClosedFreehand"
                pma-color="#00aa00"/>
            <Command name="preset-via_tumor" icon="an_tumorcells.png" hint="Tissue" label="% viable tumor cells"
                pma-classification="Viable tumor cells"
                pma-shape="ClosedFreehand"
                pma-color="#000000"/>
        </Tool>
    </Ribbon>
</Tab>

The pay-off of this kind of configuration is two-fold. First, the protocols are consistently executed as intended. In addition, the training needed for new users to interact with the software is reduced because they only have a few options to choose from. Less can definitely be more in these scenarios.
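If you have many pre-sets to maintain, you may prefer to generate the Command elements from a list rather than type them by hand. The following is only a sketch using Python's standard xml.etree.ElementTree module; it mirrors the element and attribute names of the "Pre-sets" ribbon shown above, and since there is no formal definition of the format, treat the output as a starting point to paste into the admin console. The build_ribbon helper is our own illustrative name.

```python
import xml.etree.ElementTree as ET

# (name, icon, hint, label, classification, color), copied from the "Pre-sets" ribbon
PRESETS = [
    ("preset-necrosis", "an_necrosis.png", "Tumor", "% necrosis", "Necrosis", "#aa0000"),
    ("preset-fibrosis", "an_macro.png", "Tissue", "% fibrosis", "Fibrosis", "#00aa00"),
    ("preset-via_tumor", "an_tumorcells.png", "Tissue", "% viable tumor cells",
     "Viable tumor cells", "#000000"),
]

def build_ribbon(label, presets, shape="ClosedFreehand", width="30%"):
    """Build a <Ribbon> element holding one <Command> button per pre-set."""
    ribbon = ET.Element("Ribbon", {"label": label, "width": width})
    tool = ET.SubElement(ribbon, "Tool", {"type": "buttons", "size": "large"})
    for name, icon, hint, text, classification, color in presets:
        ET.SubElement(tool, "Command", {
            "name": name, "icon": icon, "hint": hint, "label": text,
            "pma-classification": classification,
            "pma-shape": shape,
            "pma-color": color,
        })
    return ribbon

xml_text = ET.tostring(build_ribbon("Pre-sets", PRESETS), encoding="unicode")
print(xml_text)
```

Keeping the pre-sets in one list also makes it easier to check that every classification gets its own color.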

Configuring the panel layout

PMA.studio organizes its content across different panels. The layout of PMA.studio is a lot more powerful and flexible than in PMA.view 1.x.

All panels can be turned on or off via the Configuration tab on the ribbon. In addition, the admin console can be used to pre-determine the layout and panel organization when people first log in.

In its simplest configuration, you’re navigating slides and looking at them one by one:

You can add various information panels as you see fit to get to something that looks more like this:

Like with the ribbon customization and pre-defined annotation controls, you can pre-set PMA.studio’s panel layout. Again, you’re looking at XML snippets.

The layout from the latter screenshot is then defined as follows:

Iframe panels

A special type of panel is the iframe-panel. With these panels, you can virtually load any website or page that you want within a pre-defined PMA.studio panel. You can define your own panel through the Layout configuration button on the ribbon:

A pop-up dialog appears and lets you specify the title of the panel, and the URL you want it to load.

In our example we took a promotional video for our own PMA.slidebox product (more on that later), but you can put just about anything in there.

The result looks like this:

You can pre-define these custom iframe-panels any way you want through the admin console.

<Component name="IFrame" label="Pathomation video" url="https://www.youtube.com/embed/VsiGz8ykuEo"/>

Respective parameters that indicate the current state of the active viewport in PMA.studio are passed along automatically, like this:

http://www.server.com/app/page.php?server=%PATH%TO%PMA%CORE%&path=%FULL%PATH%OF%SLIDE%&sessionId=%SESSION%ID%
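On the receiving end, the page loaded in the iframe panel only has to read these three query parameters. Here is a minimal sketch in Python using the standard urllib.parse module. The parameter names server, path and sessionId come from the URL above; the example values and the viewport_state helper are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def viewport_state(url):
    """Return the server, path and sessionId parameters of an iframe-panel URL."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; take the first value, or None if absent
    return {key: query.get(key, [None])[0] for key in ("server", "path", "sessionId")}

# Hypothetical values, URL-encoded the way a browser would send them
example = ("http://www.server.com/app/page.php"
           "?server=https%3A%2F%2Fpma.example.org%2Fcore%2F"
           "&path=root%2Ffolder%2Fslide.mrxs"
           "&sessionId=abc123")

state = viewport_state(example)
print(state)
```

With these three values, your own page can talk to the same PMA.core instance about the slide that is currently shown in the active viewport.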

Last but not least: a product pitch

We wrote earlier about PMA.slidebox.

This week, we’ve officially started promoting this product. We updated the landing page for PMA.slidebox, and the first demonstration videos are available through our YouTube channel.

Like with everything we do, we’re very excited about being able to package a ton of useful functionality in a nevertheless compact product offering. Do have a look at our demo portal for PMA.slidebox and let us know if this is something you are interested in, too!

October 8, 2020 (updated October 30, 2020)

How to handle slide meta-data?

What is slide meta-data?

In order to know how to interact with something, we first need to know what it is. Or, at least, we need you to know what it means for us. Just so we’re clear what we’re talking about.

For the sake of this article, we distinguish three different kinds:

Intrinsic meta-data: information stored within a slide’s file format. A trivial example is the slide’s pixel size, and a more advanced feature is the time it took the scanner to produce the slide.
User-captured meta-data (forms): Pathomation’s central tile server component, PMA.core, allows for the definition of table structures. Forms are a basic data structure in PMA.core that’s picked up again and used to capture and present user-attributed data in several other Pathomation platform components, including PMA.studio, PMA.slidebox, and PMA.control.
External data: if you have tons of external data in a separate repository and want access to it through the Pathomation software platform for digital pathology and virtual microscopy, you can link to it in real-time.

Below you can find an overview of what kind of meta-data is supported by what component and in what capacity:

	Slide info	Forms	External data
PMA.start	Read	Not supported	Not supported
PMA.core	Read	Define / read	Read
PMA.view	Read	Not supported	Not supported
PMA.studio	Read	Read / write	Read
PMA.slidebox	Read	Read / write	Read
PMA.control	Read	Read / write	Read

Slide info

Intrinsic slide meta-data can be shown in PMA.start by clicking on the filename in the viewport.

In the upcoming versions of both PMA.view and PMA.studio, we provide a separate info-panel for this.

There can be more slide info available than is actually shown in our various front-end interfaces. Some information is really specific, too specialized, or irrelevant even to show to most users. Other fields are scanner-specific and don’t make sense to include on a systematic basis.

At the API and SDK level, we provide dedicated SlideInfo calls that return full hierarchical dictionary structures that do give exhaustive slide information and can be consumed as you see fit for your specific application or workflow.
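A hierarchical dictionary is easy to consume, but for reporting you often want it as a flat list of fields. The sketch below shows one way to flatten such a structure. The sample dictionary is entirely hypothetical (real field names depend on the slide format and scanner), and flatten is our own helper, not an SDK call.

```python
def flatten(info, prefix=""):
    """Turn a nested dictionary into a flat {"Parent.Child": value} dictionary."""
    rows = {}
    for key, value in info.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            rows.update(flatten(value, name))   # descend into sub-dictionaries
        else:
            rows[name] = value
    return rows

# Hypothetical slide info; in practice this would come from a SlideInfo call
sample_info = {
    "Filename": "HE.mrxs",
    "Width": 100000,
    "Height": 200000,
    "Scanner": {"Vendor": "ExampleVendor", "ScanTimeSeconds": 312},
}

flat = flatten(sample_info)
for name, value in flat.items():
    print(name, "=", value)
```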

Data with a twist

Forms are defined in PMA.core through the form editor.

PMA.core’s forms support trivial datatypes like text or numbers (of course). We also offer scientifically relevant twists on traditional data capture. An example is that numerical data fields allow the option to be recorded as “below detectable limit”, since it’s very hard to prove that something is not present in a sample. Oftentimes, a numerical zero just means that our detection apparatus lacks the sensitivity to unmask the presence of a specific phenomenon.
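If you consume such form data in your own code, it pays to keep “below detectable limit” separate from a numerical zero. How PMA.core stores this internally is not covered here; the following is just a sketch of how a consuming application could model it, with a hypothetical Measurement class.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Measurement:
    """A numeric reading that is either a value or 'below detectable limit'."""
    value: Optional[float] = None
    below_detection_limit: bool = False

    def __post_init__(self):
        # Exactly one of the two states must apply
        if self.below_detection_limit == (self.value is not None):
            raise ValueError("provide either a value or below_detection_limit=True")

    def __str__(self):
        return "below detectable limit" if self.below_detection_limit else str(self.value)

zero = Measurement(0.0)
undetected = Measurement(below_detection_limit=True)
print(zero, "|", undetected, "|", zero == undetected)
```

The point of the separate flag is that averages and other statistics can then skip or specially treat undetected readings, instead of silently counting them as zero.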

What can you do in forms? Pretty advanced things. We can model the CAP-recommended cancer protocols as PMA.core forms.

PMA.core offers different ways to interact with forms, but it doesn’t provide a data entry module. That’s because it doesn’t really make sense to enter data at this level. Data entry is done in the context of an application or workflow. Our upcoming PMA.studio will support data entry, and PMA.control offers several interaction modes. Underneath, PMA.control interaction modes rely on data stored as PMA.core forms.

You can ask PMA.core to generate a spreadsheet template based on a select folder of slides and a particular already defined form.

At Pathomation, we pride ourselves on being a truly open platform, so PMA.core offers several data formats to export captured form data to, including CSV, XML, and ARFF.

Use cases for external data

Imagine that you organize a toxicology experiment with a rodent population. In a separate database, you’ve been keeping data about each specimen’s vital statistics, dietary and behavior observations, as well as their phenotypic expression and genotypic make-up. Observations happen daily; for a population of 250 animals, that’s 1750 new records per week.

You take weekly biopsies of the animals to monitor their response to a new drug you’re testing. The biopsies are prepared and stained in triplicate (good for a total of 750 new slides per week), and each slide has a barcode that can be used to trace back to the original individual animal. Your slide scanner takes care that the barcode is encoded in the slide’s filename.

Both the slide population and the separate database keep evolving. Replication between your database and a PMA.core form is considered at some point, but deemed inefficient and error-prone, because we are talking about experimental and evolving data structures here.

The solution is to define an external data source in PMA.core. This goes in two steps: first a connection string is defined to connect to the database server.

Next, any number of queries can be formulated against this external connection.

Data can be previewed within PMA.core, and subsequently is automatically propagated to other environments like PMA.studio.
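To make the idea concrete: the link between a slide and its external records is the barcode in the filename. The sketch below illustrates that lookup with an in-memory SQLite database. It does not show how PMA.core itself is configured; the table layout, the sample records and the assumption that the barcode is the part of the filename before the first underscore are all made up for the example.

```python
import os
import sqlite3

def barcode_from_filename(slide_path):
    """Assumption: the barcode is the filename part before the first underscore."""
    return os.path.basename(slide_path).split("_")[0]

# Stand-in for the separate experiment database (hypothetical schema and data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (barcode TEXT, day TEXT, weight_g REAL)")
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", [
    ("R0042", "2020-07-01", 251.0),
    ("R0042", "2020-07-02", 249.5),
    ("R0107", "2020-07-01", 263.2),
])

def observations_for(slide_path):
    """Look up all observations for the animal a slide belongs to."""
    barcode = barcode_from_filename(slide_path)
    cursor = conn.execute(
        "SELECT day, weight_g FROM observations WHERE barcode = ? ORDER BY day",
        (barcode,))
    return cursor.fetchall()

rows = observations_for("/experiments/tox/R0042_HE_1.mrxs")
print(rows)
```

Because the query runs at the moment you ask for it, newly added observations show up without any replication step.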

External data are everywhere. They can be in proprietary databases as in our example above, but they can also be in an (AP)LI(M)S, VNA, PACS, or EHR system. In all those cases, replication is hard and impractical at best, and can even lead to data inconsistencies and errors at worst.

The Pathomation platform for digital pathology and virtual microscopy allows for a more elegant solution.

July 8, 2020

What WSI data are REALLY made of

Image data types

Pathomation is concerned with any (imaging) data that is microscopy- or pathology-related. Much of the data is large: we talk about gigapixel data, or (more apt for microscopy) whole slide images (WSI).

Not all image data accessed via Pathomation need be large though:

Microscopic images can represent individual fields of view: specific areas of interest captured by a mounted camera for discussion or other ad hoc purposes.
Pathology starts from physical tissue. Therefore, it is often useful to photograph the obtained tissue. These are typically referred to as “macroscopic images”.

Oftentimes the above results can be stored in common (image) file formats: JPEG and TIFF are most often encountered. You can distribute these images via any medium. But if you have them side by side with your high-resolution images representing prepared slides, it’s reassuring to know that with Pathomation software you can organize these different slides under one umbrella.

Back to the really big data. Elsewhere on this blog, we have an article about the technical challenges when wanting to store an image that contains 100,000 x 200,000 pixels.

We send said article to (potential) customers regularly. Some are helped by it, some not. Because, understandably, many times you just want to know about your slides. When you book a flight, you don’t want to get an explanation about Newton’s or Bernoulli’s laws either… Just get me the tickets please.

So why did we write the article in the first place? Because at Pathomation, we pride ourselves on being format-agnostic. We don’t pick hardware vendors. Each scanner has its pros and cons. Each sets out to serve a specific market segment and works better with some tissues than others. It’s not our place to subjectively decide who’s better or worse (although we would appreciate better feedback from some with respect to a vendor’s specific file format).

So please do understand that mentioning specific vendors has nothing to do with any positive or negative endorsements for said vendor. We’re merely stating facts about how their techies decided, a long time ago, to organize their (giga)pixels.

Single or many files

As you already know from the previous article, virtual slides can consist of many different files. Within the files that represent MRXS slides, you find plenty of .dat files, for VSI slides you find .ets files (amongst others) etc.

Several vendors have adopted a single file format for their WSI data. These include Hamamatsu (NDPI), Aperio (SVS), Leica (SCN), and Zeiss (CZI and ZVI).

Other vendors that have adopted a multi-file approach include 3DHistech (MRXS), Olympus (VSI), and Motic (MDS).

The organization of data across multiple files for different vendors is not standardized. In the case of 3DHistech, individual files more or less represent different magnifications. Olympus’ file structure seems more organized around scanned regions of interest.

The single file formats can also be upgraded to multi-file formats. Hamamatsu scanners can create .ndpis files for fluorescent or z-stacked content. The “s” stands for “set”: the .ndpis file merely contains pointers to individual .ndpi files which then each contain the image for the particular layer or channel. You can open these .ndpi files by themselves by the way, but they’re only useful when you also correctly interpret their context from the .ndpis file.

Aperio SVS files can be accompanied by .xml files, which contain annotations. Ventana BIF files can be accompanied by .tifp and .bmp files.

As soon as you have more than 1 file involved, it becomes a multi-file file-format. The strict distinction is important, because some storage systems don’t support multi-file containers.

Examples

Let’s look at the 3DHistech line of scanners: In the screenshot below we scanned two slides: HE and PDL1. The software ends up creating “HE.mrxs” and “PDL1.mrxs” files, along with “HE” and “PDL1” subfolders.

Within the subfolders, a standard naming convention is used, so you won’t find any more “HE*” files in there.

The .mrxs files themselves are typically just there to allow third-parties to identify these different file-types. Case in point: the .mrxs file can be renamed into a .jpeg file, and then you can just view it as any other image (it’s a quick way for us, too, to display slide thumbnails).

Another example? You got it!

This is what the structure of a .VSI slide looks like:

Each scanned region translates into a “stack”; each stack can contain one or multiple frames.

A different approach, a more hierarchical structure.

For DICOM slides, all related files are grouped together. No subdirectory required:

The details of these don’t matter to the typical end-user, except when you want to copy / move / transfer slides to others. The basic principle here is to always make sure you copy not only the index .vsi / .mrxs / .whatever file, but also all the accompanying data files. Zipping them may help both you and your receiving party.
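For the 3DHistech case described earlier (an .mrxs file plus a subfolder with the same name), zipping a complete slide could look like the sketch below. It relies only on Python's standard library; zip_mrxs_slide is our own illustrative helper, and the demo uses placeholder files rather than real slide data. Other multi-file formats organize their data differently and would need their own logic.

```python
import os
import tempfile
import zipfile

def zip_mrxs_slide(mrxs_path, zip_path):
    """Zip an .mrxs index file together with its same-named data folder."""
    mrxs_path = os.path.abspath(mrxs_path)
    parent = os.path.dirname(mrxs_path)
    data_dir = os.path.splitext(mrxs_path)[0]      # ".../HE.mrxs" -> ".../HE"
    # ZIP_STORED: slide data is usually compressed already, so we only bundle it
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        zf.write(mrxs_path, os.path.relpath(mrxs_path, parent))
        for folder, _subdirs, files in os.walk(data_dir):
            for name in files:
                full = os.path.join(folder, name)
                zf.write(full, os.path.relpath(full, parent))
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())

# Demo on a throw-away folder with placeholder files (not real slide data)
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "HE"))
    for rel in ("HE.mrxs", "HE/Index.dat", "HE/Data0000.dat"):
        with open(os.path.join(tmp, rel), "wb") as fh:
            fh.write(b"placeholder")
    names = zip_mrxs_slide(os.path.join(tmp, "HE.mrxs"), os.path.join(tmp, "HE.zip"))
print(names)
```

Storing paths relative to the parent folder means the receiving party gets the index file and its subfolder side by side again after unzipping.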

How Pathomation can help

Have a look at our [slides] folder in the Windows Explorer:

Confusing, right?

Now let’s have a look at the same folder through PMA.start’s eyes:

Since Pathomation knows slides, we systematically hide the intricacies of the respective formats that you shouldn’t have to worry about. In PMA.start it becomes obvious which subfolders are true subfolders, and which ones are merely there to support vendors’ data structures.

This doesn’t help you yet to transfer slides of course, but if you are one of our commercial users, you typically want to transfer slides from your local system to PMA.core and back. If that is you, then PMA.transfer is a great free tool (it’s part of your license package) for you to look at.

This is how our [slides] folder shows in PMA.transfer:

PMA.transfer seamlessly interfaces with PMA.start, and uses it as a jumping board to transfer slides between different endpoints. The biggest benefit of PMA.transfer therefore is that it encapsulates format-specific complexities and hides them from the end-user. Through PMA.transfer, you’re truly manipulating slides instead of files.

In addition, PMA.transfer also makes sure that only correct slides are transferred (nothing more frustrating than transferring 2 GB of data over Wifi, only to discover that the source was corrupt), and it confirms that the transfer was completed successfully (probably the second most common source of frustration in these endeavors).

Why it matters

At Pathomation, we take much care designing our components in such a way that data duplication can be avoided at all costs. In PMA.control e.g., you can create as many cases and case collections as you want, but the data always remain in the same location. In PMA.core, you can create nested root-directories, each with different ACL properties. To your end-users these end up looking like different filesystems, but they’re really not. At the API level, we provide the possibility to fingerprint a slide, so you can scan for duplicate files and possibly eliminate them. All of these measures matter when you’re talking about Terabytes of data.
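To give an idea of how duplicate detection via fingerprints works in principle, here is a small sketch. Note that it uses a plain SHA-256 hash of the file contents as a stand-in; it is not the fingerprint that the Pathomation API computes. The fingerprint argument is there so you can plug in another function (for instance a wrapper around the API call) instead.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def content_fingerprint(path, chunk_size=1024 * 1024):
    """SHA-256 of a file's contents, read in chunks to cope with large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(paths, fingerprint=content_fingerprint):
    """Group paths by fingerprint and return only groups with more than one member."""
    groups = defaultdict(list)
    for path in paths:
        groups[fingerprint(path)].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Demo with tiny placeholder files (not real slides)
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.svs": b"tissue-1", "b.svs": b"tissue-2", "copy_of_a.svs": b"tissue-1"}
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as fh:
            fh.write(data)
        paths.append(path)
    duplicates = [[os.path.basename(p) for p in group] for group in find_duplicates(paths)]
print(duplicates)
```

A byte-level hash like this only catches exact copies of single-file slides; for multi-file formats you would have to fingerprint the slide as a whole, which is exactly why doing it at the slide level matters.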

But there comes a time when you do need to copy slides. Perhaps you’re moving them from one installation to another, or there’s a network upgrade, or you just want to ship off some slides to a colleague (My Pathomation can now do this for you, too).

Whatever the case, (virtual) slides will need to be moved around. By their nature, they are big. It helps to have some understanding about how they are structured at that point.

And now you know…

… everything about whole slide images, or, at least, almost everything.

Amongst other things, .mrxs-, .vsi-, and other files help companies like ourselves decide what kind of file format is being used. The alternative would be that we find a large number of subfolders, and we have to parse each and every one of those folders in a variety of ways to try to “guess” what file format it belongs to (if this is even the case at all; a subfolder can still be a regular subfolder, containing no slides at all). This would be a tremendous drag on system performance.

Whether you deal with Pathomation or another vendor, we hope this article has helped you take a peek behind the curtain of what whole slide images are made of, and how to best work with them.

Of course, if you are using the Pathomation platform, you should have a look at PMA.transfer, a no-hassle tool we developed to facilitate slide transfers.

If you’re not yet a Pathomation customer (whaaaaaaaaaat??), you can contact us for a free no-obligation demonstration.

June 22, 2020 (updated June 30, 2020)

I “just” want a slide catalog

The case for simplicity

Sometimes user requests are simple. In this particular instance, we had a customer that “just” wanted a list of all of their slides (within a particular root-directory).

We pointed them to the repository of PMA.control:

We pointed them to the tree interface that combines folder and slides in PMA.view:

But as it turned out, these were too complicated. The customer already had built a nested hierarchy of folders and subfolders, but now wanted a linear list of all slides across all folders. A thumbnail next to each slide reference would also be useful, thank you very much.

A linear list of slides

Here’s how we can create a linear list of slides:


from pma_python import core
from datetime import date
from os import mkdir
from os.path import exists

core.connect()

slides = core.get_slides("C:/slides", recursive=True)
print(len(slides), " slides found")     # sanity check

f = open("c:/wsi_report/cat1.html", "w+")
f.write("<html>")
f.write("<title>Market slide catalog created on " + str(date.today()) + "</title>")
f.write("<body>")
for slide in slides:
    # one slide reference per line
    f.write(slide + "<br />")
f.write("</body>")
f.write("</html>")
f.close()

Want to include the thumbnail? Look no further than the get_thumbnail_url method:


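f = open("c:/wsi_report/cat2.html", "w+")
f.write("<html>")
f.write("<title>Market slide catalog created on " + str(date.today()) + "</title>")
f.write("<body>")
for slide in slides:
    thumb = core.get_thumbnail_url(slide)
    # thumbnail at its native size, followed by the slide reference
    f.write("<img src='" + thumb + "' />" + slide + "<br />")
f.write("</body>")
f.write("</html>")
f.close()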

Ah crap, that looks horrible!

No worries; just add some formatting to the ole’ <img> tag:


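f = open("c:/wsi_report/cat3.html", "w+")
f.write("<html>")
f.write("<title>Market slide catalog created on " + str(date.today()) + "</title>")
f.write("<body>")
for slide in slides:
    thumb = core.get_thumbnail_url(slide)
    # cap the thumbnail height (the value is an arbitrary choice)
    f.write("<img src='" + thumb + "' height='100' />" + slide + "<br />")
f.write("</body>")
f.write("</html>")
f.close()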

Yes, something like this:

Much better!

But there’s a catch here: careful observers notice that the thumbnail URLs used by the above code have a PMA.core Session ID embedded. This means that those URLs are only valid as long as the respective Session IDs remain valid.

This is fine for ad-hoc reporting, but if we want a list that we can post somewhere on a central server as a reference source for others, we need something just a little more sophisticated. Yes, the keyword in this post is “just”, just in case you’re wondering.

Creating a persistent slide catalog

We want to create a list of all slides in our repository, and we want the list to be persistent. In other words, it is not ok to walk away from our browser for a couple of hours, refresh our list, and see the following:

The solution is to not use the thumbnail URL, but retrieve each thumbnail as a binary object and save it to a subfolder. Then, we let our <img> tag point to the downloaded (or cached, if you will) files.


thumb_dir = "c:/wsi_report/cat4/"
if not exists(thumb_dir):
    mkdir(thumb_dir)
f = open("c:/wsi_report/cat4.html", "w+")
f.write("<html>")
f.write("<title>Market slide catalog created on " + str(date.today()) + "</title>")
f.write("<body>")
for slide in slides:
    thumb = core.get_thumbnail_image(slide)
    fn = core.get_slide_file_name(slide) + ".jpg"
    thumb_fn = thumb_dir + fn
    print("Saving thumbnail as ", thumb_fn)
    thumb.save(thumb_fn)
    # point the <img> tag at the locally saved copy, relative to cat4.html
    f.write("<img src='cat4/" + fn + "' height='100' />" + slide + "<br />")
f.write("</body>")
f.write("</html>")
f.close()

This catalog fits our needs. We can post it anywhere, and it only relies on local data. You can store it on your hard disk. You don’t need PHP, you don’t need Python, you don’t need the underlying PMA.core (PMA.start in our example code) to be up and running.

There are cons to our approach, too:

The catalog takes longer to generate, as we need to download a large number of thumbnails one by one
The catalog takes up more space. In our case, for 265 slides, we went from 197 KB to 20+ MB. That’s still manageable to send in a zipped file package, but for larger repositories may become inconvenient.
The catalog is a snapshot of the repository. If you use it as a reference today and add slides to the underlying root-directory tomorrow, the catalog will not pick up the newly added (or removed, for that matter) slides, unless you re-generate the catalog and re-distribute or publish it.

And finally

Every solution has trade-offs, and even for simple problems, you have to think things through to come to the right solution. Problems that have the word “just” somewhere in their formulation can be particularly tricky.

The sourcecode for this post is available through our repository as realdata 038 – simple slide catalog.ipynb. Feel free to download it and adjust it for your own needs.

As always: we encourage interaction. Do let us know what digital pathology problem or scenario you want us to work out in one of our next posts!

April 29, 2020

Optimizing your data volume

Pop quiz

Let’s start this post with a pop-quiz. Without peeking ahead; can you identify what painting it is?

The answer of course is that it’s this one. The question is relevant, because you just proved to yourself that you can get by with very little information to identify something big and important. So let’s just walk you through the way we created the picture above:

We took the original image from Wikipedia, provided through the WikiMedia Commons library, at 585 x 870 pixels (a total of 508,950 pixels). The physical size of this image was 957 KB.
We then resized this image to thumbnail size of 27 x 40 pixels (a total of 1,080 pixels, or a reduction by 99.8%). The physical size of this image was around 3 KB (a reduction by 99.7%).
We then blew up the thumbnail again with a factor 10 to 270 x 400 pixels (bringing the total amount of pixels again to 108,000 or about 21% of the original). Interestingly enough, the image size this time is 75 KB, which represents only about 8% (even though we supposedly offer 21% of the original pixels).
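For those who like to double-check, the percentages in the steps above follow from a few lines of arithmetic. The file sizes in KB are the ones quoted in the text; everything else is computed.

```python
# Pixel counts for the three versions of the image
original = 585 * 870        # image as taken from Wikipedia
thumbnail = 27 * 40         # resized thumbnail
enlarged = 270 * 400        # thumbnail blown up by a factor 10

# File sizes in KB, as quoted in the text
original_kb, thumbnail_kb, enlarged_kb = 957, 3, 75

pixel_reduction = (1 - thumbnail / original) * 100
size_reduction = (1 - thumbnail_kb / original_kb) * 100
pixel_share = enlarged / original * 100
size_share = enlarged_kb / original_kb * 100

print(f"{original:,} -> {thumbnail:,} pixels: {pixel_reduction:.1f}% fewer pixels")
print(f"{original_kb} KB -> {thumbnail_kb} KB: {size_reduction:.1f}% smaller file")
print(f"enlarged image: {pixel_share:.0f}% of the pixels, {size_share:.0f}% of the file size")
```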

The important takeaway is that the human brain is remarkable in filtering out and handling noisy, blurry data.

But what does this have to do with digital pathology or virtual microscopy?

A bottleneck

Recently we were contacted by a company that was involved in telepathology in rather rural areas. It’s a great endeavor: they provide local staff with refurbished scanners, thereby hoping to extend advanced pathology expertise to communities that have no other way of having access to these.

But as anybody working with whole slide images can attest: they’re big. Huge sometimes. And there’s no way around this. It’s a problem for our contact because rural areas unfortunately often also mean limited bandwidth. It can take hours to transfer a single slide, and that’s just not practical.

This is where our earlier exercise comes into play. And the question now becomes: How much information do you really need to make an accurate diagnosis?

If we scan a slide at 40X (0.25 microns per pixel according to DICOM’s definition) and bring it back down to 20X (0.5 microns per pixel), would it be sufficient for a pathologist to make an accurate diagnosis? And what about image compression? Do we need 100% quality? What if we transferred the image at 90% quality? 80%? We know that at 50%, compression artifacts will not result in a pretty image, but the real question is whether the receiving pathologist can still make a diagnosis.

Some research has been done on this: Yukako et al. concluded that even significant compression ratios have no impact on diagnostic accuracy.

So with that being said, how could you do it? And how would you do it with the Pathomation platform?

To Jupyter, capt’n!

A whole slide image has two properties that we can manipulate to control the physical slide size:

The highest zoomlevel contained in the slide (see also our blog post on WSI structure)
The compression ratio of individual tissue tiles

We can then create a Jupyter script that takes in these two parameters to convert any given slide to an intermediate format. We choose TIFF here, but you can adapt the output to your own needs.

The first one is target-quality, and can be given a value of 0-100 (100 being the highest quality).

The second one is the downscale factor. It can be chosen from the list of values [1, 2, 4, 8, 16, 32, 64, 128]. That doesn’t quite translate to an optical magnification or microns-per-pixel reading, but we use it here for the sake of simplicity (it has to do with the way these WSI data are typically structured). The higher the downscale factor, the less magnification and detail you will end up with. If you want to, you can write your own conversion method from downscale factor to microns per pixel and back again.
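Such a conversion method could look like the sketch below. It is an illustration only, not part of our notebook: the function names are hypothetical, and it assumes that the slide was scanned at 40X with a resolution of 0.25 microns per pixel, so that a downscale factor of 1 corresponds to those values. Check the actual resolution of your own slides before relying on it.

```python
# Allowed downscale factors, as listed in the text above.
DOWNSCALE_FACTORS = [1, 2, 4, 8, 16, 32, 64, 128]


def downscale_to_mpp(factor, base_mpp=0.25):
    """Microns per pixel after downscaling.

    base_mpp is the resolution at factor 1 (ASSUMPTION: 0.25, a 40X scan).
    """
    if factor not in DOWNSCALE_FACTORS:
        raise ValueError(f"unsupported downscale factor: {factor}")
    return base_mpp * factor


def downscale_to_magnification(factor, base_magnification=40.0):
    """Perceived magnification after downscaling (ASSUMPTION: 40X at factor 1)."""
    if factor not in DOWNSCALE_FACTORS:
        raise ValueError(f"unsupported downscale factor: {factor}")
    return base_magnification / factor


def mpp_to_downscale(target_mpp, base_mpp=0.25):
    """Largest allowed factor that still resolves at least target_mpp."""
    if target_mpp < base_mpp:
        raise ValueError("target resolution is finer than the scan itself")
    candidates = [f for f in DOWNSCALE_FACTORS if base_mpp * f <= target_mpp]
    return max(candidates)


for f in DOWNSCALE_FACTORS[:4]:
    print(f, downscale_to_mpp(f), downscale_to_magnification(f))
```

Under these assumptions, a factor of 2 gives 0.5 microns per pixel (20X) and a factor of 4 gives 1.0 micron per pixel (10X).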

After setting the parameters, we create a new TIFF file using the GDAL TIFF driver. The width and height of the final TIFF are based on the number of tiles horizontally and vertically.

We read each tile of the final zoomlevel (1:1 resolution) from the server and write it to the resulting TIFF file. Then we create the pyramid of the file using the BuildOverviews function of GDAL.
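The steps described above can be sketched roughly as follows. This is not the code from our notebook, but a minimal illustration under a number of assumptions: tiles are taken to be 512 x 512 RGB, `fetch_tile` is a hypothetical callable that you supply yourself (it should return one tile as a height x width x 3 array of bytes for a given tile position), and all function names are ours. The GDAL and NumPy imports live inside `export_tiff`, so the helper functions can be tried without either library installed.

```python
def tiff_dimensions(tiles_x, tiles_y, tile_size=512):
    """Width and height of the output TIFF, derived from the tile counts."""
    return tiles_x * tile_size, tiles_y * tile_size


def creation_options(quality, tile_size=512):
    """GTiff creation options for a tiled, JPEG-compressed file."""
    if not 0 <= quality <= 100:
        raise ValueError("quality must be between 0 and 100")
    return [
        "TILED=YES",
        f"BLOCKXSIZE={tile_size}",
        f"BLOCKYSIZE={tile_size}",
        "COMPRESS=JPEG",
        f"JPEG_QUALITY={quality}",
        "BIGTIFF=IF_SAFER",
    ]


def overview_levels(width, height, tile_size=512):
    """Powers of two, down to the level where the image fits in about one tile."""
    levels, factor = [], 2
    while max(width, height) / factor >= tile_size:
        levels.append(factor)
        factor *= 2
    return levels


def export_tiff(path, tiles_x, tiles_y, fetch_tile, quality=80, tile_size=512):
    """Write all tiles into one TIFF and build its pyramid.

    fetch_tile(tx, ty) is a HYPOTHETICAL callable provided by the caller.
    """
    from osgeo import gdal   # imported here so the helpers work without GDAL
    import numpy as np

    width, height = tiff_dimensions(tiles_x, tiles_y, tile_size)
    driver = gdal.GetDriverByName("GTiff")
    ds = driver.Create(path, width, height, 3, gdal.GDT_Byte,
                       options=creation_options(quality, tile_size))
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            rgb = np.asarray(fetch_tile(tx, ty), dtype=np.uint8)
            for band in range(3):
                # Write one colour channel at the tile's pixel offset.
                ds.GetRasterBand(band + 1).WriteArray(
                    rgb[:, :, band], tx * tile_size, ty * tile_size)
    levels = overview_levels(width, height, tile_size)
    if levels:
        ds.BuildOverviews("AVERAGE", levels)
    ds.FlushCache()
    ds = None  # dereferencing the dataset closes the file
```

A call such as `export_tiff("out.tif", tiles_x, tiles_y, my_fetch, quality=60)` would then produce the derived slide, with `my_fetch` wrapping whatever call you use to retrieve tiles from the server.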

The complete Jupyter notebook is available from our website.

The jury is in session

Let’s see what the result of our code is for different combinations.

We downloaded CMU-1.svs, which has a compression quality of 30%, from the OpenSlide sample data repository. The original file size is 169 MB.

When we run our script and ask it to generate a derived TIFF slide with 100% tile quality and the same magnification level, we find something interesting: the new slide is about 1.5 GB in size. It became bigger!

This is our first lesson then: when doing this kind of conversion, it doesn’t make any sense to transfer data at a quality setting that’s higher than the original slide’s compression quality.
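One way to apply this lesson is to cap the requested quality at the quality of the source slide before converting. The helper below is a hypothetical illustration, not part of our notebook, and it assumes that you already know (or can estimate) the compression quality of the original slide.

```python
def effective_quality(target_quality, source_quality):
    """Never ask for a higher quality than the source slide already has."""
    for value in (target_quality, source_quality):
        if not 0 <= value <= 100:
            raise ValueError("quality values must be between 0 and 100")
    return min(target_quality, source_quality)


# CMU-1.svs was compressed at 30%, so requesting 100% only inflates the file.
print(effective_quality(100, 30))
# A slide scanned at 80% can still be reduced to 60%.
print(effective_quality(60, 80))
```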

The CMU-1.svs slide is an extreme case. We haven’t heard of any labs that are scanning at only 30% quality; most scanners are usually calibrated to produce data at 70%-80% compression quality.

That being said, let’s just use the 1.5 GB as a reference point and see how the data package becomes smaller as we vary both the compression quality and the downsampling parameter:

Not surprisingly, as the scaling factor goes up (a scaling factor of 4 would roughly correspond with a perceived magnification of 10X), the resulting slide size becomes significantly smaller.

So along both axes, significant savings are to be had. Similar to the Mona Lisa painting example that we started with, the slide at 10X and 60% quality is only 2.3% of the file size of the original 40X and 100% quality file.
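To put that number in perspective, here is a back-of-the-envelope calculation. It uses only the figures quoted in this post (the 1.5 GB reference file and the 2.3% ratio) plus the fact that downscaling by a factor f in both directions leaves 1/f² of the pixels. Keep in mind that file size does not track pixel count exactly, as the painting example already showed, so treat the split between the two effects as a rough indication.

```python
def pixel_fraction(downscale_factor):
    """Fraction of the pixels left after downscaling in both directions."""
    return 1.0 / (downscale_factor ** 2)


reference_gb = 1.5          # derived TIFF at downscale 1 and 100% quality
reported_fraction = 0.023   # downscale 4 and 60% quality, as quoted above

# Downscaling alone: a factor of 4 keeps 1/16 of the pixels.
from_downscaling = pixel_fraction(4)

# Whatever remains of the reduction is attributable to the lower quality setting.
from_quality = reported_fraction / from_downscaling

estimated_mb = reference_gb * 1000 * reported_fraction

print(f"pixels kept at downscale 4: {from_downscaling:.2%}")
print(f"extra factor from quality:  {from_quality:.2f}")
print(f"estimated file size:        {estimated_mb:.1f} MB")
```

In other words, the 1.5 GB reference file shrinks to roughly 35 MB.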

What else?

We wrote this article to give you ideas on how you can employ PMA.start to optimize your data volume when shuttling data back and forth between two sites.

This is but one scenario to consider. Is the eventual resolution sufficient to maintain diagnostic accuracy? That’s up to you and your pathologist(s) to resolve. Whether the parameters specified here work for you is a question only you can answer.

To give you an idea, here’s the slide with a downsampling rate of 1 and 100% compression quality:

Here’s the same slide with a downsampling rate of 4 and only 60% compression quality:

There is one more interesting experiment to consider here: storage for digital pathology doesn’t scale very well, and we would be interested to see whether machine learning algorithms (ML/DL/AI) could be trained equally well on heavily compressed data, compared to uncompressed data.

Full disclosure: there are things missing here, too. We’re not incorporating the thumbnail image in the exported TIFF the way certain other vendors do. It’s possible, but again, the eventual implementation depends on your personal preference and circumstances. If you’re looking for help and advice with this, we can assist you; contact us at slides@pathomation.com.