My first 6 months @Pathomation

Or the story of me, being thrown to the lions several times.

Digital pathology, a topic I had never heard of six months ago (can you imagine?!) is now one of the most important subjects in my life.

To be honest, I never thought I would get the job at Pathomation, but even though I did not have any experience, they still believed in me (or just gave me the benefit of the doubt, of course). And I'm certainly glad they did.

I still go to school, and before Pathomation I worked at a local supermarket, which is nice but not very challenging. Now I have to step out of my comfort zone every day, doing things I've never done and trying to be the best and smartest version of myself, since I'm surrounded by pathologists, scientists, and so on. Believe me, they are a challenge in themselves, but a challenge I've accepted happily, because working with people who are so skilled and knowledgeable is enlightening and motivating.

Step by step and day by day, I’m learning more about medical imaging and the whole world around it. Not only by working from the office, but also abroad. During the first week of January I went to Morocco to work side by side with our Moroccan developers. This was my first work trip ever!

Unfortunately, it was (probably) also the last one of this year, due to unforeseen circumstances (covid-19, anybody?). I'm waiting for the next trip, even though I'm secretly happy with that little bit of extra time to mentally prepare myself for a medical conference… (which will also be my first one, of course).

You might be wondering what I actually do in this company, since I’ve been bragging so much about it in this post. Explaining that in only one paragraph is quite hard, but I’ll try to keep it short.

First of all, I want you to understand that there's a variety of things that I actually do, and an array of things that I try to do. All paperwork-related things, like bookkeeping, invoicing and preparing estimates, that's on me! I also try to help Yves by testing our software and giving him my opinion from a user perspective (and hopefully inspiring him with new ideas). In fact, I try to ease the work of all my colleagues where I can. Because of this, my job contains a wide selection of different tasks, and I'm still exploring what I like to do most.

As you can see, I get lots of chances where I can grow my courage and improve my social skills. Therefore, I’m glad I can be part of this small but growing team!

Optimizing your data volume

Pop quiz

Let’s start this post with a pop-quiz. Without peeking ahead; can you identify what painting it is?

The answer, of course, is that it's this one. The question is relevant because you just proved to yourself that you can get by with very little information to identify something big and important. So let's walk you through the way we created the picture above (a code sketch that reproduces these steps follows the list):

  • We took the original image from Wikipedia, provided through the WikiMedia Commons library, at 585 x 870 pixels (a total of 508,950 pixels). The physical size of this image was 957 KB.
  • We then resized this image to thumbnail size of 27 x 40 pixels (a total of 1,080 pixels, or a reduction by 99.8%). The physical size of this image was around 3 KB (a reduction by 99.7%).
  • We then blew the thumbnail back up by a factor of 10 to 270 x 400 pixels (bringing the total number of pixels to 108,000, or about 21% of the original). Interestingly enough, the image size this time is 75 KB, which represents only about 8% of the original (even though we supposedly offer 21% of the original pixels).
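
If you want to reproduce this exercise yourself, here is a minimal sketch using Pillow; the file name is hypothetical and the exact byte counts will vary with the source image and JPEG settings:

import os
from PIL import Image

original = Image.open("mona_lisa.jpg")     # hypothetical 585 x 870 source image
print("original:", original.size, os.path.getsize("mona_lisa.jpg"), "bytes")

# shrink to a 27 x 40 thumbnail
thumb = original.resize((27, 40))
thumb.save("mona_lisa_thumb.jpg")
print("thumbnail:", thumb.size, os.path.getsize("mona_lisa_thumb.jpg"), "bytes")

# blow the thumbnail back up by a factor of 10
blown_up = thumb.resize((270, 400))
blown_up.save("mona_lisa_blown_up.jpg")
print("blown up:", blown_up.size, os.path.getsize("mona_lisa_blown_up.jpg"), "bytes")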

The important takeaway is that the human brain is remarkable in filtering out and handling noisy, blurry data.

But what does this have to do with digital pathology or virtual microscopy?

A bottleneck

Recently we were contacted by a company involved in telepathology in rather rural areas. It's a great endeavor: they provide local staff with refurbished scanners, thereby hoping to extend advanced pathology expertise to communities that would otherwise have no access to it.

But as anybody working with whole slide images can attest: they're big. Huge, sometimes. And there's no way around this. And it's a problem for our contact, because rural areas unfortunately often also mean limited bandwidth. It can take hours to transfer a single slide, and that's just not practical.

This is where our earlier exercise comes into play. And the question now becomes: How much information do you really need to make an accurate diagnosis?

If we scan a slide at 40X (0.25 µm per pixel according to DICOM's definition) and bring it back down to 20X (0.5 µm per pixel), would it be sufficient for a pathologist to make an accurate diagnosis? And what about image compression? Do we need 100% quality? What if we transferred the image at 90% quality? 80%? We know that at 50% quality the compression artifacts will not result in a pretty image, but the real question is whether the receiving pathologist can still make a diagnosis.
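
To get a feel for what compression alone does to file size, here is a minimal sketch that re-encodes a single extracted field of view at decreasing JPEG quality; the file name tile.jpg is just a placeholder for whatever tile you extract:

from io import BytesIO
from PIL import Image

tile = Image.open("tile.jpg")   # hypothetical field of view extracted from a slide
for quality in (100, 90, 80, 50):
    buffer = BytesIO()
    tile.save(buffer, format="JPEG", quality=quality)
    print("quality", quality, "->", len(buffer.getvalue()), "bytes")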

Some research has been done on this: Yukako et al. concluded that even significant compression ratios have no impact on diagnostic accuracy.

So with that being said, how could you do it? And how would you do it with the Pathomation platform?

To Jupyter, capt’n!

A whole slide image has two properties that we can manipulate to control the physical file size of the slide:

  • The highest zoomlevel contained in the slide (see also our blog post on WSI structure)
  • The compression ratio of individual tissue tiles

We can then create a Jupyter script that takes these two parameters and converts any given slide to an intermediate format. We chose TIFF here, but you can adapt the output to your own needs.

The first one is target-quality, and can be given a value of 0-100 (100 being the highest quality).

The second one is the downscale factor. It can be chosen from a list of values of [1, 2, 4, 8, 16, 32, 64, 128]. That doesn't quite translate to an optical magnification or a microns-per-pixel reading, but we use it here for the sake of simplicity (it has to do with the way these WSI data are typically structured). The higher the downscale factor, the less magnification and detail you will end up with. If you want to, you can write your own conversion method from downscale factor to microns per pixel and back again.
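
In the notebook, these two knobs can simply be variables near the top; the names below are illustrative:

# illustrative names; pick whatever you like in your own notebook
target_quality = 80      # JPEG quality of the re-encoded tiles, 0-100 (100 = best)
downscale_factor = 4     # one of [1, 2, 4, 8, 16, 32, 64, 128]; higher = less detail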

After setting the parameters, we create a new TIFF file using the GDAL TIFF driver. The width and height of the final TIFF are based on the number of tiles horizontally and vertically.

We read each tile of the final zoomlevel (the 1:1 resolution of the new file) from the server and write it to the resulting TIFF file. Then we create the pyramid of the file using GDAL's BuildOverviews function.
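
The sketch below illustrates this conversion loop; it is a simplified approximation of the notebook, not the notebook itself. The tile size, the band handling, and the mapping from downscale factor to zoomlevel (assuming each zoomlevel halves the resolution of the one above it) are assumptions, and slide is a hypothetical input path:

from osgeo import gdal
import numpy as np
from pma_python import core

slide = "C:/wsi/sample.svs"      # hypothetical input slide
target_quality = 80              # see the parameters discussed above
downscale_factor = 4
tile_size = 256                  # assumed tile size

# pick the zoomlevel matching the requested downscale factor,
# assuming each zoomlevel halves the resolution of the one above it
max_zl = core.get_max_zoomlevel(slide)
target_zl = max_zl - int(np.log2(downscale_factor))
tiles_x, tiles_y = core.get_zoomlevels_dict(slide)[target_zl][:2]

# create the output TIFF with JPEG-compressed tiles at the requested quality
driver = gdal.GetDriverByName("GTiff")
out = driver.Create("converted.tif",
                    tiles_x * tile_size, tiles_y * tile_size, 3, gdal.GDT_Byte,
                    options=["TILED=YES", "COMPRESS=JPEG",
                             "JPEG_QUALITY=" + str(target_quality)])

# read each tile from the server and write it into the new file
for x in range(tiles_x):
    for y in range(tiles_y):
        tile = np.array(core.get_tile(slide, x=x, y=y, zoomlevel=target_zl))
        for band in range(3):
            out.GetRasterBand(band + 1).WriteArray(
                tile[:, :, band], x * tile_size, y * tile_size)

# finally, build the pyramid on top of the base layer
out.BuildOverviews("AVERAGE", [2, 4, 8, 16, 32])
out = None   # flush and close the file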

The complete Jupyter notebook is available from our website.

The jury is in session

Let’s see what the result of our code is for different combinations.

We downloaded CMU-1.svs from the OpenSlide sample data repository, which has a compression quality of 30%. The original file size is 169 MB.

When we run our script and ask it to generate a derived TIFF slide with 100% tile quality at the same magnification level, we find something interesting: the new slide is about 1.5 GB in size. It became bigger!

This is our first lesson, then: when doing this kind of conversion, it doesn't make any sense to transfer data at a lower compression rate (i.e., higher quality) than the original slide's.

The CMU-1.svs slide is an extreme case. We haven’t heard of any labs that are scanning at only 30% quality; most scanners are usually calibrated to produce data at 70%-80% compression quality.

That being said, let’s just use the 1.5 GB as a reference point and see how the data package becomes smaller as we vary both the compression quality and the downsampling parameter:

Not surprisingly, as the scaling factor goes up (a scaling factor of 4 would roughly correspond with a perceived magnification of 10X), the resulting slide size becomes significantly smaller.

So along both axes, significant savings are to be had. Similar to the Mona Lisa example that we started with: the slide at 10X and 60% quality is only 2.3% of the file size of the original 40X, 100% quality file.

What else?

We wrote this article to give you ideas on how you can employ PMA.start to optimize your data volume when shuttling data back and forth between two sites.

This is but one scenario to consider. Is the eventual resolution sufficient to maintain diagnostic accuracy? That's up to you and your pathologist(s) to resolve. Whether the parameters suggested here work for you is a question only you can answer.

To give you an idea; here’s the slide with a downsampling rate of 1, and 100% compression quality:

Here’s the same slide with a downsampling rate of 4, and only 60% compression quality:

There is one more interesting experiment to consider here: storage for digital pathology doesn’t scale very well, and we would be interested to see whether machine learning algorithms (ML/DL/AI) could be trained equally well on heavily compressed data, compared to uncompressed data.

Full disclosure: there are things missing here, too. We're not incorporating the thumbnail image in the exported TIFF, the way certain other vendors do. It's possible, but again, the eventual implementation depends on your preferences and circumstances. If you're looking for help and advice with this, we can assist you; contact us at slides@pathomation.com.

More business intelligence

The story of the three Qs

In this article, we explain how Pathomation was recently able to assist one of its customers with Performance Qualification tests for a new slide scanner.

At our customer's lab, each piece of equipment that they take into operation goes through a rigorous qualification pipeline before being put to work for day-to-day lab activities.

This goes for slide scanners as well.

The qualification pipeline (the “validation procedure”) consists of three steps:

  • Installation Qualification (IQ)
  • Operational Qualification (OQ)
  • Performance Qualification (PQ)

The first step is straightforward: you're asking yourself whether the scanner can scan your slides or not. Is the type of glass slide that your lab uses, in combination with the coverslip, compatible with the new scanner? Is the barcode on the slide label scanned properly, too? What about (automated) tissue detection?

The second step is operational qualification: can the new equipment handle the edge parameters of your day-to-day operations? If you plan to feed the scanner up to 400 slides per day; will that work?

The third step in the validation can be somewhat subjective, and it’s here that Pathomation’s software platform came in particularly handy.

Performance Qualification

Performance Qualification (PQ) is a test procedure that takes place in order to verify if a new piece of equipment is good enough to be put to work in the daily workflow of a company. It happens after the hardware has already gone through IQ and OQ.

Our customer recently wanted to know how their various scanners compare to a new one in terms of speed. They took a representative set of slides, divided it into three groups (“requests”), and then put each scanner to work.

In the vendor's viewer software, one of the parameters that could be seen was “scanning duration”. Considering the size of the dataset, however, opening each slide individually in the viewer software, noting down the scanning time, recording it in an Excel sheet, etc., would have been a very tedious task.

So they turned to Pathomation for help: Can Pathomation’s software be used to create a table with all scan duration values for all scanned slides?

Our API

Pathomation's API offers a get_slide_info() call. We used it in our first article on business intelligence to extract specific bits of information. The method by default returns a nested hierarchy of information.

But not every scanner exports the same information. Some of the slide information we expose ourselves through ImageInfo, some we don't. This is because vendor 1 exposes (A, B, C, D, E), vendor 2 exposes (A, B, F, G, H), vendor 3 exposes (A, B, C, I, J, K, L), etc. Therefore, Pathomation offers only the common denominator information, something like (A, B, C).

In order to maintain some standardization across the different vendors, part of the returned information by get_slide_info is a MetaData array that contains key-value pairs that may or may not be provided by your scanner vendor.

As it turns out, we didn’t have the scanning time in there yet, but because of the already provided structure, it was straightforward to add it.

As in our first business intelligence exercise, we then write a wrapper method to extract the scanning duration as we need it:

def get_scanning_duration(slide_ref):
    # retrieve the full (nested) slide information structure
    info = core.get_slide_info(slide_ref)
    # look for the ScanningDuration key-value pair in the MetaData array
    for meta_el in info["MetaData"]:
        if meta_el["Name"] == "ScanningDuration":
            return meta_el["Value"]
    return -1   # not provided by this scanner/vendor

The remainder of the script is straightforward:

  • Loop over the different scanner output
    • Foreach scanner (“request”), get all the slides (recursively in our case)
      • Foreach slide, extract the scanning duration
  • Wrap all output into a Pandas dataframe structure
  • Export the Dataframe to a spreadsheet

A word about that last step. On occasion, people have called us old-fashioned for this one. Surely Excel is spread wide and far enough by now so that it can be considered a de facto “standard” file format, too, can’t it?

I disagree. I still prefer to use csv instead of Excel. Why? Because csv files are simple, and transportable to many other platforms and applications. Our data in this case consists of a single table with three columns. It’s simple. We don’t need a complex data format to store this kind of data.

Generating fancy file format output is not part of the assignment here. Keep It Simple.

Our final code looks like this and can be downloaded as a Jupyter notebook, so you can play around with it yourself.


from pma_python import core
import pandas as pd

server = "http://***/***"
user = "***"
pwd = "***"
print("Session initiated ", core.connect(server, user, pwd))

requests = ["RQ105", "RQ204", "RQ695"]
base_folder = "Images/pq"
print(len(core.get_directories(base_folder)), " subfolders detected in root base folder", base_folder)

s_times = []
for req in requests:
    print(base_folder + "/" + req)
    # recursively collect all slides for this scanner ("request")
    for slide in core.get_slides(base_folder + "/" + req, recursive=True):
        s_times.append({"slide": str(slide).replace(base_folder, ""),
                        "scan_time": get_scanning_duration(slide),
                        "request": str(req)})

scan_times = pd.DataFrame(s_times, columns=["request", "slide", "scan_time"])
scan_times.to_csv("scanning duration.csv")

What about the results?

We can import the resulting CSV file in Excel and see what it looks like:

As we’re interested in comparing the different scanners to one another, one direct way to do this is with a pivot-chart.
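
For example, with pandas you can aggregate the CSV produced by the script above directly; this assumes the scan_time column is numeric (seconds), so if your scanner reports a formatted time string you will need to parse it first:

import pandas as pd

# load the CSV generated by the script above
scan_times = pd.read_csv("scanning duration.csv")

# quick summary per scanner ("request")
print(scan_times.groupby("request")["scan_time"].describe())

# or a pivot table, ready to feed into a chart
pivot = scan_times.pivot_table(index="request", values="scan_time",
                               aggfunc=["count", "mean", "median", "max"])
print(pivot)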

These are the results. What do you think? Do all three scanners perform equally? Does the new scanner pass performance qualification (PQ)?

Your challenge here

This is how Pathomation works with its customers.

Our philosophy has been and remains to develop local, and then scale as you handle more complex scenarios. You do not need the commercial version of PMA.core to get to work with the code in this post. The Jupyter notebook that comes with this blog post is suited for use for both PMA.start and PMA.core.

Do you use Pathomation software in your daily workflows already? Tell us your business intelligence challenge and perhaps we'll address it in an upcoming post!

A look at PMA.slidebox

So what if you “just” want to show your slides?

Education is one of the main application domains of digital pathology. And there are many instances where you just have a couple of slides that you want people to look at. When Pathomation first became involved in deploying its software for facilitating seminars, we used PMA.view.

But while it’s possible to do this, PMA.view is not a good solution for this particular problem:

  • People still need to log in to PMA.view
  • PMA.view requires telling people where to navigate to (which root-directories / paths); your root-directories on the PMA.core side of things may not exactly reflect the content that you want people to see.
  • PMA.core navigation is folder-based, and so is PMA.view's. This means that the concept of a case (a group of slides belonging to a patient or experiment) is not intuitively represented.
  • Neither PMA.view nor PMA.core supports any of the visual cue elements that we've all gotten accustomed to in recent years, such as avatars.
  • The learning curve of PMA.view is still too steep for people who just need to look at slides; it overshoots what you want people to actually experience.

How did people used to do it? They traveled to conferences with a slidebox in their hand luggage. Within the slidebox: neatly organized slides, sorted by case. We don't want to be sensational here and say that the slides got lost all the same, or that they broke all the time, or got confiscated by security in a post-9/11 world (glass slides + ninja pathologist = impromptu shuriken?).

But: things could happen when traveling with physical slides, and the most likely issue was probably still somebody forgetting to take their slides with them in the first place!

Also: when traveling with physical slides, you depend heavily on the organizers' ability to provide and calibrate multi-headed microscope equipment. As more people attend, aligning all the optics becomes ever harder, and for large groups this is just impractical.

Cue PMA.slidebox

So we got thinking… If people are used to physical slideboxes, why not just make a virtual slidebox? This is exactly what PMA.slidebox is and does!

PMA.slidebox, like all our software, relies on PMA.core. This means that you can use all the great features of PMA.core (different root-directories, access control) without having to explain them to your audience.

All your audience sees, without having to register, log in, install, or download anything (zero footprint), is this:

So how does it work? PMA.slidebox shows up to four collections in the top-left corner of the screen (screenshot only shows three). When you select a collection, you see the “cases” appear underneath it, along with the slides for each “case”.

We put the word “case” between quotes deliberately because you don’t have to set it up this way. You can have a simple list of slides without any hierarchy or structure to it, and PMA.slidebox will pick it up. Similarly, if you only have a couple of cases, you could turn those into individual collections, and present them as such.

What you want to show and how you want to show it is completely up to you. It’s just like a real slidebox: you put the slides in that you want, and you organize them the way you want them, too.

Want to give it a go yourself? Here’s an example how such a virtual slidebox works in practice: http://host.pathomation.com/p0022_slidebox/

Configuration

PMA.slidebox is flexible. It is hosted on a website, somewhere (can be on your infrastructure, or on ours). If you don’t have PHP, we can configure it for you; but if you do, you can configure everything by yourself via a configuration panel.

What you first need to do is decide how you want to have everything structured. You can build a hierarchy up to three levels deep, with the following levels:

If you have only 4 cases, as in the above screenshot, you could simplify your hierarchy like this:

While PMA.slidebox is flexible, we should point out that it is necessary to have some kind of hierarchy at least. You cannot just dump all your slides into a single folder and expect the software to figure it out from there.

Note: if you do have large repositories of slides, and you want structured case creation and organization, you should have a look at PMA.control.

Here's what the setup looks like when you spread your cases across only two collections:

And here’s what that same group of slides looks like, but this time with each case being defined as its own separate collection:

In closing

With our PMA.slidebox product, Pathomation solves the problem of mass distribution of slide collections. When you just want to share your slides with people in a somewhat organized (collections and cases) fashion, PMA.slidebox is the perfect solution for you. The easy-to-use configuration panel behind the front-end makes it a breeze to point to the exact content that you want to display, and under no circumstances does the end-user have to do anything other than click on the URL that you provide them with.

Find out more about PMA.slidebox at our website at http://www.pathomation.com/pma-slidebox.

Pathomation on the web

Web presence

We want to talk a bit more about our different communication channels this week.

When you’re reading this article, you’ve obviously found one of them.

The RealData blog at http://realdata.pathomation.com is a WordPress website that we set up a couple of years ago to allow us to communicate about or explain topics that don't necessarily have a dedicated place yet on our “main” company website at http://www.pathomation.com.

Pathomation is a small company, and things can move quickly. We simply don’t have time to re-do our website each month or so because our product offering changes, or because there’s a spurt in creative writing that needs to find a landing spot and reach an audience. A free-form blog then seemed like a good idea.

And we still think it is 😊

There are companies whose website is a blog, but we do think there’s still a need to offer structured information and a general product overview as well.

So while you can’t constantly rewrite your website, we did manage to re-work http://www.pathomation.com this month and we’re pretty proud of the result. If you haven’t checked it out yet, go ahead and do so. It’s a lot more comprehensive than anything we’ve had up before.

And if you’ve read this blog, of course you’ve heard about PMA.start before, our free whole slide image / digital pathology viewer software that can be used by anybody for anything to manage their local slide content. Our http://free.pathomation.com website is the third axis of our online web-presence strategy.

PMA.start comes with no limitations, except the one that is built-in: you can only use it on local content. If you want to share data with colleagues via a network, you need to upgrade to our professional PMA.core product. If you’re not quite sure what that’s all about, you can still sign up for our beta program until the end of this month (just a few days left, so be quick).

Check out our beta landing page at https://www.pathomation.com/beta

And there you have it: our three-pronged strategy to provide you, our valued customer and end-user, with background information about the Pathomation universe and our great products.

A look at PMA.control

Educational needs and virtual microscopy

Pathomation's software platform includes software for a number of digital pathology applications.

At the most basic end of the spectrum we offer PMA.start for local viewing and research purposes. Even with PMA.start, you get full access to our front-end JavaScript-based visualization framework and back-end automation API (more on those in a separate blog post).

At the opposite end of the spectrum, we offer a sophisticated training software package called PMA.control. Consider this:

  • Current slide-based approaches to microscopy teaching face the logistical challenge of transporting people, slides and training material.
  • The size, location, and number of instructional sessions are limited (in time, place, and capacity)
  • Concurrent training on the same material is not possible. One microscope can offer one unique slide. Musical chair… erh… microscopes, anyone?

We identified an evolving need to train students and professionals alike to accurately evaluate tissue material with a broad range of (assay-specific) algorithms. Systematically organizing training materials and bringing training participants together in a virtual setting is what it's all about.

Projects

You start in PMA.control by setting up a project. A project describes what it is that you want to organize instructional material around. A project can represent a course at a university, or a drug for a pharma company. A project can delineate a geographical territory. It’s totally up to you.

Projects have various properties. Apart from their name, you can identify them with an icon. This is convenient as your list of projects becomes larger and more people become involved. Speaking of involved people: you can identify one or more project managers. This is particularly useful for larger organizations, where one person is seldom available all the time, but it's relatively easy to find a replacement in case of absence.

Training sessions

A project consists of training sessions. Again, what these mean semantically is completely up to you and your imagination:

  1. One client of ours uses PMA.control to organize weekly seminars in various places across the globe. Each seminar/country combination translates into its own training session in PMA.control, with specific start- and end-dates
  2. One medical school uses PMA.control to train residents. A training session can refer to the class coming in on a particular week, but it can also be linked to small research projects that students participate in.
  3. Another client integrates PMA.control into a web portal, so all training sessions by definition are open-ended. The client has a drug portfolio, so rather than having them be restricted in time, training sessions refer to various indications for different drugs.

Safe to say that training sessions can be exploited for diverse applications.

Case collections

All right, we have projects and training sessions… When do we put digital slide content in them? This is where case collections come in.

The idea is that you organize your training sessions in different parts. During a three-day seminar, you could have one day dedicated to guided lectures (that's a case collection with its own slides). On day two, you allow people to evaluate themselves through some hands-on exercises (on a second case collection, which holds different slides than the first one). On day three, it's crunch time, and attendees take an actual test to see how well they absorbed the material (on yet a third case collection, once again with unique slides).

A case collection is coupled to a project, but is independent from any training sessions, so you can re-use them throughout the curriculum that you’re building. Think about it; otherwise if you organized the same training session repeatedly, you would continuously have to re-define the case collections, too!

A case collection consists of cases, which in turn consist of slides. You can choose to construct a case such that it pre-focuses on a particular region of interest (ROI) within a slide. You can also add various meta-data at case-level as well as slide-level. Not unimportantly: you can configure the initial rotation angle for each slide in the case. This is particularly relevant if your case consists of serial sections that may not be all in the exact same orientation.

Interaction modes

Remember the three-day seminar we just mentioned? And you also remember that we called the software “PMA.control”, right?

Ok.

The name PMA.control refers to the fact that the owner of the software is in total control of what participants can see and do within a session at any given time.

Consider the following situations during our three-day seminar:

  • On the first day, the instructor wants his pupils to stay nicely in the kiddie pool. They should give their undivided attention only to the material intended for the first day.
  • However, this one person in the afternoon of the first day is taking the seminar for the second time. She asks if she can skip ahead already to the content from day two.
  • On the second day, people are learning and experimenting with a different dataset. Do they understand the material well enough to pass the test on the last day? Clearly the material from day three must not be visible to anybody yet.
  • On day three, it’s crunch time. The actual test material is now released. Depending on the intention of the instructor, earlier discussed material can now even be closed off.

For all these conditions, PMA.control offers interaction modes. An interaction mode controls if and how a case collection presents itself to the end-user.

Training sessions consist of multiple case collections and multiple users participate in a training session. At any given time, the instructor of a session can specify whether a particular case collection within the session is accessible to a specific user and how.

When signing into PMA.control, the instructor sees a grid with users and case collections. This grid can be used to control what user interacts with what case collection.

PMA.control ships with a number of default interaction modes, but these can be customized via a matrix interface where one stipulates what properties are associated with each.

Let’s see how interaction modes come into play during our three-day seminar:

  • Before leaving for the seminar, the instructor applies the interaction mode “locked” to all case collections for all participants.
  • On the morning of the first day, the instructor walks into the seminar room an hour early and sets the interaction mode of the first case collection to “browse” for everybody to see. That was easy! He goes to the hotel bar to grab a nice cup of coffee.
  • In the afternoon of the first day, an attendee asks about being allowed to skip ahead with the material a bit. The instructor asks if any other people are in the same situation. For those, he sets the interaction mode for the second case collection to “self-test”. Users can interactively fill out a pre-determined scoring form that goes with each case, and they can see each other’s results to discuss their findings amongst themselves.
  • On the second day, the second case collection is unlocked for everybody. Everybody now sees the second case collection in self-test mode. The first case collection remains in “browse” mode, so participants can use these as reference material. The third case collection remains off limits today.
  • On the morning of the third day, the first thing the instructor does is reset the first and second case collection in the training session to “locked”. Rien ne va plus. The third and last case collection is switched to “test”, and students can take their final assessment. Users can interactively fill out a pre-determined scoring form that goes with each case, but they can’t see each other’s data anymore.

It’s possible for an instructor to be an instructor for one session, but only a “regular” participant in another. This is especially useful in medical schools where different specialists consult with and train each other on various subject matters on a continuous basis.

Conclusion

PMA.control allows you to assemble whole-slide images, scoring forms, consensus scores and scoring manuals into digital training modules. Full-service, no-hassle management of the software is offered through the PathoTrainer service, which is organized through Pathomation's parent company HistoGeneX.

If you’re interested in higher education teaching of histology or pathology, make sure to also have a look at Pathomation’s education landing page. For pharma and CRO application, we have a separate landing page.

Look sharp

Blurred or focused?

In an earlier post, we offered some ideas on how to detect tissue in a scanned slide.

The next step people often want to take is to examine how the sharpness of the tissue is distributed throughout the slide. No scanner catches all, and you will see blurry areas in pretty much all your scans.

It then helps to be able to differentiate the tiles that have poor focus from the tiles that are sharp. As it turns out, we already tackled a similar problem when we were determining the sharpest tiles within z-stacked slides.

For this particular exercise (focus variation within a single plane), Sied Kebir in Germany was kind enough to provide us with relevant sample data.

And here are two relevant extracted tiles to illustrate the problem.

Sied is looking for a method to systematically map the blurry tiles vs the crisp ones.

Blur detection with OpenCV

A good introduction on blur detection with the OpenCV library is offered by pysource in the following video tutorial:

Let’s see what that gives when we apply it to Sied’s sample images:
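
A minimal sketch of that measurement, with hypothetical file names standing in for the two extracted tiles:

import cv2

# hypothetical file names for the two extracted tiles shown above
for filename in ("tile_focused.jpg", "tile_blurry.jpg"):
    img = cv2.imread(filename)
    score = cv2.Laplacian(img, cv2.CV_64F).var()
    print(filename, "->", score)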

Great! The numbers are not as far apart as in pysource's video, but that makes sense: even in focused tissue we'll find many more gradients and sloping color ranges than in the average picture of a person sitting in a room, which contains distinctive features like outlined walls and facial contours.

Pysource suggests converting your original images to grayscale. Does it make a difference? In our experiments we find different values (of course), but the trend is the same. Since retaining the color leads to slightly bigger differences, we're inclined to stick with the original color images.

If you do want to convert your color tiles to grayscale, here’s a great StackOverflow article about how this works.
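
For completeness, here is the same measurement on the grayscale version of each tile (file names again hypothetical), so you can compare the two variants for yourself:

import cv2

for filename in ("tile_focused.jpg", "tile_blurry.jpg"):   # hypothetical file names
    img = cv2.imread(filename)                              # OpenCV loads images as BGR
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    print(filename,
          "color:", cv2.Laplacian(img, cv2.CV_64F).var(),
          "grayscale:", cv2.Laplacian(gray, cv2.CV_64F).var())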

Distribution and Exploratory Data Analysis (EDA)

Our next step is to put it all in a loop and systematically examine how sharp or blurry each individual tile actually is. For semantic ease, we wrap the measurement in a get_sharpness function:


import cv2
from pma_python import core
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import sys

def is_tissue(tile):
    # simple tissue detection based on pixel statistics (see our earlier post)
    pixels = np.array(tile).flatten()
    mean_threshold = np.mean(pixels) < 192   # 75th percentile
    std_threshold = np.std(pixels) < 75
    return mean_threshold and std_threshold

def is_whitespace(tile):
    return not is_tissue(tile)

def get_sharpness(img):
    # variance of the Laplacian: higher values indicate sharper (better focused) tiles
    pixels = np.array(img)
    return cv2.Laplacian(pixels, cv2.CV_64F).var()

slide = "C:/wsi/sied/test.svs"
max_zl = 5 # or set to core.get_max_zoomlevel(slide)
dims = core.get_zoomlevels_dict(slide)[max_zl]

tissue_map = []
sharp_map = []

for x in range(0, dims[0]):
    for y in range(0, dims[1]):
        tile = core.get_tile(slide, x=x, y=y, zoomlevel=max_zl)
        tiss = is_tissue(tile)
        tissue_map.append(tiss)
        if tiss:
            sharp_map.append(get_sharpness(tile))
        else:
            sharp_map.append(0)
    print(".", end="")
    sys.stdout.flush()
print()

After collecting all the results, it is worth examining the histogram of this data.

Ideally, we would like to see a bimodal distribution (sharp vs. blurred), but that's not what we see here. The reason is that the unevenness in the tissue is itself distributed gradually, so the sharpness values form a continuum rather than two distinct groups.
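
Continuing from the loop above, a quick matplotlib histogram of the non-whitespace tiles makes that distribution visible (whitespace tiles were assigned a sharpness of 0 and are excluded here):

import matplotlib.pyplot as plt

tissue_sharpness = [s for s in sharp_map if s > 0]   # drop the whitespace tiles

plt.hist(tissue_sharpness, bins=50)
plt.xlabel("Laplacian variance (sharpness)")
plt.ylabel("number of tiles")
plt.title("Sharpness distribution across tissue tiles")
plt.show()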

Putting it all together

Now that we know what we can expect, it's just a matter of putting it all together and using it to construct an image map, in similar fashion to what we did for our original tissue detection.
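
A minimal sketch of such a map, continuing from the loop above: because sharp_map was filled with x as the outer loop and y as the inner loop, reshaping by (dims[0], dims[1]) and transposing yields a grid in the slide's orientation:

import numpy as np
import matplotlib.pyplot as plt

# sharp_map was filled column by column (x outer, y inner), so reshape and
# transpose to get rows = y, columns = x, matching the slide's orientation
sharpness_grid = np.array(sharp_map).reshape(dims[0], dims[1]).T

plt.imshow(sharpness_grid, cmap="viridis")
plt.colorbar(label="sharpness (Laplacian variance)")
plt.title("Per-tile sharpness map")
plt.show()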

The final result looks like this:

In closing

Sectioning a slide is a continuous operation, and except for folding artifacts, you shouldn't expect any abrupt changes. Tissue can be expected to gradually fade in and out of focus. And while scanners have gotten better at compensating for uneven tissue thickness, we're not quite there yet, and automated analysis based on a technique like the one we're proposing here can help.

Last but not least, we decided on a new way to organize our sample code. At http://host.pathomation.com/realdata/ you can from now on find all sample code in a single location. For example, the Jupyter notebook that belongs with this entry is available at http://host.pathomation.com/realdata/jupyter/realdata%20033%20-%20look%20sharp.ipynb

Exploiting the Pathomation software stack for business intelligence (BI)

A customer query

As part of our commercial offering, Pathomation offers hosting services for those organizations that don't want to make a long-term investment in on-premise server hardware, or whose server farm is incompatible with our system requirements (PMA.core requires a Microsoft Windows stack).

One such customer came to us recently and asked how much space they were consuming on the virtual machine they rented from us.

This was our answer:

And this is how we obtained the answer. We wrote the following script:

In Python:


from pma_python import core

def get_slide_size(session, slideRef):
    # physical size (in bytes) of a single slide, as reported by PMA.core
    info = core.get_slide_info(slideRef, sessionID = session)
    return info["PhysicalSize"]

def get_total_storage(session, path):
    # sum the sizes of all slides under a given path (recursively)
    total = 0
    for slide in core.get_slides(path, session, True):
        total = total + get_slide_size(session, slide)
    return total

def map_storage_usage(srv, usr, pwd):
    # build a dictionary of consumed bytes per root-directory
    sess = core.connect(srv, usr, pwd)
    map = {}
    for rd in core.get_root_directories(sess):
        map[rd] = get_total_storage(sess, rd)
    return map

You can also do this in PHP, of course:


<?php
require "lib_pathomation.php"; 	// PMA.php library

use Pathomation\PmaPhp\Core;

function getTotalStorage($sessionID, $dir) {
        $slides = Core::getSlides($dir, $sessionID, $recursive = FALSE);
        $infos = Core::GetSlidesInfo($slides, $sessionID);
        
        echo "Got slides for ".$dir.PHP_EOL;

        $func = function($value) {
            return $value["PhysicalSize"];
        };

        $s = array_sum(array_map($func, $infos));
        return $s;
}

function mapStorageUsage($serverUrl, $username, $password) {
    $sessionID = Core::Connect($serverUrl, $username, $password);
    
    $map = array();
    $rootdirs = Core::getRootDirectories($sessionID);
    foreach ($rootdirs as $rd) {
        $map[$rd] = getTotalStorage($sessionID, $rd);
        
    }
    return $map;
}
?>

Creating a dictionary that contains the number of consumed bytes per root-directory now comes down to using just a single line of code:

In Python:

print(map_storage_usage("http://server/pma.core", "user", "secret_password"))

And in PHP:


print_r( mapStorageUsage ("http://server/pma.core", "user", "secret_password"));

Making our code more robust

Unfortunately, chances are that if you've been running your PMA.core tile server for a while, the script comes crashing down miserably. It's the “should have, could have, would have” syndrome of programming. There's probably a meme for this out there somewhere (in the interest of productivity, we won't search for it ourselves but let you look for that one). What it comes down to is this: mounting points and access permissions change, and at some point in any large enough data repository some files are going to end up corrupt, meaning either the core.get_slides() call is going to go wrong, or the core.get_slide_info() call.

So to make our script a bit more robust, we can add try… catch… exception handling in Python:


def get_slide_size(session, slideRef):
    try:    
        info = core.get_slide_info(slideRef, sessionID = session)
        return info["PhysicalSize"]
    except:
        print("Unable to get slide information from", slideRef)
        return 0

def get_total_storage(session, path):
    total = 0
    try:
        for slide in core.get_slides(path, session, True):
            total = total + get_slide_size(session, slide)
    except:
        print("unable to get data from", path)
    return total

def map_storage_usage(srv, usr, pwd):
    sess = core.connect(srv, usr, pwd)
    map = {}
    for rd in core.get_root_directories(sess):
        map[rd] = get_total_storage(sess, rd)
    return map

As well as in PHP:


<?php
require_once "lib_pathomation.php";

use Pathomation\PmaPhp\Core;

function getTotalStorage($sessionID, $dir) {
    try {
        $slides = Core::getSlides($dir, $sessionID, $recursive = FALSE);
        $infos = Core::GetSlidesInfo($slides, $sessionID);
        
        $func = function($value) {
            return $value["PhysicalSize"];
        };

        $s = array_sum(array_map($func, $infos));
        return $s;
    }
    catch(Exception $e) {
        // echo "unable to get data from ".$dir.PHP_EOL;
    }
}

function mapStorageUsage($serverUrl, $username, $password) {
    $sessionID = Core::Connect($serverUrl, $username, $password);    
    $map = array();
    $rootdirs = Core::getRootDirectories($sessionID);
    foreach ($rootdirs as $rd) {
        $map[$rd] = getTotalStorage($sessionID, $rd);   
    }
    return $map;
}

print_r(mapStorageUsage("http://server/core/", "user", "secret"));
?>

In Python, you can also use the standard pprint module to clean up the output a bit:
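
For instance, re-using the placeholder server URL and credentials from the examples above:

from pprint import pprint

pprint(map_storage_usage("http://server/pma.core", "user", "secret_password"))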

And in PHP, we add a convenient method to make the numbers a bit easier to read:


function human_filesize($bytes, $decimals = 2) {
    $size = array('B','kB','MB','GB','TB','PB','EB','ZB','YB');
    $factor = floor((strlen($bytes) - 1) / 3);
    return sprintf("%.{$decimals}f", $bytes / pow(1024, $factor)) . @$size[$factor];
}

Input for business intelligence (BI)

In this article we showed how you can use automation to let Pathomation’s PMA.core tile server generate an overview report of how much space your slides consume. Depending on your specific folder structure, you can further customize this for breakdowns into sizable data-morsels that fit your particular appetite.

Given the average size of a slide, this kind of information can be vital to your organization: slides can accumulate fast, and it is important to keep a handle on their growth and space occupation.

Pathomation's software platform is more than just a slide viewing solution: it can help generate insights into your storage resource consumption and be used as a veritable planning and evaluation tool before actual investments take place.

A look at PMA.core

The core

PMA.core is the centerpiece of the Pathomation software platform for digital microscopy. PMA.core is essentially a tile server. It does all the magic described in our article on whole slide images, and is optimized to serve you the correct field of view, any time, any place, on any device. PMA.core enables digital microscopy content when and where you want it.

Our free viewer, PMA.start, is built on top of PMA.core technology as well. PMA.start contains a stripped version of PMA.core, lovingly referred to as PMA.core.lite 🙂

PMA.core supports the same file formats as PMA.start does, and then some.

It acts as an “honest broker”, by offering pixels from as many vendor formats as possible.

Storage

With PMA.start, you’re limited to accessing slide content that’s stored on your local hard disk. External hard disks are supported as well, but at one point you end up with multiple people in your organization that need to access the same slide content. At that point, PMA.core’s central storage capabilities come into play.

PMA.core is typically installed on a (Windows) server machine and can access a wider variety of storage media than PMA.start can. You can store your virtual slides on the server’s local hard disk, but as your data grows, this is probably not the place you want to keep them. So you can offload your slides to networked storage, or even S3 bucket repositories (object storage).

Pathomation does not, in contrast with other vendors, require a formal slide registration or import process to take place. Of course our software does need to know where the slides are. This is done by defining a “root-directory”, which is in its most generic terminology “a place where your slides are stored”.

A root-directory can be a location on the server's hard disk, like c:\wsi. You can instruct your slide scanner to drop off new virtual slides on a network share, and likewise point PMA.core to \\central_server\incoming_slides\. Finally, you can store long-term reference material in an AWS bucket and define a root-directory that points to the bucket. The below screenshot shows a mixture of S3- and HDD-derived root-directories in one of our installations:

After defining your root-directory, the slides are either there or they are not, and their representation is instant. An implication of this is that you can manage your slides with the tools that you prefer, any way you want to. You can use Windows Explorer, or even the command line, should that end up being more convenient for you. Your S3 data can be managed through the AWS console, CloudBerry tools, or S3 explorer.

Security – Authentication

Another important aspect of PMA.core is access control. PMA.start is “always on”; no security credentials are checked when connecting to it. PMA.core in contrast requires authentication, either interactively through a login dialog, or automatically through the back-end API. In either case, upon success, a SessionID is generated that is used to track a user’s activity from thereon.

User accounts can be created interactively through the PMA.core user interface, or controlled through use of the API. Depending on your environment, a number of password restrictions can be applied. Integration with LDAP providers is also possible.

User accounts can be re-used simultaneously in multiple applications. You can be logged in through the PMA.core user interface, and at the same time use the same credentials to run an interactive script in Jupyter (using the PMA.core interface to monitor progress).

The interface in PMA.core itself at all times gives an overview of which users are connected through which applications, and even allows an administrator to terminate specific sessions.

Security – Authorization

Our software supports authorization on top of authentication.

User permissions in PMA.core are kept simple and straightforward: a user account can have the Administrative flag checked or not, meaning that they can get access to PMA.core directly, or only indirectly through downstream client applications like PMA.view, PMA.control or the API. Another useful attribute to be aware of is CanAnnotate, which is used to control whether somebody can make annotations on top of a slide or not. Finally, an account can be suspended. This can be temporary, or can be mandated from a regulatory point of view as an alternative to deletion.

A root-directory can be tagged either as “public” or “private”. A public root-directory is a root-directory that is available to all authenticated users. In contrast, when tagged as “private”, the root-directory has an accompanying Access Control List (ACL) that determines who can access content in the root-directory.

The screenshot below shows the Administrative and Suspended flags for my individual user account, as well as what public and private root-directories I do or do not have access to:

Future versions of PMA.core can be expected to offer CRUD granularity.

A powerful forms engine

Form data exists everywhere. Information can be captured informally, like the stain used, or as detailed as an Electronic Lab Request Form (ELRF). This is why Pathomation offers the possibility to define forms as structured and controllable data entities. A form can consist of a couple of simple text-fields, or be linked to pre-defined (ontology-derived) dictionaries. Various other Pathomation software platform components help in populating these forms, including PMA.view.

Forms can be accompanied by ACLs. In order to avoid redundancy, a form ACL consists of a list of root-directories rather than user accounts. In a project-oriented environment, it makes more sense that certain forms apply to certain root-directories which represent types of slides. Similarly, in a clinical environment, it makes sense to have slides organized in root-directories per application type or by processing stage. Freshly scanned slides that haven't undergone a QA check yet can be expected to have different form data associated with them than FISH slides.

On-slide annotations

PMA.core supports graphical on-slide annotations. We support three types:

  • Native annotations embedded within a vendor’s file format
  • Third-party annotations coming from non-specific (image analysis) software
  • Pathomation annotations

Pathomation-created annotations are the easiest to understand. You have a slide, and you want to indicate a region of interest on it. This region of interest can be necrotic tissue, or proliferated tumor cells. For teaching purposes, you could have a blood smear and highlight the different immune cell types.

Pathomation annotations are stored as WKT and can be anything that can be encoded in WKT (which is a lot). You need a downstream client to create them, but the basic viewer included in PMA.core can be used to visualize them, and our PMA.UI JavaScript framework can be used to create your own annotation workflows.
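
To make this concrete, here's what a simple annotation might look like; the coordinates are made-up pixel values purely for illustration, and the shapely step is optional:

# a hypothetical rectangular region of interest, expressed in pixel coordinates as WKT
roi_wkt = "POLYGON ((1000 1000, 5000 1000, 5000 4000, 1000 4000, 1000 1000))"

# if you prefer to build geometries programmatically, shapely can generate the WKT for you
from shapely.geometry import box
print(box(1000, 1000, 5000, 4000).wkt)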

You could run an algorithm that does tissue detection and pre-annotates these regions for you.

In addition to making your own annotations, Pathomation can be used to integrate annotations from other sources. Certain file formats like 3DHistech’s MRXS file format or Aperio’s SVS file format have the ability to incorporate annotations. If you have such slides, the embedded annotations should automatically show when viewing the slide using any Pathomation slide rendering engine.

Last but not least, we can integrate third-party annotations. Currently, we support three formats:

Third-party as well as native annotations are read-only; you cannot modify them using Pathomation software.

Even more slide metadata

What about other structured data?

We think our forms engine is pretty nifty, but we're not so arrogant (or clueless) as to pretend that we foresee everything you could ever want to capture in any form, shape, or size. It is also quite possible that a slide meta-database already exists in your organization.

For those instances where existing data stores are available, we offer the possibility to link external content. Rather than importing data into PMA.core (also a possibility, actually), we allow you to specify an arbitrary connection string that points to an external resource, which may for example represent an Oracle database. Your next step is to define the query to run against this resource, along with a field identifier (which can be a regular expression) that is capable of matching specific records with individual slides.

Examples of external data sources can be:

  • Legacy IMS data repositories that are too cumbersome to migrate
  • Proprietary database systems developed as complement to lab experiments
  • Back-end LIMS/VNA/PACS databases that support other workflows in your organization

Do try this at home

In this post, we’ve highlighted the main features of our PMA.core “honest broker” WSI engine aka tile server aka pixel extractor aka Image Management Server (IMS).

Warning: sales pitch talk following below…

If you've liked interacting with PMA.start and work in an organization where slides are shared with various stakeholders, you should consider getting a central PMA.core server as well. PMA.core is the centerpiece of the Pathomation software platform for digital microscopy, and whether you prefer all-inclusive out-of-the-box viewing software or are developing your own integrated processing pipelines, PMA.core can be the ideal middleware that you've been looking for. Contact us today for a demo or a sandboxed environment where you can try out our components for yourself.

Ok, we’re done. Seriously, PMA.core is cool. Let us help you in your quest for vendor-agnostic digital pathology solutions, and (amongst others) never worry about proprietary file formats again.

 

To index or not to index?

A question we get frequently from potential customers is “how do we import our slides into your system (PMA.core)?”. The answer is: we don't. In contrast with other Image Management Systems (IMS), we opted not to go for a central database. In our opinion, databases only seem like a good idea at first, but inevitably cause problems down the road.

People also ask us “how easy is it to import our slides?”. The latter phrasing is probably more telltale than the first, as it implies that this is apparently not the case with other systems, i.e., other systems often put you in a situation where it is not easy to register slides. It still puts us in an awkward position, as we then have to explain that there is no import process as such. Put the slides where you want them, and that's it. You're done. Finito.

Here are some of the reasons why you would want a database overlaying your slides:

  • Ease of data association. Form data and overlaying graphical annotation objects can be stored with the slide’s full path reference as a foreign key.
  • Ease of search for a specific slide. Running a search query on a table is decidedly faster than parsing a potentially highly hierarchical directory tree structure.
  • Rapid access to slide metadata. Which is not the same as our first point: data association. Slide metadata is information that is already incorporated into the native file format itself. A database can opt to extract such information periodically and store it internally in a centralized table structure, so that it is more easily extracted in real time when needed.

When taken together, the conclusion is that such databases are nothing more than glorified indexing systems. And such an indexing system invariably turns into a gorilla… an 800 lb gorilla, for that matter… Let's talk about it:

  • An index takes time to build
  • An index consumes resources, both during and after construction.
  • With a rapidly evolving underlying data structure, the index is at risk of being behind the curve and not reflecting the actual data
  • In order to control the index and not constantly having to rebuild it, a guided (underlying) data management approach may be needed
  • At some point, in between index builds, and outside the controlled data entry flow, someone is going to do something different to your data
  • Incremental index builds to bypass performance bottlenecks are problematic when data is updated

Now there are scenarios where all of the above doesn't matter, or at least doesn't matter all that much. Think of a conventional library catalog: does it really matter if your readers only find out about the newest Dean Koontz book a day after it was actually registered in the system? Even with rapidly moving inventory systems: when somebody orders an item that is erroneously no longer available from your webstore… Big whoop. The item is placed on back-order, or the end-user simply cancels the order. If you end up making the majority of your customers mad this way, then the problem is not in your indexing system, but in your supply chain itself. There's no doubt that for webshops and library catalogs, indexes speed up search, and the pros on average outweigh the cons.

But digital pathology is different. Let’s look at each of the arguments against indexing and see how much weight they carry in a WSI environment:

  • An index takes time to build. When are you going to run it? Digital pathology was created so you can have round the clock availability of your WSI data. Around the clock. Across time-zones. Anything that takes time, also takes resources. CPU cycles, memory. So expect performance of your overall system to go down while this happens.
  • Resource (storage) consumption during and after construction. So be careful about what you are going to index in terms of storage. Are you going to index slide metadata? Thumbnails? How much data are you practically talking about? How much data are you going to index to begin with? And how much of your indexed data will realistically ever be accessed (more on that subject in a separate post)?
  • Rapidly evolving underlying data structure. Assume a new slide is generated once every two minutes, and a quantification algorithm (like HistoQC) takes about a minute to complete per slide. This means you have a new datapoint every minute. And guess which datapoint the physician wants to see now now NOW…
  • Guided data management approach. One of the great uses of digital pathology is the sharing of data. You can share your data, but others can also share data with you. So apart from your in-house scanner pipeline: what are you going to do with the external hard disk someone just sent you? Data hierarchies come in all shapes and sizes. Sometimes it's a patient case; sometimes it's toxicological before/after results; sometimes it's a cohort from a drug study. Are you going to set up data import pipelines for all these separate scenarios? Who's going to manage those?
  • Sometimes, somewhere, someone is going to do something different to your data. Because the above pipelines won’t work. No matter how carefully you design them. Sometimes something different is needed. You need to act, and there’s no time for re-design first. The slide gets replaced, and now the index is out-of-date. Or the slide is renamed because of earlier human error, and the index can’t find it anymore. And as is often the case: this isn’t about the scenarios that you can think of; but about the scenarios you can’t.

Safe to say that we think an indexing mechanism for digital pathology and whole slide images is not a good idea. Just don’t do it.