To index or not to index? – Real Data Digital Pathology blog

A question we get frequently from potential customers is “how do we import our slides into your system (PMA.core)?”. The answer is: we don’t. In contrast with other Image Management Systems (IMS), we opted to not go for a central database. In our opinion, databases only seem like a good idea at first, but inevitable always cause problems down the road.

People also ask us “how easy is it to import our slides?”. The latter phrasing is probably more telltale than the first, as it assumes that is not the case apparently with other systems, i.e., other systems often put you in a situation where it is not easy to register slides. It still puts us in an awkward position, as we then actually have to explain that there is no import process as such. Put the slides where you want them, and that’s it. You’re done. Finito.

Here are some of the reasons why you would want a database overlaying your slides:

Ease of data association. Form data and overlaying graphical annotation objects can be stored with the slide’s full path reference as a foreign key.
Ease of search for a specific slide. Running a search query on a table is decidedly faster than parsing a potentially highly hierarchical directory tree structure
Rapid access to slide metadata. Which is not the same as our first point: data association. Slide metadata is information that is already incorporated into the native file format itself. A database can opt to extract such information periodically and store it internally in a centralized table structure, so that it is more easily extracted in real time when needed.

When taken together, the conclusion is that such databases are nothing more but glorified indexing systems. Such an indexing system invariable turns into a gorilla… An 800 lbs gorilla for that matter… Let’s talk about it:

An index takes time to build
An index consumes resources, both during and after construction.
With a rapidly evolving underlying data structure, the index is at risk of being behind the curve and not reflecting the actual data
In order to control the index and not constantly having to rebuild it, a guided (underlying) data management approach may be needed
At some point, in between index builds, and outside the controlled data entry flow, someone is going to do something different to your data
Incremental index builds to bypass performance bottlenecks are problematic when data is updated

Now there are scenarios where all of the above doesn’t matter, or at least doesn’t matter all that much. Think of a conventional library catalog; does it really matter if your readers can only find out about the newest Dean Koontz book that was purchased a day after it was actually registered in the system? Even with rapidly moving inventory systems: when somebody orders an item that is erroneously no longer available from your webstore… Big whoop. The item is placed on back-order, or the end-user simply cancels the order. It you end up making the majority of your customers mad this way, then the problem is not in your indexing system, but in your supply chain itself. There’s no doubt that for webshops and library catalogs, indexes speed up search, and the pros on average outweigh the cons.

But digital pathology is different. Let’s look at each of the arguments against indexing and see how much weight they carry in a WSI environment:

An index takes time to build. When are you going to run it? Digital pathology was created so you can have round the clock availability of your WSI data. Around the clock. Across time-zones. Anything that takes time, also takes resources. CPU cycles, memory. So expect performance of your overall system to go down while this happens.
Resource (storage) consumption during and after construction. So be careful about what you are going to index in terms of storage. Are you going to index slide metadata? Thumbnails? How much data are your practically talking about? How much data are you going to index to begin with? And how much of your indexed data will realistically ever be accessed (more on that subject in a separate post)?
Rapidly evolving underlying data structure. Assume a new slide is generated once every two minutes, and a quantification algorithm (like HistoQC) takes about a minute to complete per slide. This means you have a new datapoint every minute. And guess which datapoint the physician wants to see now now NOW…
Guided data management approach. One of the great uses of digital pathology is the sharing of data. You can share your data, but other can also share it with you. So apart from your in-house scanner pipeline; what do you going to do with the external hard disk someone just sent you? Data hierarchies come in all shapes and sizes. Sometimes it’s a patient case; sometimes it’s toxicological before/after results; sometimes it’s a cohort from a drug study. Are you going to setup data import pipelines for all these separate scenarios? Who’s going to manage those?
Sometimes, somewhere, someone is going to do something different to your data. Because the above pipelines won’t work. No matter how carefully you design them. Sometimes something different is needed. You need to act, and there’s no time for re-design first. The slide gets replaced, and now the index is out-of-date. Or the slide is renamed because of earlier human error, and the index can’t find it anymore. And as is often the case: this isn’t about the scenarios that you can think of; but about the scenarios you can’t.

Safe to say that we think an indexing mechanism for digital pathology and whole slide images is not a good idea. Just don’t do it.