Show me the data - Getting quality datasets for your clinical AI models

Anyone starting on clinical research that uses AI or machine learning has faced this challenge: we need good-quality labeled data to train our models. Good data trumps complex models almost all the time. But medical and clinical data come with concerns around privacy and acceptable use, so it is very hard to find complete patient datasets if you are not already working in a research lab.

In this post, I will list where we can get good-quality datasets. I will focus on cancer, as this is the area I have some familiarity with through my own independent research. We will also navigate a very popular data platform that I have used to train my own models. Most, if not all, of the data portals listed here are curated, managed and maintained under the auspices of the National Institutes of Health’s (NIH) National Cancer Institute. The amount of data collected and made available for free is amazing, and it furthers research on a disease that is increasing in its global reach and intensity. What a great use of public funding!

Major data sources for AI model training

  1. The Cancer Imaging Archive (TCIA) - The mother lode for cancer-focused images. Institutions can submit images, which go through a review and approval process by the TCIA Advisory Group before being de-identified and curated into a “collection”, or grouping, based on cancer type (lung, brain etc.), image type (MRI, CT etc.), and research focus. In addition to the images, many collections include the corresponding patient clinical data, which allows for AI use cases such as survival analysis in addition to tumor detection and segmentation. However, I have observed gaps in the clinical data that need to be accounted for in any model training and results analysis. Certain collections have limited access, and you will need to fill out a restricted license request form to get those datasets. Read more about it here.

  2. The Cancer Genome Atlas (TCGA) / Genomic Data Commons (GDC) - The one-stop trusted data platform for cancer-related genomic data, collected from over 11,000 patients over a period of 12 years and covering 33 cancer types. The datasets span clinical, biospecimen and molecular characterization data. The data can be accessed through the GDC Portal website or through APIs. The website also provides visualization and basic analysis functionality.

  3. cBioPortal - A curated collection of cancer-related datasets from the TCGA and Memorial Sloan Kettering (MSK) Cancer Center in an easy-to-download format. The data is categorized by cancer study and cancer type.

  4. The Cancer Proteome Atlas (TCPA) Portal - A comprehensive data portal focused on functional and cancer perturbed proteomics, grouped by patient cohorts and cancer cell lines.
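To make the "through APIs" option for the GDC a little more concrete, here is a minimal sketch of how a query for the cases in one project could be put together in Python. It assumes the public GDC REST endpoint at `https://api.gdc.cancer.gov/cases` and its `filters`, `fields`, `format` and `size` query parameters. The project ID `TCGA-GBM` and the field names are illustrative choices on my part, and the helper names `build_cases_url` and `fetch_cases` are hypothetical, so check everything against the GDC API documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public GDC "cases" endpoint.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_cases_url(project_id, fields, size=10):
    """Build a GET URL that lists cases belonging to one GDC project."""
    # GDC filters are nested JSON objects: an operator plus its content.
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),  # comma-separated list of fields to return
        "format": "JSON",
        "size": str(size),  # maximum number of records in the response
    }
    return GDC_CASES_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_cases(url, timeout=30):
    """Send the request and return the case records (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: matching records are returned under data -> hits.
    return payload["data"]["hits"]
```

Calling `fetch_cases(build_cases_url("TCGA-GBM", ["submitter_id", "demographic.gender"], size=5))` should return up to five case records, assuming the endpoint and field names above are still current.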

The Cancer Imaging Archive (TCIA) Primer

When I first began looking for cancer-related datasets, I did not know what data was available, how complete and usable it was, or what I could do with it. If you are in the same place, the TCIA is a great resource to get started.

The best place to start is to browse their collections, where you will find a table of available datasets listing the cancer type, species (most are human), number of subjects (patients) in the study, the type of image data (MRI, CT scans), whether there is supporting clinical data, and the access type (public or limited access). Most of the MRI data is in the Digital Imaging and Communications in Medicine (DICOM) format, but certain collections also provide other popular formats such as NIfTI.
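Since a download can mix DICOM and NIfTI files, it helps to check what you actually have before picking a reader library. Below is a small sketch that guesses the format from the header bytes using only the Python standard library. It relies on two conventions: DICOM Part 10 files carry the marker `DICM` after a 128-byte preamble, and NIfTI-1 files end their 348-byte header with a magic string. The function name `sniff_image_format` is my own, and the check is deliberately simple. It ignores NIfTI-2 and older DICOM files written without a preamble.

```python
import gzip


def sniff_image_format(path):
    """Guess whether a file is DICOM or NIfTI-1 from its header bytes."""
    with open(path, "rb") as handle:
        head = handle.read(348)

    # Gzip-compressed files (e.g. .nii.gz) start with the bytes 1f 8b;
    # decompress just enough to inspect the inner header.
    if head[:2] == b"\x1f\x8b":
        with gzip.open(path, "rb") as handle:
            head = handle.read(348)

    # DICOM Part 10: a 128-byte preamble followed by the marker "DICM".
    if len(head) >= 132 and head[128:132] == b"DICM":
        return "dicom"

    # NIfTI-1: the 348-byte header ends with a 4-byte magic string,
    # "n+1" (single .nii file) or "ni1" (.hdr/.img pair), NUL-terminated.
    if len(head) >= 348 and head[344:348] in (b"n+1\x00", b"ni1\x00"):
        return "nifti1"

    return "unknown"
```

Once you know the format, a dedicated library such as pydicom or nibabel is the usual next step for reading the pixel data.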

Starting our AI journey

Glioblastoma is a malignant brain tumor with a 5-year survival rate of 6.8% and an average survival time of only 8 months. It is a horrendous tumor that has been studied for many decades, with minimal advances in treatment and a survival rate that remains unchanged. Can AI help here?

One of the more popular collections on TCIA, judging by how often research papers reference it, is the UPenn GBM dataset. This dataset contains multiparametric MRI scans of Glioblastoma Multiforme (GBM) patients, automated segmentation labels that were then reviewed and corrected by expert radiologists, and associated patient clinical data. This is a great dataset to begin our AI journey.
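Because the clinical data in TCIA collections can have gaps, a quick first check on any clinical table is how much of each column is missing. Here is a minimal sketch using only the standard library. The column names and rows are made up for illustration and are not taken from the UPenn GBM files, and the set of "missing" markers is an assumption you should adjust to the file you download.

```python
import csv
import io

# Values treated as "missing"; adjust to match the file you download.
MISSING_MARKERS = {"", "na", "n/a", "nan", "not available", "unknown"}


def missing_fraction_by_column(csv_text):
    """Return {column name: fraction of rows with a missing value}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    report = {}
    for column in rows[0].keys():
        missing = sum(
            1 for row in rows
            if (row[column] or "").strip().lower() in MISSING_MARKERS
        )
        report[column] = missing / len(rows)
    return report


# Made-up rows with hypothetical column names, for illustration only.
sample = """patient_id,age_at_scan,survival_days
P001,61.2,312
P002,,Not Available
P003,57.8,
P004,70.1,145
"""

for column, fraction in missing_fraction_by_column(sample).items():
    print(f"{column}: {fraction:.0%} missing")
```

A report like this tells you early whether a target such as survival time is populated well enough to train on, or whether you need to drop patients or handle the gaps some other way.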

In my next post, I will write about how to get the UPenn GBM data and prepare it for analysis and modeling. It took me a while to understand the different types of MRI scans and what they meant. Here’s a great writeup on all things MRI, which is sufficient for us to get started on our AI models.