Harmonizing and mining the world’s transcriptomic data for drug discovery and predictive medicine

swissQuant is very pleased to have hosted Dr. Jana Sponarova and Dr. Philip Zimmermann from Nebion AG for a presentation on the emerging field of transcriptomics research, its abundant data, meticulous curation procedures, as well as Nebion’s mission, approach and vision for the future.


Transcriptomics is the study of the transcriptome: the complete set of RNA transcripts produced by the genome under specific circumstances or in a specific cell. While most studies focus on messenger RNA (mRNA) molecules, the broader field also includes RNA that is not translated into proteins.

Technology for genome analysis has improved vastly during the last decade. This has cut the cost of genome sequencing by several orders of magnitude and caused a meteoric rise in the volume of data produced by research experiments in the field: projections for the storage capacity required for genomics-related data in 2025 are as high as 40 exabytes. For these reasons, many companies have recognized the potential of genome analysis. Most businesses focus on B2C services, where customers provide a saliva sample that is analyzed to generate reports on ancestry and particular genetic predispositions. Consequently, more than 90% of the data is used for the analysis of isolated cases, discovery and profiling, as enabled by next-generation sequencing methodology. Research areas that cross-analyze larger numbers of transcriptomes in combination with proteome and metabolome data, and can thereby uncover diverse targets and biomarkers, remain somewhat under the radar despite their potential and likely impact.

Founded in 2008 as a spin-off from ETH Zurich, Nebion AG has established itself as a key provider of innovative technologies for large-scale curation, quality control, normalization, integration and global mining of transcriptomics data. Its aim is to learn from the transcriptome in order to better understand biology and identify targets and biomarkers more effectively. Identifying the right transcriptomics dataset(s) for a specific task, in either an academic or a commercial setting, is inherently difficult: datasets are largely non-standardized, case-specific and often originate from very disparate public sources and experiments. Besides providing data and discovery services to remedy these issues, Nebion also offers software tools for analytics and visualization on top of its deeply curated and structured database of microarray, bulk-tissue and single-cell RNA sequencing data collected from public repositories.

Data curation
The construction of reliable databases and efficient querying engines is preceded by a meticulous process of data curation. After the original data is retrieved from public repositories (e.g. GEO, ArrayExpress), it is run through a set of quality control checks and normalization procedures, before being annotated using controlled vocabularies by a team of highly qualified scientists. This includes, among other steps:

  • checks of differential expression
  • verification of consistency of signal values between experiments
  • detection and handling of duplicates (typically caused by re-use of control groups or repeated studies in repositories; affects approximately 10% of the data)
  • exclusion of data rendered unusable by low technical quality or corruption (5-10% of the data)
  • fixing discrepancies between data retrieved from repositories and published results
  • peer reviews of annotations for accuracy and consistency

As a result of this process, data and metadata from studies are often enriched and improved in quality, making expression values and annotations comparable across studies.
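
Two of the checks listed above, duplicate detection and exclusion of low-quality samples, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not Nebion's actual pipeline: the function names, the sample identifiers, the toy expression values and the `max_shift` threshold are all hypothetical, and a real curation workflow would involve far more sophisticated checks and expert review.

```python
import hashlib
from statistics import median


def fingerprint(values, digits=3):
    """Hash a rounded expression profile so identical re-uploads collide."""
    rounded = ",".join(f"{v:.{digits}f}" for v in values)
    return hashlib.sha256(rounded.encode("utf-8")).hexdigest()


def find_duplicates(samples):
    """Return groups of sample IDs whose expression profiles are identical."""
    groups = {}
    for sample_id, values in samples.items():
        groups.setdefault(fingerprint(values), []).append(sample_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def flag_low_quality(samples, max_shift=2.0):
    """Flag samples whose median signal lies far from the typical median.

    max_shift is a hypothetical threshold chosen for this toy example.
    """
    medians = {sid: median(vals) for sid, vals in samples.items()}
    centre = median(medians.values())
    return sorted(sid for sid, m in medians.items() if abs(m - centre) > max_shift)


# Toy log-scale expression values for four hypothetical samples.
samples = {
    "study_A/control_1": [7.1, 8.3, 5.9, 10.2],
    "study_B/control_1": [7.1, 8.3, 5.9, 10.2],  # control re-used in a second study
    "study_A/treated_1": [7.4, 9.1, 6.2, 10.0],
    "study_C/treated_1": [1.2, 1.5, 0.9, 1.1],   # suspiciously low overall signal
}

duplicates = find_duplicates(samples)
low_quality = flag_low_quality(samples)
print("Duplicate groups:", duplicates)
print("Low-quality samples:", low_quality)
```

In this toy data the re-used control sample is reported as a duplicate group and the low-signal sample is flagged for exclusion. Hashing rounded values only catches exact re-uploads; near-duplicates would need a similarity measure such as correlation.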

Machine learning
The current Nebion compendium contains over 4000 manually curated studies and is constantly growing. Studies related to oncology, neurology, respiratory disease, immunology and cardiology represent important parts of the compendium, but studies from all major therapeutic areas have been curated. The Genevestigator engine interfacing the database is currently capable of identifying novel targets and biomarkers by means of detailed analysis and visualization, indicator discovery and information retrieval. According to Dr. Zimmermann, a highly promising avenue for innovation presents itself in the form of predictive analytics powered by machine learning. A particularly interesting use case is disease progression monitoring. Disease timelines can be roughly partitioned into four stages:

  • susceptibility stage: pre-exposure period, individuals display vulnerability or heightened risk
  • subclinical stage: no recognizable symptoms
  • clinical stage: recognizable signs and symptoms
  • post-clinical stage: resulting in recovery, disability or death

Typically, diagnosis occurs once the first symptoms have appeared (or later), which for some of the more serious ailments leaves very little time for treatment, often at a high cost. Such situations can potentially be prevented with genomic and transcriptomic profiling and preventative treatment. By leveraging abundant transcriptomic data, ML can be used to discover markers and latent drivers for an even larger set of diseases, allowing doctors to monitor their patients’ susceptibility to a condition and to detect the onset of disease at an early stage. Moreover, carrying out regular blood diagnostics and longitudinal studies using RNA profiling would help overcome the issue of genetic variability, enrich the existing transcriptomic data and enable preventative treatment and effective disease monitoring. However, the current regulatory environment, the invasiveness of certain sampling procedures and the relatively small collection of available blood sample datasets pose significant barriers to progress in this area.
