
A computational tool that can rapidly identify and analyse coronavirus mutations

Susheela Srinivas

With several new COVID variants emerging in recent months, the importance of analysing and drawing insights from the vast amounts of viral genome sequencing data we currently have available cannot be overstated. A team of researchers at the Advanced Centre for Treatment Research & Education in Cancer (ACTREC), Navi Mumbai, have come up with a computational tool called the Infectious Pathogen Detector (IPD), which can quickly process large amounts of DNA sequencing data to spot pathogens and mutations in their genomes.


Gene sequencing has expanded our understanding of genetic codes and provided valuable insights into genomic mutations. A recent example is the analysis of the emerging dominant strains of SARS-CoV‑2. However, advanced sequencing methods, collectively called next-generation sequencing (NGS) platforms, generate diverse sequencing datasets. This diversity makes it challenging to scrutinise and compare a new sample’s genetic data with the existing global data. A tool that can integrate such varied data and quickly identify locally circulating dominant mutants can prove indispensable at this juncture.

A team of researchers led by Amit Dutt at the Advanced Centre for Treatment Research & Education in Cancer (ACTREC), Navi Mumbai, have developed a computational tool called the Infectious Pathogen Detector (IPD). The tool, which comes with a graphical user interface, can identify 1060 pathogens, including the SARS-CoV‑2 virus, from a biological sample’s genetic data.

We catch up with Amit Dutt to gain insights into the IPD and its SARS-CoV‑2 module.

What is the Infectious Pathogen Detector?

The Infectious Pathogen Detector (IPD) is an open-source computational tool useful for identifying pathogen strains in a given biological sample based on DNA sequencing data. To use the IPD, the user has to upload the genomic sequence (genetic code) of a biological sample. The IPD runs a rapid analysis on the sample from which it can identify the presence of 182 viral strains (including SARS-CoV‑2) and 868 bacterial strains as well as their mutations.

The IPD automatically generates molecular reports such as the abundance of a pathogen or the number of mutations (mutation rate) present in the sample. Also, the IPD identifies if the sample contains any new mutations. Notably, the tool can recognise ‘mutation hotspots’, the regions in the pathogen’s genome where mutations are predominantly seen, which is crucial to our understanding of how the pathogen is evolving. After identifying these parameters, the IPD then rapidly compares the observations with a global genomic database and classifies the pathogen as a novel variant or one of the commonly occurring ones.

We initially designed the IPD to detect cancer-causing pathogens. However, when the pandemic broke out, we included a SARS-CoV‑2 module for rapid analysis of the COVID-19 viral genomic database (GISAID).

To use the IPD’s SARS-CoV‑2 module, a researcher can upload the molecular sequence of a COVID-19 sample to the IPD server hosted at ACTREC. The IPD detects the viral abundance in the sample, then analyses the mutations and classifies them with reference to the original Wuhan viral strain. The classification, called the phylogenetic clade, shows if the sample is the common prevailing local variant or if a newer one has emerged. The user can then opt to generate detailed reports of the analysis.

How does the IPD work and what are its features?

The IPD comprises data processing elements or computerised algorithms that are linked serially to form a data pipeline. The processing elements work sequentially — they are so connected that one stage’s data output becomes the next stage’s input data.
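As a rough illustration of the serial pipeline idea described above, the sketch below chains a few toy stages so that each stage's output becomes the next stage's input. The stage names (`normalise_case`, `drop_ambiguous`, `drop_short`) and the reads are hypothetical and are not taken from the IPD's code.

```python
from typing import Callable, List

Read = str
Stage = Callable[[List[Read]], List[Read]]


def normalise_case(reads: List[Read]) -> List[Read]:
    """Hypothetical stage: convert every read to upper case."""
    return [r.upper() for r in reads]


def drop_ambiguous(reads: List[Read]) -> List[Read]:
    """Hypothetical stage: discard reads with bases other than A, C, G, T."""
    return [r for r in reads if set(r) <= set("ACGT")]


def drop_short(reads: List[Read], min_length: int = 4) -> List[Read]:
    """Hypothetical stage: discard reads shorter than min_length."""
    return [r for r in reads if len(r) >= min_length]


def run_pipeline(reads: List[Read], stages: List[Stage]) -> List[Read]:
    """Run the stages in order, feeding each output into the next stage."""
    data = reads
    for stage in stages:
        data = stage(data)
    return data


# Toy reads, invented for illustration only.
reads = ["acgtac", "ACGNNT", "ac", "ttgaca"]
result = run_pipeline(reads, [normalise_case, drop_ambiguous, drop_short])
print(result)
```

Because every stage takes and returns the same kind of data, stages can be reordered, added or removed without changing the rest of the pipeline.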

The most important feature of the IPD is its versatility. There are currently several genome sequencing methods that generate heterogeneous genomic data, for example, whole exome, whole transcriptome or whole genome. Some of the sequences generated by gene sequencing methods are long-read data that comprise the entire DNA sequence, running into thousands of base pairs. In contrast, others are short-read — of a few hundred base pairs — obtained from a DNA fragment. Analysing such heterogeneous data can be a challenge.

However, the IPD normalises data from any of these inputs. It employs computational subtraction methodology and statistical analysis to eliminate the non-relevant parts of the sample sequence and zero in on the pathogen’s genetic codes. The sample is then rapidly compared for mutations, running through a comprehensive reference library of genomic databases corresponding to 1060 different pathogens.
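The general idea of computational subtraction can be sketched as follows: reads that match the host are set aside, and only the remainder is compared against pathogen references. This is a toy sketch under strong assumptions. It uses exact substring matching on invented sequences, whereas real tools rely on read aligners, and the function names and reference labels are hypothetical rather than the IPD's own.

```python
from typing import Dict, List


def subtract_host(reads: List[str], host_genome: str) -> List[str]:
    """Keep only the reads that do not occur in the host sequence."""
    return [r for r in reads if r not in host_genome]


def count_pathogen_hits(reads: List[str], references: Dict[str, str]) -> Dict[str, int]:
    """Count, per reference, how many reads occur in that reference."""
    counts = {name: 0 for name in references}
    for read in reads:
        for name, sequence in references.items():
            if read in sequence:
                counts[name] += 1
    return counts


# Invented toy sequences, for illustration only.
host_genome = "AAAACCCCGGGGTTTT"
references = {
    "virus_A": "ACGTACGTTGCA",
    "bacterium_B": "GGATCCGGATTA",
}
reads = ["AACCCC", "CGTACG", "GTTGCA", "ATCCGG", "GGGGTT"]

non_host_reads = subtract_host(reads, host_genome)
hits = count_pathogen_hits(non_host_reads, references)
print(non_host_reads)
print(hits)
```

The per-reference counts give a crude stand-in for pathogen abundance; a real analysis would normalise them and apply statistical tests.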

What are some of your findings so far using IPD’s SARS-CoV‑2 module?

Our study (preprint submitted to bioRxiv) describes a comprehensive dataset of 200,865 samples collected from COVID-19 patients across 155 countries from the GISAID database. In all, the IPD detected 2.58 million mutations from these samples.

The team of researchers working on this project at Dutt Lab. Standing (from left to right): Bhasker Dharavath, Asim Joshi and Amit Dutt; Sitting: Aishwarya Rane and Sanket Desai

Our analysis revealed 13 mutation hotspots in the SARS-CoV‑2 genome. These occurred in at least one-fifth of the samples. In 40,000 or more samples, we found that, across the 27 proteins encoded by the SARS-CoV‑2 genome, more than half of all the nonsilent mutations were in five genes: S, N, M, ORF7a and ORF10. Nonsilent mutations lead to protein changes that are more likely to alter protein function and result in a ‘fitter’ virus. In that case, those changes may be favoured and naturally selected during the virus’s evolution, leading to a dominant variant. Also, the ORF proteins are presently understood to have no known function and are considered non-essential. However, we found that they exhibited an equal natural selection bias.
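To make the one-fifth criterion concrete, here is a minimal sketch that flags a genome position as a hotspot when it is mutated in at least a given fraction of samples. The function `find_hotspots`, the threshold parameter and the positions are all invented for illustration. This is not the study's method or data.

```python
from collections import Counter
from typing import Iterable, List, Set


def find_hotspots(sample_mutations: Iterable[Set[int]], min_fraction: float = 0.2) -> List[int]:
    """Return positions mutated in at least min_fraction of the samples."""
    samples = list(sample_mutations)
    counts: Counter = Counter()
    for positions in samples:
        # A set ensures each position is counted once per sample.
        counts.update(set(positions))
    total = len(samples)
    return sorted(p for p, c in counts.items() if c / total >= min_fraction)


# Ten toy samples with made-up mutated positions.
samples = [
    {10, 20}, {10}, {10, 30}, {30}, {40},
    {10}, set(), {20}, {30, 50}, {10},
]
hotspots = find_hotspots(samples)
print(hotspots)
```

In this toy data, positions 10, 20 and 30 are each mutated in at least two of the ten samples, so they pass the one-fifth threshold, while positions 40 and 50 do not.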

When we ran an analysis for the recently reported mutants B.1.1.7, B.1.351, and P.1 (commonly known as the UK, South Africa, and Brazil strains, respectively), none of them were significantly abundant in the samples accessed until 28 December 2020. This could be due to their inadequate representation in the database as they were emerging strains at that time.

Can you elaborate on what the above data may imply?

The SARS-CoV‑2 virus has around 29,000 bases in its genome, and each of the bases can mutate, leading to ‘point mutations’. However, point mutations are finite in number as they are a function of the number of bases. Our study found more than 2 million point mutations occurring across 21,016 unique bases. In my opinion, a catalogue of all possible such mutations would soon reach its limit.
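The arithmetic behind this answer can be worked through in a few lines. The sketch assumes that a point mutation means a single-base substitution, so each base can change to one of three alternatives, and it uses the rounded genome length of 29,000 bases quoted above. Both are simplifying assumptions.

```python
# Approximate genome length quoted in the interview.
GENOME_LENGTH = 29_000
# Assumption: each base (A, C, G or U) can be substituted by one of 3 others.
ALTERNATIVES_PER_BASE = 3

# Upper bound on distinct single-base substitutions under this assumption.
max_substitutions = GENOME_LENGTH * ALTERNATIVES_PER_BASE

# Number of unique positions at which the study observed point mutations.
mutated_positions = 21_016
fraction_mutated = mutated_positions / GENOME_LENGTH

print(max_substitutions)
print(round(fraction_mutated, 3))
```

Under these assumptions the catalogue is capped at 87,000 distinct substitutions, and roughly 72% of genome positions had already been seen mutated in the dataset, which is consistent with the remark that such a catalogue would soon reach its limit.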

However, with continued human-human transmission during the pandemic, combination mutations with higher transmission rates or with the ability to hoodwink the immune system will get selected and enriched in a population. These can emerge as new viral variants which can severely impact the disease outcome. This could also be the reason for the varying fatality or transmission rates observed in different countries. 

Did you also perform this analysis with Indian samples? What were your observations?

We analysed 3,361 full-length viral genomes derived from about 6,000 Indian COVID-19 patient samples present in the GISAID database (as available on 28 December 2020). In all, we found that the mutation rate was comparable with the global rate. We noticed 5.17 nonsilent and 4.39 silent mutations per sample, along with 4,422 unique mutations not reported elsewhere. The Indian samples also shared the same hotspots as the global samples. There were no significant occurrences of the UK, Brazil and South Africa variants. However, as the Indian database is very small compared to the global database, we cannot rule out any emerging occurrence of these variants in our population. 

Why is the Indian dataset small compared to the global dataset?

Presently in India, the emphasis is more on detecting COVID-19 cases than on sequencing. Most of the public sequencing data we used comprised submissions made by researchers from institutes like the Centre for Cellular & Molecular Biology (CCMB), Hyderabad, CSIR-Institute of Genomics & Integrative Biology (IGIB), New Delhi, ACTREC, National Centre for Biological Sciences (NCBS), Bengaluru, to name a few. Subsequently, the authorities have undertaken an active and concrete step by constituting the Indian SARS-CoV‑2 genome consortium (INSACOG), similar to a few prominent consortiums abroad (e.g. COVID-19 genomics UK consortium (COG-UK) and the SARS-CoV‑2 Sequencing for Public Health Emergency Response, Epidemiology, and Surveillance (SPHERES), USA). With a concerted effort under the ambit of INSACOG, we are hopeful that India will soon be a significant contributor to the GISAID database.

How can the IPD analysis reports impact the current situation? 

Different centres employ different sequencing processes or use various reagents; their equipment’s sensitivity may also differ. These variations give different data outcomes for which statistical analysis, downstream processing, and data integration are time-consuming and challenging tasks. The IPD is a handy tool in such a situation, specifically in a consortium-based format. The IPD can be helpful to establish uniformity for standardising the reports and comparing the data across centres. With such diverse SARS-CoV‑2 genomic information in hand, vaccines can be suitably tweaked. They can be updated to match the predominant variant, providing a quick strategy to combat the disease outbreak effectively.