Comparative Whole-Genome Analysis of Clinical Isolates Reveals Characteristic Architecture of Mycobacterium tuberculosis Pangenome

 

Periwal et al.

 

Corresponding Author Email address: vinods@igib.in

Abstract

The tubercle complex consists of closely related Mycobacterium species that appear to be variants of a single species. Comparative genome analysis of different strains can provide useful insights into the genetic diversity of the species. To understand the pangenome architecture of the Mycobacterium tuberculosis complex (MTBC), we integrated genome assemblies of 96 MTBC strains, including 8 Indian clinical isolates sequenced and assembled in this study. We predicted genes for all 96 strains and clustered their coding sequences (CDSs) into homologous gene clusters (HGCs) to delineate the hard-core, soft-core and accessory genome components of MTBC. The hard core (HGCs shared among 100% of the strains) comprised 2,066 gene clusters, whereas the soft core (HGCs shared among at least 95% of the strains) comprised 3,374 gene clusters. The change in the core and dispensable genome components, when observed as a function of their size, revealed that MTBC has an open pangenome. We identified 74 HGCs that were absent from the reference strains H37Rv and H37Ra but present in the majority of clinical isolates. We report PCR validation of 10 candidate genes: 8 genes were completely absent from H37Rv and H37Ra, whereas 2 genes shared partial homology with them, probably as a result of insertion and deletion events. The pangenome approach is a promising tool for studying strain-specific genetic differences within a species. We also suggest that, because selecting target genes for typing requires that the target gene be present in all isolates being typed, estimating the core component of the species is of prime importance.
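The hard-core, soft-core and accessory definitions above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function name `classify_clusters`, the `presence` mapping (HGC identifier to the set of strains carrying it) and the toy data are all hypothetical. The labels here are mutually exclusive, whereas the soft-core count reported in the abstract (at least 95% of strains) also includes the hard-core clusters.

```python
from typing import Dict, Set


def classify_clusters(
    presence: Dict[str, Set[str]],
    n_strains: int,
    soft_percent: int = 95,
) -> Dict[str, str]:
    """Label each homologous gene cluster (HGC) by its strain coverage.

    presence     -- hypothetical mapping: HGC id -> set of strains carrying it
    n_strains    -- total number of strains in the comparison
    soft_percent -- minimum percentage of strains for the soft core
    """
    labels = {}
    for cluster_id, strains in presence.items():
        count = len(strains)
        if count == n_strains:
            # Present in 100% of the strains.
            labels[cluster_id] = "hard-core"
        elif 100 * count >= soft_percent * n_strains:
            # Present in at least soft_percent% of the strains
            # (integer arithmetic avoids floating-point rounding).
            labels[cluster_id] = "soft-core"
        else:
            labels[cluster_id] = "accessory"
    return labels


# Toy example with 20 made-up strains (not data from this study).
strains = [f"s{i}" for i in range(20)]
presence = {
    "HGC_A": set(strains),       # 20 of 20 strains
    "HGC_B": set(strains[:19]),  # 19 of 20 strains (95%)
    "HGC_C": set(strains[:5]),   # 5 of 20 strains
}
labels = classify_clusters(presence, n_strains=len(strains))
for cluster_id, label in sorted(labels.items()):
    print(cluster_id, label)
```

To obtain a soft-core count comparable to the one reported above, the clusters labeled "hard-core" and "soft-core" would be added together.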

 

Links to the Data:

Strain ID/SRA Accession    Contigs                                                SRA
OSDD487/SRR786669          http://genome.igib.res.in/Mtb_Pangenome/OSDD487.fas    http://www.ncbi.nlm.nih.gov/sra/?term=SRR786669
OSDD472/SRR786667          http://genome.igib.res.in/Mtb_Pangenome/OSDD472.fas    http://www.ncbi.nlm.nih.gov/sra/?term=SRR786667
OSDD071/SRR786373          http://genome.igib.res.in/Mtb_Pangenome/OSDD071.fas    http://www.ncbi.nlm.nih.gov/sra/?term=SRR786373
OSDD326/SRR786668          http://genome.igib.res.in/Mtb_Pangenome/OSDD326.fas    http://www.ncbi.nlm.nih.gov/sra/?term=SRR786668
OSDD518/SRR786670          http://genome.igib.res.in/Mtb_Pangenome/OSDD518.fas    http://www.ncbi.nlm.nih.gov/sra/?term=SRR786670
OSDD630/SRR786188          http://genome.igib.res.in/Mtb_Pangenome/OSDD630.fas    http://www.ncbi.nlm.nih.gov/sra/?term=SRR786188
OSDD386/SRR784917          http://genome.igib.res.in/Mtb_Pangenome/OSDD386.fas    http://www.ncbi.nlm.nih.gov/sra/?term=SRR784917
OSDD504/SRR786397          http://genome.igib.res.in/Mtb_Pangenome/OSDD504.fas    http://www.ncbi.nlm.nih.gov/sra/?term=SRR786397

 

 

Predicted genes for all 96 MTBC genomes, generated with the Prodigal gene prediction software (files are named by genome accession): http://genome.igib.res.in/Mtb_Pangenome/Pred_genes/