Pre-requisites

Raw sequencing reads: Mitochondrial DNA sequence data is stored in fastq format which reports nucleotide sequence with their Phred qualities in a single file. Command-line version is compatible with data having 51 bp of read length.

Reference genome: The analysis is based on alignment of sequencing reads with modified version of revised Cambridge Reference Genome Sequence (rCRSm) which can be downloaded from http://genome.igib.res.in/mitomatic/

Alignment tool: MAQ, BWA or BOWTIE can be used for sequence alignment and variant calling. Pileup file summarizes the base calls of aligned reads to a reference sequence and is required to extract variations in the sample reads.

Command-line version: The script (mitomatic.pl) provided in the command-line version can be used either to run the complete computational pipeline or to parse the pileup file created by MAQ, BWA or BOWTIE. The parsed file is can be uploaded to mit-o-matic (http://genome.igib.res.in/mitomatic/) for further analysis.

Best Practice

FastQ quality analysis: Use FastQC to analyse various parameters such as sequence quality, per base sequence content, sequence length distribution and presence of overrepresented sequences.

Adapter trimming: It is preferred to remove overrepresented sequences like adapters, PCR primers and other artifacts. Use Trimmomatic to remove adapters and check the FastQC report again to confirm if all the overrepersented sequences have been removed.

Quality Trimming: Since quality of reads often degrades over the course of a sequnce, the low quality bases should also be removed. Average quality of 30 is usually considered good and can be set while running trimmomatic as well. DynamicTrim script from SolexQA tool can also be used for qulaity trimming.

Length Sort: After quality triming, it is preferred to separate high quality reads from low quality on a user-defined length cutoff. LengthSort script from trimmomatic as well. DynamicTrim script from SolexaQA tool can be used for length sort over the average sequence length identified by FastQC.

Alignment mismatch: While aligning the reads to the reference sequence, alignment mismatch count of 2 is considered to be standard.

Paired End: Using paired-end reads for the analysis improves results over using single-end reads.

Command-line version

Command-line platform of mit-o-matic offers execution of the entire computational pipeline with a single command. The script provided runs alignment of the sample reads on the modified reference genome (rCRSm) using the selected platform (MAQ, BWA or BOWTIE). Pileup file generated is then parsed to extract and filter variations based on heteroplasmic frequencies. Finally the parsed pileup is submitted to http://genome.igib.res.in/mitomatic/ with LWP. This generates a sample report in the form of an html file (mito_out.html) in the output directory.

The user can use the same script to just parse an already created pileup and save it for future use for analysis of mitochondrial variations with mit-o-matic. Different heteroplasmic frequencies can be set depending on the depth of sample reads. Usually 10% heteroplasmic frequency is considered to be acceptable.

Untar the downloaded file: tar -xvzf mitomatic_command-line.tar.gz

Change directory: cd mitomatic_command-line

Complete pipeline

USAGE: perl mitomatic.pl -t [bowtie/bwa/maq] -o [se/pe] -f [Percentage of heteroplasmic frequency] -d [output directory] -i [input fastq files]

Parsing pileup

USAGE: perl mitomatic.pl -p [pileup file] -t [bowtie/bwa/maq] -f [Percentage of heteroplasmic frequency]

-t    Tool to be used for the alignment from BWA, BOWTIE and MAQ
-o   Type of data, whether single end or paired end
-f    Percentage of heteroplasmic frequency to call the variants [Preferred: 10]
-d   Name of the output directory
-I    Input fastq file. In case of paired end data mention both the files one after other
-p   For parsing an already generated pileup
-n    For just parsing the pileup file, mention the output filename after -n. By default the output is saved in parsed_all.pileup.

-h    Help

Examples for command line

1. Set default parameters: perl mitomatic.pl -c -i sample.fastq

Runs complete pipeline with default parameters: alignment using maq, assuming data as single end, filtering variations with 10% hetereoplasmic frequency. Output is stored in folder named mitomatic_out.

2. Change the alignment tool: perl mitomatic.pl -c -i sample.fastq -t bwa

Runs complete pipeline using bwa for alignment, assuming data as single end, filtering variations with 10% hetereoplasmic frequency. Output is stored in folder named mitomatic_out. To align with maq/bowtie, replace bwa with maq/bowtie.

3. Change the alignment tool and set frequency: perl mitomatic.pl -c -i sample.fastq -t bwa -f 5

Runs complete pipeline using bwa for alignment, assuming data as single end, filtering variations with 5% hetereoplasmic frequency. Output is stored in folder named mitomatic_out. Set frequency after option -f.

4. Change name of the output directory: perl mitomatic.pl -c -i sample.fastq -d sample_maq

Runs complete pipeline using maq for alignment, assuming data as single end, filtering variations with 10% hetereoplasmic frequency. Output is stored in folder named sample_maq.

5. For paired end data: perl mitomatic.pl -c -i sample_1.fastq sample_2.fastq -o pe

Runs complete pipeline using maq for alignment considering paired end data, filtering variations with 10% hetereoplasmic frequency. Output is stored in folder named mitomatic_out. In case -o pe option is not given, only the first file is considered for alignment, ignoring the second file.

6. For parsing already prepared pileup file: perl mitomatic.pl -p all.pileup -t bwa

Filters variations with 10% heteroplasmic frequency from the pileup created by bwa and prepares a file parsed_all.pileup compatible with mit-o-matic online version. -t option is mandatory while parsing pileup.

7. For parsing already prepared pileup file with a defined frequency: perl mitomatic.pl -p all.pileup -t maq -f 5

Filters variations with 5% heteroplasmic frequency from the pileup created by maq and prepares a file parsed_all.pileup compatible with mit-o-matic online version.

Pileup Format

It is a text-based format for describing the base-pair information of the reads aligned to a reference sequence. It facilitates the display of SNP/Indel calling and alignment as well. Each line consists of 5-6 tab separated columns that include Sequence identifier, position in sequence (1-based coordinate system), reference base, number of reads covering the site (depth of coverage), read bases and base qualities.

Format Description

Parsed pileup Format for mit-o-matic

The input for mit-o-matic is a tab-delimited file, prepared from the pileup file. Each line consists of a sequence identifier, position, reference base, number of reads covering the site (depth of coverage), count of A, T, G, C nucleotides, number of forward and reverse reads, counts of nucleotides in forward and reverse strands repectively. To prepare the input file for mit-o-matic in this format, a perl script (mitomatic.pl) has been provided in the command-line version available at the home page.
Use command-line version of mit-o-matic to parse the pileup file

USAGE: mitomatic.pl -p [pileup file] -t [bowtie/bwa/maq] -f [Percentage of heteroplasmic frequency]
Format Description

Sample Pileup

mit-o-matic Output:

Report Summary:

The tool summarizes the report of analysis on top of the output page. Herein, the reference genome being used, the haplogroup corresponding to the input variants and counts of some significant characteristics of the report are described. This includes:

Reference sequence:	Reference genome that has been used for analysis
Number of Variants Uploaded:	Count of the variants that have been uploaded by the user
Number of Variants Mapped:	Count of the variants that have already been reported in the widely used databases that include mitomap and mtDB.
Novel Variants:	The variants that have not been reported as of now in the available databases are considered to be novel.
Haplogroup:	It reports the haplogroup to which the uploaded variants belong. This is deciphered from the haplogroup table compiled from the comprehensive phylogenetic tree of global human mitochondrial DNA variation (available at http://www.phylotree.org/builds/mtDNA_tree_Build_15.zip). The haplogroup having the highest number of variants matching to the input variants is reported.
Disease Associated Variants:	Number of variants that have been reported to have association with some mitochondrial disease by mitomap (available at http://www.mitomap.org/bin/view.pl/MITOMAP/MutationsCodingControl)
Number of Heteroplasmic Sites:	Number of times multiple mutations co-exist in various proportions (>10%) within the mitochondrial DNA.
Number of Variants with Population Frequency:	It reports the count of variants for which population frequency could be found.

Below the report summary three different panels give section-wise information regarding annotation of individual variation. Certain parameters common to each panel include variant position, heteroplasmy percentage, reference allele, alternate allele, amino acid change corresponding to mutation, gene within which the variant position falls, type of mutation and population frequency for the variant. In case more than one mutations co-exist in various proportions (>10%) within the mitochondrial DNA, the site is considered to be heteroplasmic and is marked with a * sign.

Nucleotides for which Heteroplasmy percentage is greater than 10% are considered to be the alternate alleles and are reported along with the corresponding percentage in the second column.

Heteroplasmy percentage = (Number of A/T/G/C x 100)/Total Number of Reads

Description of column names:

Column Name	Description
Variant	Variant Position
Heteroplasmy %	Percentage of heteroplasmic frequency
Ref Allele	Reference Allele
Alt Allele	Alternate Allele
AA Position	Amino Acid Position
AA Change	Change in Amino Acid
Gene	Mitochondrial DNA Function Location (Gene Loci)
Disease	Associated Disease
Syn/Non-Syn	Type of mutation - Synonymous or Non - Synonymous
Pop Freq	Population Frequency
Homoplasmy	'+' If Homoplasmic at variant position
Heteroplasmy	'+' If Heteroplasmic at variant position
SIFT Prediction	Whether the non-synonymous mutation is Tolerated or Deleterious

Phenotype Association panel:

All the variations that are associated with phenotype and thus have disease association are reported under this panel. 8th column reports the corresponding disease that might be seen in case of variation at that particular position.

Novel Variants panel:

All the variations that are not reported in any other database as of now are considered to be novel and their corresponding information is shown within this panel. SIFT annotations are provided for the non-synonymous novel variants to analyze the effect of amino acid substitution on the protein function based upon structural change in protein.

Mapped Variants panel:

This panel enlists information regarding those variations that have already been reported in any of the existing databases. Position 3107, represented by an "N" is the error nucleotide in the original CRS that is maintained in the revised sequence as a gap. This error position shall always be detected as one of the heteroplasmic sites.