NGS data analysis

Now that we have sequenced the DNA sample, we need to analyze the resulting data. The raw image data (pictures of the fluorescence-labeled nucleotides) is very large; it can be up to 1 terabyte! The sequencing machine does some data processing itself to reduce the file size. The data analysis process for Next Generation Sequencing can be divided into three steps: primary, secondary, and tertiary analysis.

Primary analysis

The primary data analysis figure has 3 parts. Part 1 shows the Illumina sequence identifiers, the read sequence, and the Phred quality score. Part 2 is a table explaining the meaning of each part of the Illumina sequence identifier, using the identifier HWUSI-EAS100R:6:73:941:1973#0/1 as an example. HWUSI-EAS100R is the unique instrument name, given at the beginning. 6 is the flow cell lane. 73 is the tile number within the flow cell lane. 941 is the x coordinate of the cluster within the tile. 1973 is the y coordinate of the cluster within the tile. #0 is the index number for a multiplexed sample, where 0 means no indexing. /1 is the member of a pair, used for paired-end or mate-pair reads only. Part 3 is a table with columns for the Phred quality score, the probability of an incorrect base call, and the base call accuracy. A Phred quality score of 10 corresponds to a probability of an incorrect base call of 1 in 10 and a base call accuracy of 90%. As the Phred quality score increases, the probability of an incorrect base call decreases and the base call accuracy increases. At a Phred quality score of 50, the probability of an incorrect base call is 1 in 100,000 and the base call accuracy is 99.999%.

Figure 1: Primary Data Analysis.
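The identifier layout described for Figure 1 can be illustrated with a short Python sketch. It assumes the older Illumina header layout `@instrument:lane:tile:x:y#index/pair`, with the example identifier written in its usual colon-separated form; the class and function names are hypothetical and only serve the illustration. Newer instruments use a different header layout that this pattern does not cover.

```python
import re
from typing import NamedTuple


class IlluminaId(NamedTuple):
    """Fields of an older-style Illumina sequence identifier (illustrative)."""
    instrument: str   # unique instrument name
    lane: int         # flow cell lane
    tile: int         # tile number within the lane
    x: int            # x coordinate of the cluster within the tile
    y: int            # y coordinate of the cluster within the tile
    index: str        # index for a multiplexed sample ("0" = no indexing)
    pair_member: int  # 1 or 2, for paired-end or mate-pair reads


# Assumed layout: @instrument:lane:tile:x:y#index/pair
_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_illumina_id(header: str) -> IlluminaId:
    """Split a FASTQ header line into its named parts."""
    match = _ID_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"Unrecognized identifier layout: {header!r}")
    return IlluminaId(
        instrument=match["instrument"],
        lane=int(match["lane"]),
        tile=int(match["tile"]),
        x=int(match["x"]),
        y=int(match["y"]),
        index=match["index"],
        pair_member=int(match["pair"]),
    )


if __name__ == "__main__":
    print(parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```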

Primary analysis includes all the steps required to call, or identify, each base. Besides identifying the bases, the sequencing machine also assigns a quality score to each of them. The most common output is a FASTQ file (see Figure 1), which contains the sequence identifiers, the called nucleotide sequences (strings of A, G, T or C, also called "reads") and the associated Phred quality scores. When a nucleotide is reported as N, the machine could not determine which base it was. The Phred quality score Q expresses the probability P of an incorrect base call on a logarithmic scale, Q = -10 log10(P), so a higher score means a more reliable call. The primary analysis is typically performed automatically by the sequencing machine after each run.
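The relationship between a Phred quality score and the probability of an incorrect base call can be sketched in a few lines of Python. The function names are hypothetical. The ASCII offset of 33 used to decode a FASTQ quality string is an assumption: it matches the Sanger and current Illumina encoding, but some older Illumina pipelines used an offset of 64.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string into one Phred score per base.

    The offset of 33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality]


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f}%")
```

Running the loop reproduces the kind of table shown in the figure: each increase of 10 in the score divides the error probability by 10.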

Secondary analysis

Secondary analysis is performed after the primary analysis. When you want to sequence several samples together in one run (for example, from different patients or different experiments), you can assign a specific tag to each of them. The tag, also known as the barcode or index, is a short DNA sequence that is added to the adapter to differentiate the reads from each sample. This tag is sequenced along with the sample, and by identifying the barcode carried by each read you can separate the samples from each other again (demultiplexing). Pooling samples in this way is called multiplexing, and it has the great advantage of lowering the sequencing cost per sample and allowing more samples to be processed per run. Before the rest of the secondary analysis is performed, the tags and adapters are trimmed off, because these sequences have no biological meaning.
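The idea of sorting reads by their tag and then trimming the tag can be sketched as follows. This is a simplified illustration, not how a production demultiplexer works: it assumes the barcode sits at the very start of each read and must match exactly, whereas real tools usually read the index separately and tolerate mismatches. The barcodes, sample names, and function name are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by their leading barcode and trim the barcode off.

    reads    -- iterable of read sequences (strings)
    barcodes -- dict mapping barcode sequence -> sample name
    Reads without a known barcode are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                # Keep only the part of the read after the barcode.
                by_sample[sample].append(seq[len(barcode):])
                break
        else:
            by_sample["undetermined"].append(seq)
    return dict(by_sample)


if __name__ == "__main__":
    barcodes = {"ACGT": "patient_1", "TGCA": "patient_2"}
    reads = ["ACGTGGGTTT", "TGCAAAACCC", "ACGTCCCC", "NNNNAAAA"]
    for sample, seqs in demultiplex(reads, barcodes).items():
        print(sample, seqs)
```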

The main aim of secondary analysis is to assemble all those short DNA sequences (also called reads) so that we can interpret the sequence data. Before this reassembly, the “raw” reads from the machine are often assessed and filtered for quality to produce the best results, removing reads that have low Phred quality scores. When the reassembly is performed from scratch without any reference genome, it is referred to as de novo assembly. However, when there is a reference genome available, the process is much simpler because we can simply align all the reads to the reference genome.
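A minimal sketch of the quality filtering step is shown below. It assumes each read is available as a `(name, sequence, quality string)` tuple, that qualities use the ASCII offset 33, and that a read is kept when its mean Phred score reaches a threshold of 20. All three choices are assumptions made for illustration; real pipelines often use more refined criteria, such as trimming low-quality ends.

```python
def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (offset 33 assumed)."""
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores)


def filter_reads(records, min_mean_quality: float = 20.0, offset: int = 33):
    """Keep only the reads whose mean Phred score reaches the threshold.

    records -- iterable of (name, sequence, quality_string) tuples
    """
    return [
        (name, seq, qual)
        for name, seq, qual in records
        if qual and mean_quality(qual, offset) >= min_mean_quality
    ]


if __name__ == "__main__":
    records = [
        ("read_1", "ACGT", "IIII"),  # Phred 40 at every base
        ("read_2", "ACGT", "!!!!"),  # Phred 0 at every base
        ("read_3", "ACGT", "5555"),  # Phred 20 at every base
    ]
    for name, _, _ in filter_reads(records):
        print("kept", name)
```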

Normally we would have several reads mapping to the same area of the genome; the number of such reads is referred to as the "read depth". The read depth measures how many times an area is covered by different reads. For example, a read depth of 10 implies that 10 reads map on top of each other in the same genomic area.
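Read depth can be computed by counting, for every position, how many aligned reads cover it. The sketch below assumes each alignment is given as a 0-based start position and a read length, with no gaps in the alignment; the function name and input format are hypothetical.

```python
def read_depth(alignments, region_length: int) -> list[int]:
    """Per-position read depth over a region of the reference.

    alignments    -- iterable of (start, length) pairs, 0-based start
    region_length -- number of reference positions to report
    """
    depth = [0] * region_length
    for start, length in alignments:
        # Clip the read to the region, then count it at each covered position.
        for pos in range(max(start, 0), min(start + length, region_length)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    # Three reads of length 5 that partially overlap.
    print(read_depth([(0, 5), (2, 5), (4, 5)], region_length=10))
```

In this example only position 4 is covered by all three reads, so it has a read depth of 3.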

Tertiary analysis

Tertiary analysis is necessary in order to understand and make sense of the sequencing result. It includes variant calling and the downstream analyses built on it (for example SNP profiling, genome-wide association studies, and finding chromosomal aberrations).

Variant calling is the process of accurately determining the variations (or differences) between a sample and the reference genome. These may be in the form of single nucleotide variants, small insertions or deletions (called indels), or larger structural variants such as inversions, translocations, and copy number variants.
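A deliberately naive sketch of calling single nucleotide variants is given below. It piles up the aligned reads over the reference and reports a position when enough reads cover it and most of them agree on a base that differs from the reference. The thresholds (`min_depth`, `min_fraction`), the gap-free alignment format, and the function name are assumptions for illustration; real variant callers use statistical models and also handle indels and base qualities.

```python
from collections import Counter


def call_snvs(reference: str, alignments, min_depth: int = 3,
              min_fraction: float = 0.8):
    """Report positions where the reads disagree with the reference.

    alignments -- iterable of (start, read_sequence), 0-based, no indels
    Returns a list of (position, reference_base, variant_base, depth).
    """
    pileup = [Counter() for _ in reference]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            # Ignore bases outside the reference and undetermined bases (N).
            if 0 <= pos < len(reference) and base != "N":
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, count = counts.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            variants.append((pos, reference[pos], base, depth))
    return variants


if __name__ == "__main__":
    reference = "ACGTACGT"
    alignments = [(0, "ACGAAC"), (1, "CGAACG"), (2, "GAACGT")]
    print(call_snvs(reference, alignments))
```

All three reads in the example carry an A where the reference has a T at position 3, so that position is reported as a single nucleotide variant.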

Some variations are characteristic of ancient DNA samples, for example an excess of C > T substitutions at the 5' end of the reads and G > A substitutions at the 3' end. Using these characteristics, we can identify the ancient DNA and separate it from contaminating modern DNA.
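One way to look for this signature is to measure how often a reference C is read as a T at each of the first few positions from the 5' end of the reads. The sketch below assumes each read is supplied together with the reference segment it aligns to, in the same orientation and without gaps; the function name and the window size are hypothetical. An elevated rate at the first positions would be consistent with the ancient DNA pattern described above, but this is an illustration rather than an authentication method.

```python
def c_to_t_rate_by_position(pairs, window: int = 5) -> list[float]:
    """Fraction of reference C bases read as T, per position from the 5' end.

    pairs  -- iterable of (reference_segment, read) strings, aligned
              base for base from the 5' end of the read
    window -- number of positions from the 5' end to examine
    """
    ref_c = [0] * window    # how often the reference has a C at position i
    c_to_t = [0] * window   # how often that C is read as a T
    for ref_seg, read in pairs:
        for i in range(min(window, len(ref_seg), len(read))):
            if ref_seg[i] == "C":
                ref_c[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [c_to_t[i] / ref_c[i] if ref_c[i] else 0.0 for i in range(window)]


if __name__ == "__main__":
    pairs = [
        ("CACGT", "TACGT"),  # C > T at the first position
        ("CCGTA", "TCGTA"),  # C > T at the first position
        ("CAGCT", "CAGCT"),  # no substitution
    ]
    print(c_to_t_rate_by_position(pairs, window=3))
```

The same counting applied from the 3' end, with G > A instead of C > T, would cover the other half of the pattern.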

After identifying variations that are present in the sample, we can then analyze and try to understand the biological impact of these variations, for example, by performing SNP analysis. The difference in one nucleotide can result in differential gene expression that gives rise to a specific phenotype; you can read some of these SNP examples in ancient Greenland SNP.