NIPT. Bioinformatic analysis

F-Genetics

The algorithm processes raw sequencing data (fastq). Processing consists of demultiplexing samples (separating reads by index sequence), trimming (cutting off index and adapter sequences), mapping to the reference genome, marking PCR duplicates, and counting the reads mapped to individual genome regions. The deviation of these counts from the expected level is then estimated to determine the karyotype.

Bioinformatics analysis algorithm

The raw fluorescence data from DNA fragments are subjected to basecalling to produce fastq files, which contain the nucleotide sequence and the Phred quality score of each base. The data also carry the index sequence, which allows demultiplexing, that is, assigning the data obtained from each individual nanosphere to the corresponding test sample.
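The decoding and demultiplexing ideas can be illustrated with a minimal sketch. It assumes Phred+33 quality encoding and assumes that the index occupies the last bases of each read; both are illustrative assumptions, since the document does not state where the index is stored. The names phred_scores, demultiplex and the sample labels are hypothetical.

```python
from collections import defaultdict


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (an assumption, not stated in the document).
    """
    return [ord(ch) - offset for ch in quality_line]


def demultiplex(records, index_to_sample, index_length):
    """Group (sequence, quality) records by sample.

    Assumes the index is the last `index_length` bases of the read;
    reads with an unknown index go to the "undetermined" group.
    """
    by_sample = defaultdict(list)
    for sequence, quality in records:
        index = sequence[-index_length:]
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append((sequence, quality))
    return dict(by_sample)


# Toy reads: the last 4 bases play the role of the index sequence.
records = [
    ("ACGTACGTAAGG", "IIIIIIIIIIII"),
    ("TTGCATGCCCTT", "IIIIIIII##II"),
    ("GGGGAAAATTTT", "IIIIIIIIIIII"),
]
samples = demultiplex(
    records, {"AAGG": "sample_1", "CCTT": "sample_2"}, index_length=4
)
```

Here the third read carries an index that is not in the lookup table, so it ends up in the "undetermined" group rather than being assigned to a sample.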

After basecalling and demultiplexing, index and adapter sequences are trimmed. Trimming removes the service parts of the read so that they do not disrupt further analysis. Reads are also trimmed by quality, so that portions with low sequencing quality are not used in further analysis.
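Both kinds of trimming can be sketched as follows. This is a simplified illustration, not the pipeline's actual trimmer: it assumes exact adapter matches, Phred+33 encoding, and a quality threshold of 20, none of which are specified in the document. The function names are hypothetical.

```python
def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter.

    Real trimmers also tolerate mismatches and partial adapters;
    this sketch handles exact matches only.
    """
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]


def trim_low_quality_tail(sequence, quality, min_phred=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_phred.

    The threshold of 20 and the Phred+33 offset are assumptions.
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# Toy read: 8 informative bases followed by a made-up adapter.
seq, qual = trim_adapter("ACGTACGTAGATCGG", "IIIIIIII#######", "AGATCGG")
# Then remove the low-quality tail from another toy read.
seq2, qual2 = trim_low_quality_tail("ACGTACGT", "IIIIII##")
```

The sequence and the quality string are always cut at the same position, so they stay the same length after each step.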

Next, the reads are mapped to the human reference genome hg19. This sequence contains no alternative contigs or partial chromosome sequences, which makes it possible to estimate more accurately the number of reads mapped to each individual chromosome. Mapping is performed using the Burrows-Wheeler transform. The mapping data for each individual sample are stored in a .sam file.
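To show why the Burrows-Wheeler transform is useful for mapping, here is a toy sketch that builds the transform of a short text and counts exact occurrences of a pattern by backward search. It is a didactic illustration only: the transform is built naively from sorted rotations, the occurrence counts are recomputed on every step, and inexact matching is not handled. Production aligners use far more compact indexes. The function names are hypothetical.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform built naively from sorted rotations.

    The terminator must not occur in the text and must sort before
    every other character ('$' does for uppercase and lowercase letters).
    """
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(transformed, pattern):
    """Count exact matches of the pattern using backward search."""
    # first_row_start[c] = number of characters in the text smaller than c.
    first_row_start = {}
    total = 0
    for ch in sorted(set(transformed)):
        first_row_start[ch] = total
        total += transformed.count(ch)

    lo, hi = 0, len(transformed)
    for ch in reversed(pattern):
        if ch not in first_row_start:
            return 0
        # Narrow the range of sorted rotations that start with the suffix
        # of the pattern processed so far.
        lo = first_row_start[ch] + transformed[:lo].count(ch)
        hi = first_row_start[ch] + transformed[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
hits = count_occurrences(transformed, "ana")
```

The search walks the pattern from its last character to its first, so the cost of a lookup depends on the read length rather than on the size of the reference.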

After mapping, the .sam file is converted to the binary .bam format, which significantly speeds up the work of further algorithms. The file is sorted so that reads are grouped by contig (in the case of this document, chromosome) in alphabetical order and, within each contig, by increasing read start coordinate. PCR duplicates are then marked, which makes it possible to exclude from the analysis reads that are exact duplicates of other reads.
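The sort order and the duplicate marking can be sketched on in-memory records. This is an illustration under stated assumptions: the Alignment class is hypothetical, and a duplicate is taken to be a read with the same contig, start coordinate and strand as an earlier read. The document does not give the exact duplicate criterion, and dedicated tools use additional information.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical minimal alignment record (not a real SAM/BAM parser)."""
    name: str
    contig: str
    start: int
    strand: str
    is_duplicate: bool = False


def sort_alignments(alignments):
    """Sort by contig name alphabetically, then by increasing start."""
    return sorted(alignments, key=lambda a: (a.contig, a.start))


def mark_duplicates(sorted_alignments):
    """Flag every read whose (contig, start, strand) was already seen."""
    seen = set()
    for aln in sorted_alignments:
        key = (aln.contig, aln.start, aln.strand)
        aln.is_duplicate = key in seen
        seen.add(key)
    return sorted_alignments


alignments = [
    Alignment("r1", "chr2", 100, "+"),
    Alignment("r2", "chr1", 500, "+"),
    Alignment("r3", "chr10", 5, "-"),
    Alignment("r4", "chr1", 500, "+"),
    Alignment("r5", "chr1", 500, "-"),
]
marked = mark_duplicates(sort_alignments(alignments))
kept = [aln.name for aln in marked if not aln.is_duplicate]
```

Note that alphabetical ordering places chr10 before chr2. Duplicates are only flagged, not deleted, so later steps can decide whether to skip them.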

Then the number of reads mapped to each equal-sized section of the chromosome, called a bin, is counted. Each bin coverage is standardized: a z-score is calculated as the number of standard deviations by which the bin coverage differs from the mean over a sample of all bin coverages. If the z-score is high, it is concluded that the chromosome under study carries a trisomy.
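The binning and standardization can be sketched as below. This is a minimal illustration using the standard library only; the bin size, the use of the sample standard deviation, and the function names are assumptions, and no decision threshold is given because the document does not state one.

```python
from collections import Counter
from statistics import mean, stdev


def count_reads_per_bin(read_starts, bin_size):
    """Count reads per bin; a read belongs to bin number start // bin_size."""
    return Counter(start // bin_size for start in read_starts)


def z_scores(coverages):
    """Standardize coverages: (value - mean) / sample standard deviation.

    Needs at least two values; returns zeros if all values are equal.
    """
    mu = mean(coverages)
    sigma = stdev(coverages)
    if sigma == 0:
        return [0.0 for _ in coverages]
    return [(value - mu) / sigma for value in coverages]


# Toy read start coordinates on one chromosome, with 100 bp bins.
bin_counts = count_reads_per_bin([10, 20, 150, 250, 260, 270], bin_size=100)

# Toy coverages: the last bin is clearly higher than the rest.
scores = z_scores([10, 10, 10, 10, 20])
```

In the toy coverages the last value lies well above the others, so it receives the largest positive z-score while the remaining values receive small negative ones.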

The algorithm is implemented in Python using object-oriented programming approaches.