
High-throughput sequencing technologies have developed rapidly in recent years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, combining five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for the analysis of NGS data in plant research. Across the different performance metrics, BWA-MEM and Novoalign were the best mappers, and GATK returned the best results in the variant calling step.

As the basis of biological properties and heredity, the genome of a species is a valuable resource for numerous studies. However, there are subtle differences between individuals of the same species, which are of academic and economic interest as they determine phenotypic differences. Dropping sequencing costs boosted high-throughput sequencing projects, thus facilitating the analysis of this genetic diversity. For example, accessions of the A. thaliana population were studied in the 1001 Genomes Project. As the number of high-quality reference genome sequences rises continuously, the number of re-sequencing projects increases as well. There are pan-genome projects for various species focusing on genome evolution, and mapping-by-sequencing projects which focus on agronomically important traits of crops.

An accurate and comprehensive identification of sequence variants between a sample and the reference sequence is the major challenge in many re-sequencing projects. The large amount and diverse nature of NGS data types, the diversity of bioinformatics algorithms, and the quality of the reference genome sequence render the choice of the best approach challenging. Variant calling pipelines often start with (I) the preprocessing of sequence reads, followed by (II) the alignment (mapping) of these reads to a reference sequence. Finally, (III) the identification (calling) of sequence variants is performed based on the alignments. Each of these three steps can be carried out by various alternative programs using different algorithms, which influence the accuracy and sensitivity of the resulting variant set.

First, read preprocessing can be required if the read quality is at least partially low.
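To illustrate how a set of called variants can be scored by sensitivity and specificity, the following minimal Python sketch compares a call set against a gold standard. It is not the evaluation code of this study: the standard confusion-matrix definitions are assumed, the names `Variant`, `Metrics`, and `evaluate_calls` are hypothetical, and the true negative count is derived from an assumed number of assessed sites with at most one variant per site.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant representation: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    sensitivity: float
    specificity: float
    precision: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate_calls(called: Set[Variant], gold: Set[Variant], n_sites: int) -> Metrics:
    """Score a call set against a gold standard.

    Assumption: n_sites is the number of assessed reference positions and
    each position carries at most one variant, so that every site falls
    into exactly one of the four confusion-matrix classes.
    """
    tp = len(called & gold)   # called and present in the gold standard
    fp = len(called - gold)   # called but absent from the gold standard
    fn = len(gold - called)   # present in the gold standard but missed
    tn = n_sites - tp - fp - fn
    if tn < 0:
        raise ValueError("n_sites is smaller than the number of variant sites")
    return Metrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        precision=_ratio(tp, tp + fp),
    )


# Toy data, invented purely for illustration.
gold = {("Chr1", 100, "A", "G"), ("Chr1", 250, "C", "T"), ("Chr2", 40, "G", "A")}
called = {("Chr1", 100, "A", "G"), ("Chr1", 250, "C", "T"), ("Chr3", 7, "T", "C")}
m = evaluate_calls(called, gold, n_sites=100)
print(m)
```

Because non-variant sites vastly outnumber variant sites in a genome, specificity computed this way tends to be close to 1 for any reasonable caller, so precision is reported alongside it in this sketch.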
