Comprehensive Quality Control (QC) in NGS Data Analysis: A Step-by-Step Guide

MGI

Sep 24, 2024

Manuel Delpero Ph.D., Bioinformatics scientist

Explore a comprehensive step-by-step guide to quality control (QC) in Next-Generation Sequencing (NGS) data analysis. Learn essential tools and methods, from FastQC to MultiQC, to ensure your NGS data is accurate and reliable for downstream analysis.

🔬 As someone who is frequently asked how to perform quality control (QC) on Next-Generation Sequencing (NGS) data, I wanted to share a detailed guide that outlines the essential steps. Proper QC is crucial for ensuring that your NGS data is accurate, reliable, and ready for downstream analysis. Whether you’re new to NGS or just looking to refine your workflow, this guide will help you catch quality problems early and carry only trustworthy data into your analysis.

1. Assess raw data quality. FastQC: Begin by evaluating your raw sequencing reads with FastQC. This tool provides initial insights into base quality scores, GC content, and overrepresented sequences, helping you identify any immediate issues with your data.
2. Trim and filter reads. Adapter removal: Remove adapter sequences and low-quality bases using tools like Trimmomatic, Cutadapt, or SOAPnuke. The best trimming tool often depends on the sequencing platform used, so choose the one most appropriate for your data. Quality filtering: Apply filters to remove reads with low base quality (for example, a mean Phred score below 20) so that only high-quality reads are retained for further analysis.
3. Evaluate alignment quality. Mapping quality: Align your reads to a reference genome using tools like BWA or Bowtie2. After alignment, evaluate the quality of your mapping with tools such as SAMtools, Picard, and Qualimap. These tools provide a range of metrics, including alignment rate, mismatch rate, coverage uniformity, and duplication levels, which are crucial for judging data integrity.
4. Perform variant calling QC. Variant quality filtering: For variant calling, use tools like HaplotypeCaller from GATK. Apply appropriate filters based on quality scores, depth, and strand bias to minimize false positives and make your findings more reliable. Functional annotation: Annotate identified variants using tools like ANNOVAR, SnpEff, or VEP (Variant Effect Predictor) from Ensembl. These tools help prioritize variants with potential biological significance, enabling more focused downstream analysis.
5. Generate final QC reports. Comprehensive reporting with MultiQC: After completing all QC steps, use MultiQC to aggregate the results from the tools used into a single report. This final step gives you a complete overview of the quality of your data and makes it easy to interpret and compare samples.
6. Automate your workflow. To streamline your pipeline, consider using workflow management tools like Snakemake or Nextflow. Automation improves consistency, efficiency, and reproducibility across projects, saving time and reducing the risk of human error.
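
The quality-filtering idea in the trimming step can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for Trimmomatic, Cutadapt, or SOAPnuke. It assumes Phred+33 encoded, four-line FASTQ records, and it scores each read by the arithmetic mean of its per-base Phred values, which is a simplification. The function names and the `min_mean_quality` parameter are hypothetical and chosen only for this example.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def filter_fastq_records(lines, min_mean_quality=20):
    """Yield (header, sequence, quality) for reads that pass the threshold.

    Assumes well-formed four-line FASTQ records; blank lines are skipped.
    """
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        # Keep the read only if its mean base quality meets the cutoff.
        if qual and mean(phred_scores(qual)) >= min_mean_quality:
            yield header, seq, qual


# Toy input: 'I' decodes to Phred 40, '#' decodes to Phred 2.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "####",
]
kept = list(filter_fastq_records(example))
print([header for header, _, _ in kept])  # prints ['@read1']
```

In practice the same threshold would be passed to a dedicated trimmer, which also handles adapters, paired reads, and compressed input. The sketch is only meant to make the "below 20" cutoff concrete.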

By following these steps, you can be confident that your NGS data is of high quality, setting the stage for accurate, reliable, and meaningful results. Happy sequencing! 🎉

🔗 Useful Links:

FastQC - https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
MultiQC - https://multiqc.info/
Trimmomatic - http://www.usadellab.org/cms/?page=trimmomatic
Cutadapt - https://cutadapt.readthedocs.io/en/stable/
SOAPnuke - https://github.com/BGI-flexlab/SOAPnuke
BWA - http://bio-bwa.sourceforge.net/
Bowtie2 - http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
SAMtools - http://www.htslib.org/
Picard - https://broadinstitute.github.io/picard/
Qualimap - http://qualimap.conesalab.org/
GATK HaplotypeCaller - https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-HaplotypeCaller
ANNOVAR - https://annovar.openbioinformatics.org/en/latest/
SnpEff - http://snpeff.sourceforge.net/
VEP (Variant Effect Predictor) from Ensembl - https://www.ensembl.org/info/docs/tools/vep/index.html
Snakemake - https://snakemake.readthedocs.io/en/stable/
Nextflow - https://www.nextflow.io/
