Why NGS-QC

Background
Next-generation sequencing (NGS)-based profiling assays such as ChIP-seq (chromatin immunoprecipitation assessed by NGS) and other enrichment-related assays can vary widely in quality because many factors are involved in their production. Experimental parameters such as cross-linking efficiency in different cell types or tissues, the shearing or digestion of chromatin, and the selectivity and affinity of an antibody (or antibody batch) can vary substantially between experiments and experimenters, and ultimately affect the overall quality of the final readout.

In addition to the performance of the immunoprecipitation/enrichment assays, rapid technological progress has produced NGS platforms with widely differing sequencing capacities, ranging from tens of millions of reads per flow cell (e.g. Illumina Genome Analyzer, GA) to >3 billion (HiSeq 2000). As a consequence, the public databases hosting NGS-generated datasets are populated with enrichment profiles that differ greatly in sequencing depth. Importantly, previous studies have shown that the number of discovered binding sites increases with sequencing depth. Intuitively, the number of sequenced reads required to discover all binding events is expected to depend directly on their total number and on their binding pattern (i.e. 'broad' regions covering large parts of a genome require more reads to be properly identified than 'sharp' patterns with few target sites). When evaluating the quality of NGS-based profiling, it is therefore important to assess whether a given ChIP-seq profile was generated under optimal sequencing conditions, including the minimal sequencing depth required to discover most of the relevant binding events of a given factor.

While a number of analytical methods for assessing the quality of ChIP-seq datasets have been described, none of them has been shown to be applicable to the large variety of ChIP-seq and enrichment-related NGS profiling assays. In contrast, our proprietary NGS-QC algorithm has been designed to (i) infer a set of global QC indicators (QCis), which reveal the comparability of different enriched NGS datasets, (ii) provide local QCis to judge the robustness of cumulative read counts ('peaks' or 'islands') in a particular region, (iii) provide guidelines for the choice of the optimal sequencing depth for a given target and, finally, (iv) provide a quantitative means of comparing different antibodies and antibody batches for ChIP-seq and related antibody-driven studies.

Importantly, this original concept has been applied to establish the first QC indicator database covering a large collection of ChIP-seq and enrichment-related datasets retrieved from the public domain. Our team is working extensively to cover virtually all publicly available enrichment-related NGS profiling assays, so that users can compare the quality indicators computed by the NGS-QC Generator tool for a given ChIP-seq experiment with those of the published datasets present in the QC indicator database. This information will guide users in optimizing their ChIP-seq assays, for instance by selecting antibody sources that have been shown to perform well in public datasets, as described by the NGS-QC indicators associated with the related studies in our database.

Application
Comparative analyses of next-generation sequencing (NGS)-generated profiles, such as ChIP-seq, RNA-seq, GRO-seq or MeDIP-seq, require prior characterization of the degree of technical similarity of the datasets, as individual profiles can vary significantly even between biological replicates. The use of different antibodies, batch-to-batch variation of the same antibody, sequencing depth and immunoprecipitation (IP) quality are only a few of the parameters that affect the quality of a ChIP-seq profile. The NGS-QC Generator infers global and local quality indicators with a stand-alone approach that requires no additional wet-lab effort. This computational approach generates read count intensity (RCI) profiles from randomly selected subsets of the total mapped reads (TMRs) of the NGS profile under study, and measures how far the RCIs after sampling diverge from those theoretically expected relative to the original profile. For this, TMRs are first randomly sampled at three densities (90%, 70% and 50%; referred to hereafter as the s90, s70 and s50 subsets, respectively); then the genomic RCI profile is recorded in successive 500 bp bins and compared to that of the original profile. This comparison evaluates the divergence from the ideal condition, in which the RCI per bin of, for example, an s50 subset corresponds to 50% of the original RCI per bin. Importantly, sampled NGS profiles always diverge from this hypothesized "ideal behaviour", to different degrees, thereby providing a quantifiable measure (referred to as profile "robustness") that is linked to the quality of any NGS-generated profile.
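
The sampling idea described above can be illustrated with a minimal Python sketch. It is not the NGS-QC Generator implementation: the function names (`bin_counts`, `sampling_divergence`, `summarize`) are hypothetical, reads are reduced to start coordinates on a single chromosome, and the summary statistic (mean absolute percent deviation per bin) is an assumption chosen for illustration rather than one of the actual QC indicators.

```python
import random
from collections import Counter

BIN_SIZE = 500                      # bp per bin, as stated in the text
SAMPLING_PERCENTS = (90, 70, 50)    # the s90, s70 and s50 subsets


def bin_counts(positions, bin_size=BIN_SIZE):
    """Read count intensity (RCI) per bin for a list of read start positions."""
    return Counter(pos // bin_size for pos in positions)


def sampling_divergence(positions, percent, rng, bin_size=BIN_SIZE):
    """Percent deviation, per bin, of the sampled RCI from the expected RCI.

    The expected RCI of a bin after sampling `percent` % of the reads is
    `percent` % of its original RCI (the "ideal behaviour" in the text).
    """
    fraction = percent / 100
    original = bin_counts(positions, bin_size)
    n_sampled = round(len(positions) * fraction)
    # Sample reads without replacement, then re-bin the subset.
    sampled = bin_counts(rng.sample(positions, n_sampled), bin_size)
    divergence = {}
    for bin_index, rci in original.items():
        expected = fraction * rci
        observed = sampled.get(bin_index, 0)
        divergence[bin_index] = 100.0 * (observed - expected) / expected
    return divergence


def summarize(positions, seed=0):
    """Mean absolute percent deviation per subset (illustrative summary only).

    Assumes `positions` is a non-empty list.
    """
    rng = random.Random(seed)
    summary = {}
    for percent in SAMPLING_PERCENTS:
        divergence = sampling_divergence(positions, percent, rng)
        total = sum(abs(d) for d in divergence.values())
        summary[f"s{percent}"] = total / len(divergence)
    return summary


if __name__ == "__main__":
    # Toy data (made up): one enriched 500 bp region plus sparse background.
    toy_rng = random.Random(42)
    enriched = [toy_rng.randrange(10_000, 10_500) for _ in range(2_000)]
    background = [toy_rng.randrange(0, 1_000_000) for _ in range(2_000)]
    print(summarize(enriched + background))
```

In a sketch like this, bins holding many reads tend to stay close to the expected fraction after sampling, whereas sparsely covered bins fluctuate strongly, which is the intuition behind treating the divergence as a measure of robustness.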

Our database can be used to:
  • Evaluate the quality of your favourite ChIP-seq or enrichment-related NGS dataset relative to the datasets present in the entire database through our customized Galaxy platform instance
  • "Google" the QC descriptors for thousands of publicly available NGS generated datasets
  • Search for the QC descriptor ranges of a particular transcription factor, histone modification, or a specific technology, like FAIRE-seq
  • Download QC reports and local QC indicator profiles for a given dataset
  • Visualize the QC of a range of targets in dynamic plots
  • Visit the "database" for a plethora of additional options