ParAllele BioScience ParAllele Technology and Applications

Linkage Disequilibrium SNP Mapping Studies

Whole Genome Association Studies


The Problem: Genetics of Complex Disease

For many complex diseases, genetics plays an important role in predicting whether an individual develops the disease. To date, however, relatively few genes have been identified whose variant alleles account for a significant portion of the risk. In schizophrenia, for example, the net genetic contribution, calculated from studies comparing disease concordance in monozygotic twins and adopted siblings, is nearly 60%. Yet the most strongly associated gene discovered to date, Neuregulin, is calculated to have only a 2% genetic contribution. Clearly, other associated genes remain to be identified that, when combined, have significantly higher predictive power. Finding them requires evaluating a large number of genes and SNPs. Schizophrenia is not an isolated case; genetic variants strongly associated with diabetes, asthma, osteoporosis, cancer, and many other common conditions have yet to be identified.

Neuregulin alone does not account for the genetic predisposition to schizophrenia.

Figure 1: Predictive power derived from an average of several studies reported in Sadock and Sadock, a comprehensive textbook on psychiatry

The Solution: Gene-Based Genome-wide SNP Analysis

A broad search of gene-based SNPs offers the best opportunity to find the most complete set of associated genes. Because gene-based regions make up one-third of the genome, it is initially most effective to target a given number of SNPs to such regions (within 5kb of a transcribed sequence). The structure of the genome, as determined by the HapMap, can be used to choose targeted SNPs rather than random ones. The gene-based coverage of 20,000 targeted SNPs is better than that of 120,000 random genome-wide SNPs (Figure 3). The statistical power of 100,000 targeted (gene-based) SNPs exceeds that of 400,000 random (genome-wide) SNPs for effects in genic regions. Calculations indicate that whole genome scans with these numbers of SNPs can have sufficient statistical power to detect many disease alleles with a relative risk of 1.5 or more in studies with 1,000 cases and 1,000 controls.
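The kind of power calculation mentioned above can be illustrated with a minimal sketch. It is not the calculation behind the figures quoted here; it assumes a simple allelic two-proportion z-test, a multiplicative risk model, a rare disease (so the control allele frequency stands in for the population frequency), and a perfectly tagged causal allele. The control allele frequency of 0.2 and the per-SNP significance level of 5e-7 in the usage lines are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist


def case_allele_freq(control_freq, relative_risk):
    """Risk-allele frequency among cases under a multiplicative model.

    Assumption: the disease is rare, so the control frequency approximates
    the population frequency.
    """
    weighted = control_freq * relative_risk
    return weighted / (weighted + (1.0 - control_freq))


def allelic_test_power(control_freq, relative_risk, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allelic z-test at level alpha.

    Each individual contributes two alleles, so the sample sizes are doubled.
    The negligible probability of rejecting in the wrong tail is ignored.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, relative_risk)
    std_err = (p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls)) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / std_err - z_crit)


# Hypothetical inputs: allele frequency 0.2, relative risk 1.5,
# 1,000 cases and 1,000 controls, per-SNP alpha of 5e-7.
power = allelic_test_power(0.2, 1.5, 1000, 1000, 5e-7)
print(f"approximate power: {power:.2f}")
```

Varying the allele frequency and relative risk in this sketch shows how quickly power falls for rarer alleles or weaker effects, which is why study size matters as much as SNP count.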

Accurate Measurements Are Required for Predictive Results

Accurate identification of genes of interest requires well-characterized affected and control populations as well as a highly accurate SNP genotyping platform. Compromising on either significantly reduces the power of the study. The key factors that affect the accuracy of results include the following:
  1. Pooling Patient Samples
    Pooling patient samples for SNP detection considerably reduces the accuracy of results, both because allele frequency estimates carry errors and because individual genotypes are lost. For typical complex markers, this results in a loss of more than 80% of predictive power. Pooling also compromises data analysis by discarding the information that allows patient stratification, haplotype mapping, and subgrouping.

    Pooling patient samples reduces predictive power.

    Figure 2: Predictive power calculated assuming 100% accuracy for no error, 99% accuracy for genotyping individual samples, and 95% accuracy for genotyping pooled samples

  2. Data Analysis Seeding
    Data analysis typically involves a trade-off between false positives and false negatives. Setting the significance threshold too leniently (that is, accepting too large a p-value) generates many more false positives, which in turn act as poor seeds for downstream analysis of gene-gene interactions. This compounding effect severely complicates data analysis and compromises the reliability of results. High accuracy is critical for generating reliable results.
  3. SNP Choice
    We have shown that the choice of SNPs is an important factor in whole genome studies. High conversion from in-silico selected SNPs to high-accuracy assays is necessary to ensure that intelligent SNP choice translates to maximally informative data. ParAllele has the highest conversion rate from database SNPs to validated assays.
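The false-positive trade-off described under Data Analysis Seeding can be made concrete with simple arithmetic. If none of the tested SNPs is truly associated, the expected number of false positives is the number of tests multiplied by the per-SNP p-value threshold. The sketch below uses a Bonferroni correction as one common, conservative way to pick that threshold; the source does not state which correction is used, and the family-wise level of 0.05 is an assumed illustrative value.

```python
def expected_false_positives(n_tests, per_test_alpha):
    """Expected false positives when every tested SNP is truly null."""
    return n_tests * per_test_alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


n_snps = 100_000  # size of the full gene-based panel described in this document
print(expected_false_positives(n_snps, 0.05))
print(bonferroni_threshold(n_snps))
```

With 100,000 SNPs, an uncorrected threshold of 0.05 would be expected to flag about 5,000 null SNPs, each of which would then be a poor seed for interaction analysis.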

The ParAllele Whole Genome Solution

The ParAllele whole genome panel comprises a set of 100,000 gene-based SNP assays designed in five panels of 20,000 assays each. Each panel analyzes 20,000 SNPs across all gene-based regions in the human genome. Together, the 100,000 SNP assays efficiently cover all genes in the human genome. Our flexible approach allows users to perform scans using 20,000 to 100,000 or more SNPs, depending on the genetic model underlying the disease and the study size.

For every gene, SNPs are chosen in the contiguous region from 5kb upstream of the first exon through 5kb downstream of the last exon. All introns are included. Within these intragenic regions, SNPs are prioritized to maximize the r² correlation coefficient between the tag SNPs and those SNPs genotyped in the International HapMap project. ParAllele's bioinformatics and statistical genetics teams developed the algorithms for SNP selection based on this strategy. You can find specifications for the first 20,000-SNP panel in the Products and Services portion of this website.
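For readers unfamiliar with the r² measure, the sketch below shows the standard pairwise calculation from phased haplotypes and a simple way to score how well a set of tag SNPs captures other SNPs. It is not ParAllele's selection algorithm, which is not described here; the function names and the 0/1 haplotype encoding are assumptions made for illustration.

```python
def r_squared(hap_a, hap_b):
    """Pairwise r^2 between two biallelic SNPs.

    Each argument lists the allele (0 or 1) carried at that SNP by the same
    phased haplotypes, in the same order.
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype lists must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def best_tag_r2(tag_haplotypes, target_haplotypes):
    """For each target SNP, the highest r^2 achieved by any tag SNP."""
    return [
        max((r_squared(tag, target) for tag in tag_haplotypes), default=0.0)
        for target in target_haplotypes
    ]


# Toy data: four haplotypes, one tag SNP, two target SNPs.
tag = [0, 0, 1, 1]
targets = [[1, 1, 0, 0], [0, 1, 0, 1]]
print(best_tag_r2([tag], targets))
```

In the toy data the first target is perfectly correlated with the tag (its alleles are simply flipped) and the second is independent of it, so a selection procedure would need an additional tag to capture the second SNP.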

Intragenic coverage greater than 80% within 40kb

Figure 3: This metric measures how effectively the 20,000-assay panel covers the intragenic regions that comprise one-third of the human genome, displayed as a cumulative distribution. The vertical axis is the percentage of all intragenic bases in the genome that are within X base pairs (bp) of an SNP assay in the gene-based 20,000 and random 120,000 SNP panels. The horizontal axis is the distance X in bp. Note that 83% of all intragenic bases are within 40kb of an assay in the 20,000 panel. For comparison, 75% of all intragenic bases are within 40kb of an assay in the 120,000 random SNP panel.
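The coverage metric in Figure 3 can be sketched as follows. This is a toy-scale illustration, not the code used to produce the figure: it assumes a single chromosome, intragenic regions given as half-open (start, end) coordinate pairs, and a base-by-base loop that would be too slow for a real genome. All names are hypothetical.

```python
from bisect import bisect_left


def nearest_distance(pos, sorted_assays):
    """Distance in bp from pos to the closest assay position."""
    i = bisect_left(sorted_assays, pos)
    candidates = []
    if i < len(sorted_assays):
        candidates.append(sorted_assays[i] - pos)  # nearest assay at or after pos
    if i > 0:
        candidates.append(pos - sorted_assays[i - 1])  # nearest assay before pos
    return min(candidates)


def fraction_within(regions, assay_positions, max_distance):
    """Fraction of intragenic bases within max_distance bp of any assay."""
    assays = sorted(assay_positions)
    total = sum(end - start for start, end in regions)
    if not assays or total == 0:
        return 0.0
    covered = sum(
        1
        for start, end in regions
        for pos in range(start, end)
        if nearest_distance(pos, assays) <= max_distance
    )
    return covered / total


# Toy example: one 100 bp region with a single assay at position 50.
regions = [(0, 100)]
assays = [50]
for x in (10, 25, 50):
    print(x, fraction_within(regions, assays, x))
```

Evaluating `fraction_within` over a range of distances yields the cumulative distribution plotted in the figure, so two panels can be compared by computing the curve for each set of assay positions.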