
Description of Cluster Analysis Algorithms used in GenePool

GenePool takes a cluster analysis approach: cases and controls are treated as two separate clusters, and the SNPs whose allelic frequency differences maximize the separation between the clusters are selected. Figure 1 illustrates this concept for Affymetrix genotyping arrays. Platform-specific details, such as quartets in Affymetrix genotyping arrays and beads in Illumina arrays, are not discussed here. An in-depth discussion as it applies to the analysis of pooling data can be found in [1, 7].

In individual genotyping, calling algorithms are used to determine whether a SNP is homozygous (AA or BB) or heterozygous (AB or BA). In pooling, it is also possible to approximate the relative allele frequency for a SNP from the proportional abundance of the A and B alleles. This is the key concept in pooling: the relative ratio of the A and B alleles correlates with the percent distribution of an allele within a pool. The question, then, is how best to transform the data in a manner that has an intuitive meaning. Various formulae are employed for this data transformation, but the Relative Allele Signal (RAS) is perhaps the most intuitive one that still retains correlation with the allele frequency within the pool. Here, RAS = A/(A+B), i.e., the ratio of the signal arising from the A allele to the total signal. RAS values close to 1 indicate A allele homozygosity (AA case), values close to 0 indicate B allele homozygosity (BB case), and intermediate values between 0 and 1 are obtained depending on the relative abundance of the A and B alleles for the SNP. An alternative approach uses a k-correction factor to account for SNP-specific uneven amplification and/or hybridization. In that case, the modified formula is kRASi = A/(A+kiB), where kRASi is the predicted allelic frequency and ki is a SNP-dependent correction factor for the ith SNP. Similarly, arctan(B/A) can also be used. In the results presented in the study, we used these three RAS value transformations along with several different algorithms for cluster analysis, as described next.
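The three transformations described above can be sketched in a few lines of Python. This is only an illustration, not GenePool's implementation: the function names and the sample intensities are hypothetical, and the value of k is an arbitrary example rather than a calibrated correction factor.

```python
import math

def ras(a, b):
    """Relative Allele Signal: A / (A + B). Assumes A + B > 0."""
    return a / (a + b)

def k_ras(a, b, k):
    """k-corrected RAS: A / (A + k*B), with k a SNP-specific correction factor."""
    return a / (a + k * b)

def arctan_ras(a, b):
    """arctan(B/A) in radians; atan2 avoids a division by zero when A == 0."""
    return math.atan2(b, a)

# Hypothetical A and B intensities for one SNP in one pool.
a_signal, b_signal = 300.0, 100.0
print(ras(a_signal, b_signal))         # 0.75
print(k_ras(a_signal, b_signal, 1.5))  # 300 / 450
print(arctan_ras(a_signal, b_signal))  # angle in radians, between 0 and pi/2
```

For non-negative intensities, `math.atan2(b, a)` equals arctan(B/A) and returns pi/2 when A is zero, which is why it is used here instead of `math.atan(b / a)`.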

 

Figure 1: Example of Cluster Analysis on RAS values from Affymetrix Data

 

Mathematically, given a data set consisting of N features (SNPs) and their values (quantified allelic frequency differences) for two classes, C0 (controls) and C1 (cases), the goal is to analytically identify the features that best discriminate between cases and controls. Let X(i,j), i = 1, ..., n0 + n1; j = 1, ..., P, be a two-dimensional data matrix representing the allelic frequencies of a single SNP genotyped on n0 control and n1 case replicates. For Affymetrix, the dimensions 1, ..., P correspond to quartets. For Illumina, P = 1, i.e., each data point represents the allelic frequency for an individual bead. Let Ci, i ∈ {0, 1}, denote the control and case classes, respectively. Let d(i, j) denote the pair-wise difference between two data points. Let D(C0, C1) denote the separation between the two classes, computed from d(i, j) using some analysis method. Then D(C0, C1) provides a basis for ranking SNPs by their allelic frequency differences between cases and controls.

The separation between the two classes can be quantified using several methods, such as the silhouette distance [2], and SNPs are ranked in descending order of separation. The higher the separation, the more likely a SNP is to be associated with the phenotype. In addition, several different distance formulae, such as the Euclidean distance and the Manhattan distance, can be used to compute the pair-wise difference between data points. Within the GenePool framework, we have implemented several cluster analysis methods; mathematical details can be found in the Appendix. Thus, combining the various approaches for computing RAS values with the cluster analysis methods provides several ways to analyze the data. Each combination examines the data in a different way and can potentially generate a different list of prioritized SNPs.
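The ranking step can be illustrated with a short Python sketch. The SNP identifiers and separation values below are made up for the example, and `rank_snps` is a hypothetical helper, not a GenePool function.

```python
def rank_snps(separation_by_snp):
    """Return SNP identifiers sorted by descending separation D(C0, C1)."""
    return sorted(separation_by_snp, key=separation_by_snp.get, reverse=True)

# Hypothetical separation scores, one per SNP.
scores = {"snp_a": 0.12, "snp_b": 0.85, "snp_c": 0.40}
print(rank_snps(scores))  # ['snp_b', 'snp_c', 'snp_a']
```

Running the same ranking with a different RAS transformation or separation method would generally reorder the list, which is the point made in the paragraph above.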

Pair-wise Distance between Two Vectors:

Let X and Y be two vectors, each with n features. Let Xi and Yi denote the value of the ith feature. Then the Euclidean distance (Euc), Manhattan distance (Man), and Modified Manhattan distance (Mod) are defined respectively as follows:

In general, we denote the pair-wise distance by d(X,Y), where d can be any of the methods above.
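As an illustration, the Euclidean and Manhattan distances can be written in Python using their standard textbook definitions (square root of the sum of squared differences, and sum of absolute differences). The Modified Manhattan distance is not included because its formula is not given in the text. The function names are illustrative only.

```python
import math

def euclidean(x, y):
    """Standard Euclidean distance: sqrt(sum_i (Xi - Yi)^2)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """Standard Manhattan distance: sum_i |Xi - Yi|."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan([0.0, 0.0], [3.0, 4.0]))  # 7.0
```

Either function can play the role of d(X,Y) in the cluster analysis methods that take a pair-wise distance as input.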

Relative Allele Signal: Let A and B denote the image intensity values obtained by interrogating the two possible alleles of a bi-allelic SNP. The relative allele signal calculated using the different formulae indicates the homozygous or heterozygous nature of the SNP as follows.

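Formula     | Denoted By                         | Value for AA homozygous | Value for AB heterozygous | Value for BB homozygous
------------|------------------------------------|-------------------------|---------------------------|------------------------
A/(A+B)     | AAB                                | 1                       | 0.5                       | 0
A/(A+kB)    | AkB (k = correction factor [3, 4]) | 1                       | Depends on k              | Depends on k
arctan(B/A) | ATn                                | 0                       | π/4                       | π/2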

 

Cluster Analysis Methods:

Let Ci, i ∈ {0, 1}, denote the control (C0) and case (C1) classes, containing n0 and n1 data points respectively. Let d(i, j) denote the pair-wise difference between two n-dimensional data points Xi and Xj. Let D(C0, C1) denote the separation between the two classes. D(C0, C1) can be computed from d(i, j) using the various analysis methods described next.

Silhouette Statistic[5]:

The silhouette value for a point measures how similar that point is to points in its own class compared with points in other classes, and ranges from -1 to +1. It is defined as

    s(i) = (min_k b(i,k) - a(i)) / max(a(i), min_k b(i,k))

where a(i) is the average distance from the ith point to the other points in its cluster, and b(i,k) is the average distance from the ith point to points in another cluster k. In the case-control study there are only two clusters, and therefore the formula reduces to

    s(i) = (b(i) - a(i)) / max(a(i), b(i))

where b(i) is the average distance from the ith point to the points in the other cluster.

For a given SNP, the silhouette value is the average of the silhouette values of all case and control points. Values closer to +1 indicate greater separation between cases and controls, and the SNP is more likely to be associated with the phenotype being studied.
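A minimal Python sketch of the two-class silhouette statistic follows. It is an illustration under stated assumptions rather than GenePool's code: data points are tuples of RAS values, the Manhattan distance stands in for d, a point that is alone in its class is given a silhouette of 0, and a point with a(i) = b(i) = 0 is also given 0. The sample values are hypothetical.

```python
def manhattan(x, y):
    """Pair-wise distance d; any other distance function could be passed instead."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def mean(values):
    return sum(values) / len(values)

def silhouette_score(controls, cases, dist):
    """Average of the two-class silhouette values over all case and control points."""
    scores = []
    for own, other in ((controls, cases), (cases, controls)):
        for idx, point in enumerate(own):
            rest = [p for j, p in enumerate(own) if j != idx]
            if not rest:
                scores.append(0.0)  # assumption: singleton class scores 0
                continue
            a = mean([dist(point, p) for p in rest])   # within-class average
            b = mean([dist(point, p) for p in other])  # between-class average
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)

# Hypothetical RAS values (P = 1) for two control and two case replicates.
controls = [(0.10,), (0.20,)]
cases = [(0.80,), (0.90,)]
print(silhouette_score(controls, cases, manhattan))  # close to +1: well separated
```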

Centroid Distance:

Centroid distance is the pair-wise distance between the centroids of the two classes.
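A short Python sketch of the centroid distance, assuming the centroid of a class is the per-dimension mean of its data points and using the Manhattan distance as an example of d. The names and sample points are hypothetical.

```python
def manhattan(x, y):
    """Pair-wise distance d; any other distance function could be passed instead."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def centroid(points):
    """Per-dimension mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def centroid_distance(controls, cases, dist):
    """D(C0, C1) as the distance between the two class centroids."""
    return dist(centroid(controls), centroid(cases))

controls = [(0.0, 0.0), (2.0, 0.0)]  # centroid (1.0, 0.0)
cases = [(1.0, 3.0), (1.0, 5.0)]     # centroid (1.0, 4.0)
print(centroid_distance(controls, cases, manhattan))  # 4.0
```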

Dunn Index[6]:

The Dunn index is defined as the ratio between the minimum distance between two clusters and the size of the largest cluster.
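The following Python sketch shows one common way to compute this ratio for two classes. It rests on two assumptions that the text does not spell out: the distance between the clusters is taken as the smallest distance between any control point and any case point, and the size of a cluster is taken as its diameter (the largest pair-wise distance within it). Names and sample data are hypothetical.

```python
from itertools import combinations

def manhattan(x, y):
    """Pair-wise distance d; any other distance function could be passed instead."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def diameter(points, dist):
    """Largest pair-wise distance within one class (0 for a single point)."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)

def dunn_index(controls, cases, dist):
    """Minimum between-class distance divided by the largest class diameter."""
    min_between = min(dist(p, q) for p in controls for q in cases)
    largest = max(diameter(controls, dist), diameter(cases, dist))
    if largest == 0:
        return float("inf")  # assumption: classes with no spread are fully separated
    return min_between / largest

controls = [(0.0,), (1.0,)]
cases = [(3.0,), (5.0,)]
print(dunn_index(controls, cases, manhattan))  # 2.0 / 2.0 = 1.0
```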

Consistency Score:

Consistency score measures the extent to which the differences between corresponding dimensions of two vectors have the same direction (positive or negative).

If the direction of the consistency is to be preserved, the signed value of Consistency(X, Y) is used instead of its absolute value.
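The exact formula for Consistency(X, Y) is not given in the text, so the Python sketch below is only one plausible reading of the description: it averages the signs of the per-dimension differences Xi - Yi and returns either the signed value (directional) or its absolute value (undirectional). The function name, the `directional` flag, and the sample vectors are assumptions made for illustration.

```python
def sign(v):
    """Return +1, 0, or -1 according to the sign of v."""
    return (v > 0) - (v < 0)

def consistency(x, y, directional=False):
    """Assumed form: mean sign of (Xi - Yi); absolute value unless directional."""
    score = sum(sign(xi - yi) for xi, yi in zip(x, y)) / len(x)
    return score if directional else abs(score)

# Hypothetical per-dimension values for two vectors.
x = [0.6, 0.7, 0.8]
y = [0.5, 0.5, 0.5]
print(consistency(x, y))                    # 1.0: all differences are positive
print(consistency(y, x, directional=True))  # -1.0: all differences are negative
```

Under this reading, a score near 0 means the differences point in mixed directions across dimensions.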

Abbreviations

The abbreviations used in the text/figures for various cluster analysis methods are as follows:

Silhouette Index (Sil), Centroid (Centr), Dunn Index (DunIn), Consistency Undirectional (CoUdr), Consistency Directional (ConDir), T-Test (Ttest), Modified T-Test (MdfdT)

REFERENCES

  1. Tembe, W.D., et al. Analysis Software for High-density Pooled Genotyping Data. in International Conference on Bioinformatics and Computational Biology (BIOCOMP 2007). 2007. Las Vegas.

  2. Lovmar, L., et al., Silhouette scores for assessment of SNP genotype clusters. BMC Genomics, 2005. 6(1): p. 35.

  3. Craig, D.W., et al., Identification of disease causing loci using an array-based genotyping approach on pooled DNA. BMC Genomics, 2005. 6: p. 138.

  4. Hoogendoorn, B., et al., Cheap, accurate and rapid allele frequency estimation of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum Genet, 2000. 107(5): p. 488-93.

  5. Rousseeuw, P., Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 20: p. 53-65.

  6. Azuaje, F., A cluster validity framework for genome expression data. Bioinformatics, 2002. 18(2): p. 319-20.

  7. Pearson, J.V., et al., Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Hum Genet, 2007. 80(1): p. 126-39.

COPYRIGHT

GenePool is copyright 2006-2008 by The Translational Genomics Research Institute. All rights reserved. This License is limited to, and you may use the Software solely for, your own internal and non-commercial use for academic and research purposes. Without limiting the foregoing, you may not use the Software as part of, or in any way in connection with the production, marketing, sale or support of any commercial product or service. For commercial use, please contact licensing@tgen.org. By installing this Software you are agreeing to the terms of the LICENSE file distributed with this software.

In any work or product derived from the use of this Software, proper attribution of the authors as the source of the software or data must be made. The following URL should be cited:

http://bioinformatics.tgen.org/software/genepool/