Description of Cluster Analysis Algorithms used in GenePool

GenePool takes a cluster analysis approach: cases and controls are treated as two separate clusters, and the SNPs whose allelic frequency differences maximize the separation between the clusters are selected. Figure 1 shows this concept for Affymetrix genotyping arrays. We have omitted here a discussion of platform-specific details, such as the concept of quartets as used in Affymetrix genotyping arrays and beads in the Illumina arrays. An in-depth discussion as it applies to analyzing pooling data can be found in [1][7]. In
the case of individual genotyping, calling algorithms are used to ascertain
homozygous (AA or BB) or heterozygous (AB or BA) SNPs. In the case of pooling, it
is also possible to approximate the relative allele
frequency for a SNP using the proportional abundance of the A and B
alleles. This is the key concept in pooling: the relative ratio of the A and B
alleles correlates with the percent distribution of an allele within a pool. The
question, then, is how best to transform the data in a manner that has an
intuitive meaning. Various formulae are employed for this data transformation,
but the concept of Relative Allele Signal (RAS) is perhaps the most intuitive
formula that still retains correlation to the allele frequency within the pool.
Here, RAS = A/(A+B),
i.e., the ratio of the signal arising from the A allele to the total signal. RAS values
close to 1 indicate A allele homozygosity (AA case),
values close to 0 indicate B allele homozygosity (BB
case), and intermediate values between 0 and 1 are obtained depending on the
relative abundance of the A and B alleles for the SNP. Alternate approaches to
computing RAS values include using a k-correction
factor to account for SNP-specific uneven amplification and/or hybridization. In that
case, the modified formula is kRAS_i = A/(A + k_i B), where kRAS_i is the predicted allelic
frequency and k_i
is a SNP-dependent correction factor for the ith SNP. Similarly, arctan(B/A)
could also be used. In the results presented in the study, we have used these
three RAS value transformations along with several different algorithms for
cluster analysis, as described next.
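
As a minimal sketch of the three transformations just described, the functions below compute RAS, the k-corrected RAS, and the arctan transform from a pair of allele intensities. The function names and the example intensities are illustrative assumptions, not part of GenePool; intensities are assumed to be non-negative and not both zero.

```python
import math


def ras(a, b):
    """Relative Allele Signal: share of the total signal that comes from allele A."""
    return a / (a + b)


def k_ras(a, b, k):
    """k-corrected RAS, A / (A + k*B); k is the SNP-specific correction factor."""
    return a / (a + k * b)


def arctan_ras(a, b):
    """Angle transform arctan(B/A), in radians.

    math.atan2(b, a) equals arctan(b / a) for positive a and also
    handles a == 0 (pure B signal) without a division by zero.
    """
    return math.atan2(b, a)


if __name__ == "__main__":
    # Hypothetical intensities for one SNP: mostly A signal.
    print(ras(900.0, 100.0))
    print(k_ras(900.0, 100.0, k=2.0))
    print(arctan_ras(900.0, 100.0))
```

With k = 1 the corrected formula reduces to plain RAS, which gives a quick sanity check when experimenting with correction factors.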

Figure 1: Example of Cluster Analysis on RAS values from Affymetrix Data

Mathematically,
given a data set consisting of N features
(SNPs) and their values (quantified allelic frequency differences) for two
classes C0 (controls) and C1 (cases), the goal is to
analytically identify features that provide the best discrimination between
cases and controls. Let X(i,j), i = 1, ..., n0, ..., (n0+n1 = N); j = 1, ..., P,
be a two-dimensional data matrix representing the allelic frequencies of a
single SNP genotyped on n0
control and n1 case
replicates. For Affymetrix, the dimensions 1, ..., P
correspond to quartets. For Illumina, P = 1, i.e., each data point represents
the allelic frequency for an individual bead. Let Ci, i ∈ {0, 1}, denote the two classes. The
separation between the two classes can be quantified using several methods, such as
the Silhouette distance [2],
and SNPs are ranked in descending order of
separation. The higher the separation, the more likely a SNP is associated
with the phenotype. In addition, several different mathematical formulae, such
as the Euclidean distance and the Manhattan distance, can be used to compute the
pair-wise difference between data points. Within the GenePool
framework, we have implemented several cluster analysis methods;
mathematical details can be found in the Appendix. Thus, combining the various
approaches to computing RAS values with the various cluster analysis methods provides several
possibilities for analyzing the data. Each combination examines the data in a
different way and could potentially generate a different list of prioritized
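SNPs.

Pair-wise Distance between Two Vectors: Let X and Y be two vectors, each with n features, and let X_i and Y_i denote the value of the ith feature. The Euclidean distance (Euc) and the Manhattan distance (Man) take their standard forms:

\[ \mathrm{Euc}(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}, \qquad \mathrm{Man}(X, Y) = \sum_{i=1}^{n} |X_i - Y_i| \]

A Modified Manhattan distance (Mod) is defined as well. In general, we denote the pair-wise distance by d(X, Y), where d can be any of the methods above.

Relative Allele Strength: Let A and B denote the image intensity
values obtained by interrogating the two possible alleles of a bi-allelic SNP.
The relative allele strength, calculated using one of the different formulae, indicates the
homozygous or heterozygous nature of the SNP as follows. For A/(A+B) and A/(A+kB), values close to 1 indicate AA, values close to 0 indicate BB, and intermediate values indicate a mix of the two alleles. For arctan(B/A), values close to 0 indicate AA and values close to π/2 indicate BB.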
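
The Euclidean and Manhattan distances named in this appendix can be sketched in a few lines. The function names and the `DISTANCES` lookup table are assumptions for illustration; the Modified Manhattan distance is left out because its formula is not given in the text.

```python
import math


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    """Manhattan distance: sum of the absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# Hypothetical lookup keyed by the abbreviations used in the text, so a
# cluster analysis routine can accept the distance d as a parameter.
DISTANCES = {"Euc": euclidean, "Man": manhattan}


if __name__ == "__main__":
    x, y = [0.0, 0.0], [3.0, 4.0]
    for name, d in DISTANCES.items():
        print(name, d(x, y))
```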

Cluster Analysis Methods: Let Ci, i ∈ {0, 1}, denote the two classes.

Silhouette Statistic [5]: The silhouette value for a point is a measure of how
similar that point is to points in its own class compared to points in other
classes, and ranges from -1 to +1. It is defined as

\[ s(i) = \frac{\min_k b(i,k) - a(i)}{\max\left(a(i), \min_k b(i,k)\right)} \]

where a(i) is the average distance from the ith point to the other points in its cluster, and b(i,k) is the average distance from the ith
point to the points in another cluster k. In a case-control study there are only
two clusters, and therefore the formula reduces to

\[ s(i) = \frac{b(i) - a(i)}{\max\left(a(i), b(i)\right)} \]

where b(i) is the average
distance from the ith point to the points in the other
cluster. For a given SNP, the silhouette value is the average
of the silhouette values of all case and control points. Values closer to +1
indicate higher separation between cases and controls, and the SNP is then more
likely to be associated with the phenotype being studied.

Centroid Distance: The centroid distance is the pair-wise distance between the
centroids of the two classes.
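
The sketch below shows one way the two-cluster silhouette value and the centroid distance could be computed for a single SNP and then used to rank SNPs in descending order of separation. The function names, the data layout (each replicate is a list of P values, each class is a list of replicates), and the default Manhattan distance are assumptions made for illustration, not GenePool's actual code.

```python
def manhattan(x, y):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def mean_distance(point, others, dist):
    """Average distance from one point to every point in `others`."""
    return sum(dist(point, o) for o in others) / len(others)


def silhouette_two_class(controls, cases, dist=manhattan):
    """Average of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    if len(controls) < 2 or len(cases) < 2:
        raise ValueError("each class needs at least two replicates")
    values = []
    for own, other in ((controls, cases), (cases, controls)):
        for idx, point in enumerate(own):
            rest = own[:idx] + own[idx + 1:]       # same class, excluding the point itself
            a = mean_distance(point, rest, dist)   # within-class average distance
            b = mean_distance(point, other, dist)  # average distance to the other class
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
    return sum(values) / len(values)


def centroid(points):
    """Per-dimension mean of a list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def centroid_distance(controls, cases, dist=manhattan):
    """Pair-wise distance between the centroids of the two classes."""
    return dist(centroid(controls), centroid(cases))


def rank_snps(snp_data, score=silhouette_two_class):
    """Return SNP ids sorted by descending separation score.

    `snp_data` maps a SNP id to a (controls, cases) pair.
    """
    scores = {snp: score(ctrl, case) for snp, (ctrl, case) in snp_data.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Made-up RAS values with P = 1, two replicates per class.
    data = {
        "snp_mixed": ([[0.4], [0.6]], [[0.5], [0.55]]),
        "snp_separated": ([[0.1], [0.2]], [[0.8], [0.9]]),
    }
    print(rank_snps(data))
```

Passing `score=centroid_distance` to `rank_snps` ranks the same data by centroid distance instead, which mirrors the idea that each combination of transformation and method can yield its own prioritized list.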

Dunn Index [6]: The Dunn index is defined as the ratio of the minimum distance
between two clusters to the size of the largest cluster.
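
A small sketch of the Dunn index for the two-class case follows. The text does not spell out how the distance between clusters or the cluster size is measured, so this example assumes two common choices: the smallest point-to-point distance across the two classes, and the cluster diameter (largest within-class pair-wise distance). The function names and the Manhattan default are likewise illustrative.

```python
from itertools import combinations


def manhattan(x, y):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def diameter(points, dist):
    """Largest pair-wise distance within one cluster (assumed size measure)."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def dunn_index(controls, cases, dist=manhattan):
    """Minimum between-class distance divided by the largest class diameter."""
    min_between = min(dist(p, q) for p in controls for q in cases)
    largest = max(diameter(controls, dist), diameter(cases, dist))
    if largest == 0:
        raise ValueError("both clusters have zero diameter; the ratio is undefined")
    return min_between / largest


if __name__ == "__main__":
    # Made-up RAS values with P = 1.
    print(dunn_index([[0.1], [0.2]], [[0.8], [0.9]]))
```

Larger values indicate clusters that are far apart relative to their spread, so SNPs could be ranked by this score in the same way as by the silhouette value.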

Consistency Score: The consistency score measures the extent to which the
differences in magnitude between corresponding dimensions of two vectors are
in the same direction (positive or negative).
If the direction of the consistency is to be maintained, the
actual value of Consistency(X, Y) is used instead of its absolute value.
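
The exact formula for Consistency(X, Y) is not given in the text. Purely as an assumed illustration of the idea, the sketch below scores two vectors by the mean sign of their per-dimension differences: +1 when X exceeds Y in every dimension, -1 when it is lower in every dimension, and values near 0 when the directions disagree. The function names and the `directional` flag are hypothetical; the flag switches between the signed value and its absolute value, as the paragraph above describes.

```python
def sign(value):
    """Return +1, 0, or -1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def consistency(x, y, directional=False):
    """Assumed consistency score: mean sign of (x_i - y_i) over all dimensions.

    This is one plausible reading of the description, not GenePool's formula.
    With directional=False the absolute value is returned.
    """
    score = sum(sign(xi - yi) for xi, yi in zip(x, y)) / len(x)
    return score if directional else abs(score)


if __name__ == "__main__":
    cases = [0.62, 0.70, 0.66, 0.71]     # made-up per-dimension values
    controls = [0.50, 0.52, 0.49, 0.55]
    print(consistency(cases, controls))
    print(consistency(controls, cases, directional=True))
```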

Abbreviations: The abbreviations used in the text and figures for the
various cluster analysis methods are as follows: Silhouette Index (Sil),
Centroid (Centr), Dunn Index (DunIn),
Consistency Undirectional (CoUdr),
Consistency Directional (ConDir), T-Test (Ttest), Modified T-Test (MdfdT).
File last modified: Fri Jul 18 09:40:22 2008