genepool
NAME
|
genepool − analyze genotyping data from pooled
genomic DNA
|
SYNOPSIS
|
gpcommand [options]
gpextract [options]
gpanalyze [options]
|
DESCRIPTION
|
GenePool is a software package that provides analysis
tools for the detection of shifts in relative allele
frequency between pooled genomic DNA from cases and controls
using SNP-based genotyping microarrays. GenePool is
currently Affymetrix-centric; however, development efforts are
underway to add the ability to incorporate data from other
platforms, including Illumina.
The GenePool system consists of two executable programs,
gpextract(1) and gpanalyze(1), and one Perl
script, gpcommand(1):
gpextract uses the Affymetrix Fusion SDK library
to extract intensity values from Affymetrix CEL files and
write them to a customized, more compact binary file format.
The new files have the same name as the original CEL file
but with the string .gpb (GenePool Binary) appended
to each filename.
gpanalyze takes the intensity values from the .gpb
files and uses a variety of data analysis methods to assign
a score to each SNP, where the score indicates how
significant the observed difference in allele frequency
between the hybridizations for the two DNA pools is. Note that
the scores are not p-values.
gpcommand is a Perl script that helps users run
basic pooling analyses by reading a configuration file and
automatically invoking the other two GenePool programs. New
GenePool users should almost certainly start with
gpcommand and move on to direct use of
gpextract and gpanalyze once they are
confident that they understand the system.
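
The naming rule described above for gpextract output (the original CEL
filename with .gpb appended) can be sketched in a few lines of Python.
This is only an illustration: the helper names gpb_name and
missing_gpb_files are hypothetical and not part of GenePool, and the
sketch assumes the .gpb files are looked for in a directory you name.

```python
from pathlib import Path


def gpb_name(cel_file: str) -> str:
    """Return the CEL filename with '.gpb' appended, as described for gpextract."""
    return cel_file + ".gpb"


def missing_gpb_files(cel_files, directory="."):
    """List the CEL files that have no matching .gpb file in `directory`.

    Assumption: the .gpb files are expected in `directory`; the man page
    does not say where gpextract writes them.
    """
    base = Path(directory)
    return [
        cel for cel in cel_files
        if not (base / gpb_name(Path(cel).name)).exists()
    ]
```

A check like this can be run before gpanalyze to confirm that every
CEL file has been through gpextract.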
|
|
The gpextract program uses the Affymetrix Fusion
SDK library to read the native CEL and CDF files so GenePool
users will have to get a copy of the Fusion SDK source code
if they want to compile GenePool from source code. This adds
a dependency to the GenePool system but saves us having to
create and maintain code to read all of the different types
and versions of Affymetrix files. The Fusion SDK is written
in C++, which is why gpextract is written in C++, while
gpanalyze is written in C, which the GenePool
developers are more comfortable with.
|
|
The Affymetrix Fusion SDK relies on the "C"
version of the Apache Xerces XML Parser, so this is
effectively also a dependency for GenePool. If in the future we
determine that GenePool is not using any Affymetrix code
that uses Xerces, we may be able to remove Xerces as a
dependency for GenePool.
|
INSTALLATION
|
There are two distributions for GenePool - binary and
source - and each has its own installation instructions. In
general, if a binary distribution is available for your
machine architecture and operating system (Intel x86 Linux,
PowerPC Mac, SPARC Solaris, etc.) then your easiest option is
to try the binary distribution first. If the binaries don't
work for you then please contact us with the details so we
can try to remedy the problem. If there is not a suitable
binary distribution for you or you would like to get the
best possible performance from GenePool then you should
probably try installing from source code.
|
|
1.
|
|
Obtain a copy of the GenePool binary distribution file
that matches your operating system and architecture and
uncompress it. A command something like:
|
|
gunzip -c GenePool-bin-linux-0.2.0.tar.gz | tar xvf -
should work on most unix machines.
|
|
2.
|
|
Execute ./configure to prepare the local
setup for installation. If you do not have
permissions to install programs and man pages into the
/usr/local/ directory then you will need to specify
an alternative install location using a command of the form
./configure --prefix=/your/alternate/path.
|
|
3.
|
|
There is nothing to build so you can skip the usual
"make" step and go straight to executing make
install to install the executables and man pages.
|
|
You should now be ready to use GenePool. Two caveats: to run
the executables, the directory where you installed them
(/usr/local/bin by default) needs to be in your PATH
environment variable; and to see the man pages, the directory
where you installed them (/usr/local/man by default)
needs to be in your MANPATH. If you cannot run the programs or
see the man pages then you may need to have your system
administrator help you set up your PATH and MANPATH
environment variables.
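
The two caveats above can be checked with a short Python sketch. It is
an illustration under stated assumptions, not part of GenePool: the
helper names dir_on_search_path and report are hypothetical, and the
default directories are simply the /usr/local defaults mentioned above.

```python
import os
import shutil


def dir_on_search_path(directory, variable, environ=None):
    """Return True if `directory` is listed in the given environment variable."""
    env = os.environ if environ is None else environ
    entries = [e for e in env.get(variable, "").split(os.pathsep) if e]
    target = os.path.normpath(directory)
    return any(os.path.normpath(entry) == target for entry in entries)


def report(bin_dir="/usr/local/bin", man_dir="/usr/local/man"):
    """Print where each GenePool program resolves and whether the dirs are listed."""
    for name in ("gpcommand", "gpextract", "gpanalyze"):
        print(name, "->", shutil.which(name) or "not found on PATH")
    print("bin dir on PATH:", dir_on_search_path(bin_dir, "PATH"))
    print("man dir on MANPATH:", dir_on_search_path(man_dir, "MANPATH"))
```

If you installed with --prefix, pass the bin/ and man/ directories
under that prefix to report() instead of relying on the defaults.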
|
|
To compile GenePool from source, you will need 3 source
code distributions - GenePool, Affymetrix Fusion SDK, and
Apache Xerces XML parser (C-version). Full instructions for
obtaining each are given below. If you are not compiling on
Linux, you may also need a copy of the GNU Autotools
(autoconf and automake) so that you can construct a valid
configure script (see step 3 below).
At some point during the Affymetrix download process, you
will be required to log into the Affymetrix Developer
Network (ADN), which means you must register for an ADN
account if you haven't already. Registering for the ADN is
free and is probably a good idea if you are regularly
analyzing Affymetrix data since the ADN pages contain useful
data files and software as well as forums where Affymetrix
software developers will answer questions. You will also
have to accept the license terms for the Affymetrix Fusion
library or the download will be blocked.
|
|
1.
|
|
Obtain a copy of the GenePool source code distribution
file and uncompress it. A command something like:
|
|
gunzip -c GenePool-0.0.2.tar.gz | tar xvf -
should work on most unix machines.
|
|
2.
|
|
Obtain and compile a copy of the Affymetrix Fusion SDK
and the "C" version of the Apache xerces XML
processor as detailed in the Compiling the Affymetrix
Fusion SDK section below. The Fusion and xerces source
code will be used to create a Fusion library file
(libfusion.a) and we will need that library plus the
original Fusion and xerces header files (.h) during
compilation and linking of the GenePool executables
gpextract and gpanalyze.
|
|
3.
|
|
The GenePool source code distribution contains a
configure script but it was built on a Linux box so
if you are compiling on any other platform you may need to
regenerate this script to have it correctly tailored to your
platform. To do this you will need a copy of autoconf
and automake which are part of the GNU Autotools.
Assuming you have these tools installed, all you need to do
is execute autoreconf which will read the
configure.ac file and regenerate configure.
|
|
4.
|
|
Execute ./configure to configure the local
installation ready for compilation. If you do not have
permissions to install programs and man pages into the
/usr/local/ directory then you will need to specify
an alternative install location using a command of the form
./configure --prefix=/your/alternate/path.
|
|
5.
|
|
Execute make to compile and link the source
code.
|
|
6.
|
|
Execute make install to install the executables
and man pages.
|
|
You should now be ready to use GenePool. Two caveats: to run
the executables, the directory where you installed them
(/usr/local/bin by default) needs to be in your PATH
environment variable; and to see the man pages, the directory
where you installed them (/usr/local/man by default)
needs to be in your MANPATH. If you cannot run the programs or
see the man pages then you may need to have your system
administrator help you set up your PATH and MANPATH
environment variables.
|
|
Compiling the Affymetrix Fusion SDK
|
|
You should have already completed step 1 of the Source
Code Installation section above, so you should
have a directory containing the uncompressed source code for
GenePool. To create a compiled Affymetrix Fusion SDK library
ready for linking with the GenePool source code:
|
|
1.
|
|
Go to the Affymetrix website
(http://www.affymetrix.com), and click on the
Support tab at the top of the page. On the next page,
click on the Developer Network link from the menu of
links on the left side of the page. This will take you to
the home page of the Affymetrix Developer Network (ADN)
where a link to the Fusion SDK is available. When you reach
the download page, you'll want the "Full SDK".
|
|
2.
|
|
Copy the Fusion SDK distribution file (usually called
something like affy-fusion-release-107.zip)
into the GenePool source code directory and unzip
it. This should create a directory called affy/ in
which case you can safely skip step 3. If unzipping the
fusion distribution creates a directory called
cvs-head or any name other than affy/ then you
will need to do step 3.
|
|
3.
|
|
Edit the SDK_DIR variable in
Makefile.FusionSDK so that it points to the
"root" of the Affymetrix Fusion code that was
uncompressed in step 2. The "root" directory is
called sdk/ and it should contain many
subdirectories including calvin_files/,
files/, and file_formats/. You may have to
browse through the fusion distribution to find the sdk/
directory.
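
Before running make, it can help to confirm that the directory you put
in SDK_DIR really is the sdk/ root described above. The following
Python sketch checks only for the three subdirectories named in this
step; the helper name missing_sdk_subdirs is hypothetical, and a real
Fusion SDK release contains more than these three.

```python
from pathlib import Path

# Subdirectories that this step says the sdk/ root should contain.
EXPECTED_SUBDIRS = ("calvin_files", "files", "file_formats")


def missing_sdk_subdirs(sdk_dir):
    """Return the expected subdirectories that are absent from `sdk_dir`."""
    root = Path(sdk_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list suggests that the path is a plausible value for SDK_DIR;
any names returned mean you are probably one level too high or too low
in the unpacked distribution.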
|
|
4.
|
|
Go to the Apache Xerces XML parser website
(http://xerces.apache.org), and click on the
Xerces C link in the menu on the left margin of the
page. On the next page, click on the Download link in
the menu on the left margin of the page. You should now be
on the Download page for the C version of the Xerces XML
parser so scroll down until you find a section titled
Current Source Releases of Xerces-C. You can download
the .zip or .tar.gz file but we’ll
assume you took the .tar.gz version.
|
|
5.
|
|
Place the Xerces distribution file (usually called
xerces-c-current.tar.gz) inside the GenePool
source code directory and uncompress it. A command something
like:
|
|
gunzip -c xerces-c-current.tar.gz | tar xvf -
should work on most unix machines.
|
|
6.
|
|
Edit the XERCES_ROOT variable in
Makefile.FusionSDK so that it points to the
"root" of the xerces-c code that was uncompressed
in step 5. The xerces directory name usually incorporates
the version number (for example xerces-c-src_2_7_0/)
so you are almost certainly going to have to edit the
default xerces directory that appears in
Makefile.FusionSDK.
|
|
7.
|
|
Execute make --file=Makefile.FusionSDK which will
compile and link the Fusion and xerces code and create a
libfusion.a library that we can link gpextract
and gpanalyze against. This process could take up to
10 minutes depending upon the power of your CPU.
|
FILE FORMATS
|
The GenePool binaries gpextract and
gpanalyze generate and process many different plain
text files. This section provides a brief outline of the role
and format of each of these files. Unless specified
otherwise, all plain text files should be tab-delimited and
should have Unix-style line endings - a single
"LineFeed" character.
|
|
The Experiment.txt file is read by gpanalyze and contains a line
for each platform in the analysis. A platform is a chip type
so a pooling experiment run on the Affymetrix 10K platform
would have a single line but an experiment run on the
Affymetrix 100K platform would have 2 lines - one for the
Hind chips and one for the Xba chips. An Affymetrix 500K
experiment would also have 2 lines - one for the Sty chips
and one for the Nsp chips. All current Illumina HumanHap
platforms are single chips not chipsets so Illumina-based
experiments will only have a single line in
Experiment.txt.
|
|
CasesFile NumCasesFile ControlsFile NumControlsFile SNPNames
|
|
where the description of each item is:
|
|
|
CasesFile - file containing a list of Cases datafiles
NumCasesFile - number of files listed in CasesFile
ControlsFile - file containing a list of Controls datafiles
NumControlsFile - number of files listed in ControlsFile
SNPNames - file containing IDs of the SNPs on the chip
|
|
For example, an Experiment.txt file for an
Affymetrix 500K pooling experiment might look like:
|
|
CasesStyFiles.txt 6 ControlStyFiles.txt 6 StySnpNames.txt
CasesNspFiles.txt 6 ControlNspFiles.txt 6 NspSnpNames.txt
|
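
The five-column layout of Experiment.txt can be read with a short
Python sketch such as the one below. It is an illustration, not code
from GenePool: the names Platform and parse_experiment are
hypothetical, and skipping blank lines is an assumption about how
forgiving a reader should be.

```python
from typing import NamedTuple


class Platform(NamedTuple):
    """One line of Experiment.txt, i.e. one chip type in the experiment."""
    cases_file: str
    num_cases: int
    controls_file: str
    num_controls: int
    snp_names: str


def parse_experiment(text):
    """Parse tab-delimited Experiment.txt content into Platform records."""
    platforms = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_no}: expected 5 tab-separated fields, "
                f"got {len(fields)}"
            )
        cases, n_cases, controls, n_controls, snps = fields
        platforms.append(
            Platform(cases, int(n_cases), controls, int(n_controls), snps)
        )
    return platforms
```

Applied to the Affymetrix 500K example above, this yields two records,
one for the Sty chips and one for the Nsp chips.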
|
The CasesFile contains a list of the names of the
datafiles that gpextract produced for Cases. This file is
required by gpanalyze and its name is placed in the
first column of the Experiment.txt file detailed
above. If the file is not in the current directory then the
filename should contain an absolute pathname. Each filename
should be placed on a separate line as shown in the example
below:
|
|
CasesNsp1.cel.gpb
CasesNsp2.cel.gpb
CasesNsp3.cel.gpb
CasesNsp4.cel.gpb
CasesNsp5.cel.gpb
|
|
The structure of the ControlsFile is identical to that of the
CasesFile detailed above, but the contents are a
list of the gpextract processed datafiles for
Controls. This file is required by gpanalyze and its
name is placed in the third column of the
Experiment.txt file detailed above.
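
Because Experiment.txt records how many files each list contains, the
two can be cross-checked. The Python sketch below does this for either
a Cases or a Controls list. The function names read_file_list and
check_file_count are hypothetical, and treating a mismatch as an error
is an assumption; the man page does not say how gpanalyze reacts.

```python
def read_file_list(text):
    """Return the non-empty lines of a Cases or Controls list file."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def check_file_count(list_text, declared_count):
    """Return the listed filenames, or raise if the count does not match.

    `declared_count` is the NumCasesFile or NumControlsFile value taken
    from the matching line of Experiment.txt.
    """
    names = read_file_list(list_text)
    if len(names) != declared_count:
        raise ValueError(
            f"list has {len(names)} files but Experiment.txt "
            f"declares {declared_count}"
        )
    return names
```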
|
|
The SNPNames file contains information about each SNP on the
platform. This file is required by gpanalyze and its
name is placed in the fifth column of the
Experiment.txt file detailed above. This file is
generated differently for Affymetrix and Illumina chips
since the SNPs appearing on an Affymetrix chip are
predetermined whereas each Illumina chip may contain a
slightly different number of SNPs since some SNPs may be
represented by too few beads to be considered a valid
measurement.
|
|
In the case of both Affymetrix and Illumina platforms
the file contains the same 3 columns of data with one line
for each SNP:
|
|
SNPName SerialNo DefaultRank
|
|
SerialNo MUST be UNIQUE for every SNPName. It is
used to keep track of the order in which SNPs were extracted
from the raw image intensity files. It is a 9-digit integer
which allows for arrays with up to 999 million SNPs. By
default, the DefaultRank field is set to 1 for each SNP in
the SNPNames file. This field is used in a multistage
analysis where gpanalyze creates a new SNPNames file
for each analysis stage and populates the DefaultRank field
with the rank of the SNP in that stage which allows
subsequent stages to filter based on rank.
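
The constraints on the SNPNames columns (three tab-delimited fields, a
unique SerialNo of at most 9 digits) can be expressed as a small Python
check. This is a sketch under assumptions: parse_snp_names is a
hypothetical name, and treating DefaultRank as an integer follows from
the description above rather than from the GenePool source.

```python
MAX_SERIAL = 999_999_999  # largest value that fits in 9 digits


def parse_snp_names(text):
    """Parse SNPNames content into (name, serial, rank) tuples.

    Raises ValueError on a malformed line, an out-of-range SerialNo,
    or a SerialNo that appears more than once.
    """
    seen = set()
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields")
        name, serial_text, rank_text = fields
        serial = int(serial_text)
        if not 0 <= serial <= MAX_SERIAL:
            raise ValueError(f"line {line_no}: SerialNo out of range")
        if serial in seen:
            raise ValueError(f"line {line_no}: duplicate SerialNo {serial}")
        seen.add(serial)
        records.append((name, serial, int(rank_text)))
    return records
```

The check is only for inspecting a file that gpextract produced; it is
not a way to build a SNPNames file by hand.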
|
|
The order in which the SNPs occur in this file is
critical to a successful analysis - they must be in
EXACTLY the same order as the SNPs occur in the
datafiles output by gpextract. Since the datafiles
produced by gpextract are in binary format there is
no way for a user to work out the order of the SNPs within
the file so the user cannot expect to create the
SNPNames file manually. Every gpextract run
produces a SNPNames file in addition to the datafile
and within a platform, the SNP order produced by
gpextract is always the same so for each platform the
user just has to use one of these SNPNames files
produced by gpextract. Typically the filename
includes the enzyme type, for example: NspSnpNames.txt,
HindSnpNames.txt, XbaSnpNames.txt, etc.
|
|
For Illumina data, a SNP could be missing on one or more
chips so the SNPs must be ordered in increasing numerical
order which matches the order in which gpextract
writes Illumina intensities into the .gpb binary
datafile.
|
|
Output.txt is the analysis file output by gpanalyze. It
contains 5 columns:
|
|
SNPName Score CaseValues ControlValues SerialNo
|
|
where SNPName is the identifier for the SNP,
Score shows the degree of separation computed using
the chosen analysis algorithm, CaseValues and
ControlValues indicate how many case/control points
were available, and SerialNo is the same as in the
description provided above for the SNPNames file.
|
|
Note that for Illumina, the values for CaseValues
and ControlValues will be equal to the total number
of beads on the cases and controls chips for this SNP
whereas for Affymetrix, the values are calculated as being
NumberOfChips*NumberOfQuartets.
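
As a worked illustration of the Affymetrix case, the Python sketch
below parses one line of the five-column output and compares
CaseValues with NumberOfChips*NumberOfQuartets. The names OutputRow,
parse_output_line and affymetrix_value_count are hypothetical, reading
Score as a floating-point number is an assumption, and the figure of
10 quartets per SNP in the usage note is made up for the example.

```python
from typing import NamedTuple


class OutputRow(NamedTuple):
    """One line of the five-column gpanalyze output."""
    snp_name: str
    score: float
    case_values: int
    control_values: int
    serial_no: int


def parse_output_line(line):
    """Parse one tab-delimited output line into an OutputRow."""
    name, score, cases, controls, serial = line.rstrip("\n").split("\t")
    return OutputRow(name, float(score), int(cases), int(controls), int(serial))


def affymetrix_value_count(num_chips, quartets_per_snp):
    """Number of values for Affymetrix: NumberOfChips*NumberOfQuartets."""
    return num_chips * quartets_per_snp
```

For instance, with 6 case chips and a hypothetical 10 quartets per
SNP, affymetrix_value_count(6, 10) gives 60, which is the CaseValues
figure you would then expect to see for that SNP.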
|
|
The first five columns of this file are the same as for
Output.txt but it has a 6th column which contains the
rank. This file is sorted in descending numerical order on
the score field.
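
The relationship between the two files can be sketched in Python: sort
the five-column rows in descending order of score and append the rank
as a sixth column. The function name rank_rows is hypothetical.
Starting ranks at 1 and leaving tied scores in their input order are
assumptions; the man page does not say how gpanalyze numbers ranks or
breaks ties.

```python
def rank_rows(rows):
    """Sort rows by score (index 1), highest first, and append a rank.

    `rows` is an iterable of 5-item sequences laid out as
    (SNPName, Score, CaseValues, ControlValues, SerialNo).
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [tuple(row) + (rank,) for rank, row in enumerate(ordered, start=1)]
```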
|
|
The user can optionally supply this file which contains
annotation information about the SNPs on the arrays. If
supplied, it allows gpanalyze to produce annotated
versions of the Output.txt file. The file contains 4
columns:
|
|
SNPId dbSNPId Chromosome Base
|
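
A minimal Python sketch of reading the four annotation columns into a
lookup table keyed on SNPId is shown below. The name load_annotation
is hypothetical, and two details are assumptions rather than
documented behavior: Chromosome is kept as text, since it may not
always be numeric, and Base is read as an integer position.

```python
def load_annotation(text):
    """Map each SNPId to a (dbSNPId, Chromosome, Base) tuple."""
    table = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 fields")
        snp_id, dbsnp_id, chromosome, base = fields
        table[snp_id] = (dbsnp_id, chromosome, int(base))
    return table
```

A table like this is enough to attach a chromosome and position to
each SNPName found in an output file.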
|
8. ChromosomeSortedAnnotated.txt
|
|
This output file is sorted ascending on chromosome
(primary key) and basepair location (secondary key). The
file has 5 columns:
|
|