genepool
NAME
|
genepool − analyze genotyping data from pooled
genomic DNA
|
SYNOPSIS
DESCRIPTION
|
GenePool is a software package that provides analysis
tools for the detection of shifts in relative allele
frequency between pooled genomic DNA from cases and controls
using SNP-based genotyping microarrays. GenePool uses a file
format that accommodates and describes data from multiple
genotype chip platforms. The code is structured so that
third parties can extend it with their own algorithms,
which makes GenePool an expandable, cross-platform
code base. GenePool can be compiled on Microsoft Windows,
OS X, and UNIX-like operating systems.
GenePool consists of one executable program,
genepool. Extraction and analysis of data are controlled
through configuration files. To perform an extraction,
use the -e option with a configuration file as its
argument. Based on the settings in the configuration
file, a Universal Data File (.udf) is generated.
Analysis is performed in the same way, using the -a
option with a configuration file as its argument.
Analysis takes intensity values from the .udf file and
uses a variety of data analysis methods, chosen by the
user, to assign a score to each SNP. The score indicates
how significant the observed difference in allele
frequency is between the hybridizations for the two DNA
pools. Note that the scores are not p-values.
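As a minimal sketch of the two-step workflow described above, the following shell fragment runs an extraction followed by an analysis. The file names extraction.ini and analysis.ini and the helper run_genepool are hypothetical; only the -e and -a options come from this page. The guard skips the run when genepool or the configuration files are not present.

```shell
# Hypothetical file names; only -e and -a come from this page.
EXTRACT_INI=extraction.ini
ANALYSIS_INI=analysis.ini

run_genepool() {
    # $1: extraction ini, $2: analysis ini
    # Extraction writes the .udf file that the analysis step reads.
    genepool -e "$1" || return 1
    genepool -a "$2"
}

if command -v genepool >/dev/null 2>&1 &&
   [ -f "$EXTRACT_INI" ] && [ -f "$ANALYSIS_INI" ]; then
    run_genepool "$EXTRACT_INI" "$ANALYSIS_INI"
else
    echo "genepool or its configuration files not found; skipping run" >&2
fi
```

The analysis configuration must name the .udf file that the extraction produced, so the two steps are normally run in this order.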
|
|
The genepool -e option uses the Affymetrix Fusion
SDK library to read the native CEL and CDF files, so users
who want to compile GenePool from source will need a copy of
the Fusion SDK source code. This adds a dependency to
GenePool but saves us from having to create and maintain code
to read all of the different types and versions of
Affymetrix files. The Fusion SDK is written in C++, which is
why genepool is also written in C++.
|
|
The Affymetrix Fusion SDK relies on the "C"
version of the Apache Xerces XML Parser, so this is
effectively also a dependency for GenePool. If in the future
we determine that GenePool is not using any Affymetrix code
that uses Xerces, we may be able to remove Xerces as a
dependency for GenePool.
|
|
The Boost library is used throughout the code for its
easy-to-use utility functions.
|
|
Standard Template Library (STL)
|
The STL is used throughout the code for its various
collection types.
|
INSTALLATION
|
There are several distributions for GenePool, which can
be broken down into binary and source. The different
binaries have been compiled to work on specific operating
systems. In general, if a binary distribution is available
for your machine architecture and operating system (Intel
x86 Linux, PowerPC Mac, SPARC Solaris, Windows, etc.), then
your easiest option is to try the binary distribution first.
If the binaries don't work for you, please contact us
with the details so we can try to remedy the problem. If
there is no suitable binary distribution for you, or you
would like to get the best possible performance from
GenePool, you should install from source code.
|
|
Binary Installation for UNIX-like
environments.
|
1.
|
|
Obtain a copy of the GenePool binary distribution file
that matches your operating system and architecture and
uncompress it. A command something like:
|
|
gunzip -c GenePool_bin-x86_32-linux-gnu-0.9.1.tar.gz | tar xvf -
should work on most Unix machines.
|
|
2.
|
|
Execute ./configure to prepare the local
installation. If you do not have permission to install
programs and man pages into the /usr/local/ directory,
you will need to specify an alternative install location
using a command of the form
./configure --prefix=/your/alternate/path.
|
|
3.
|
|
There is nothing to build so you can skip the usual
"make" step and go straight to executing make
install to install the executables and man pages.
|
|
You should now be ready to run GenePool. Two caveats: to
run the executables, the directory where you installed them
(/usr/local/bin by default) needs to be in your PATH
environment variable; and to see the man pages, the directory
where you installed them (/usr/local/man by default)
needs to be in your MANPATH. If you can't run the programs or
see the man pages, you may need to have your system
administrator help you set up your PATH and MANPATH
environment variables.
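For a Bourne-style shell, the PATH and MANPATH settings described above might look like the following sketch. The PREFIX variable is introduced here for illustration and is set to the default location; change it if you passed --prefix to configure. Adding the lines to your shell startup file (for example ~/.profile) makes them permanent.

```shell
# Assumes the default install prefix; change PREFIX if you used --prefix.
PREFIX=/usr/local

# Prepend the install locations so the shell and man can find GenePool.
# If MANPATH was empty this leaves a trailing colon, which man-db
# interprets as "also search the default locations".
PATH="$PREFIX/bin:$PATH"
MANPATH="$PREFIX/man:${MANPATH:-}"
export PATH MANPATH
```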
|
|
Binary Installation for Windows
environments.
|
1.
|
|
Obtain a copy of the GenePool binary distribution file
that matches your operating system and architecture and then
uncompress the file.
|
|
2.
|
|
Locate the directory where you uncompressed the file and
double click on gpgui_Setup.msi.
|
|
3.
|
|
Select a folder to install GenePool to.
|
|
The folder should then contain both GenePool.exe and
gpgui.exe. GenePool.exe can be run from the DOS prompt.
gpgui.exe is a graphical user interface that allows
users to create configuration files and run the extraction,
normalization and analysis processes.
|
|
Source Code Installation for Unix-like
environments.
|
To compile GenePool from source, you will need four source
code distributions: GenePool, Affymetrix Fusion SDK, Boost,
and Apache Xerces XML parser (C-version). Full instructions
for obtaining each are given below. If you are not compiling
on the platform the supplied configure script was built on,
you may also need a copy of the GNU Autotools
(autoconf and automake) so that you can construct a valid
configure script (see step 4 below).
At some point during the Affymetrix download process, you
will be required to log into the Affymetrix Developer
Network (ADN), which means you must register for an ADN
account if you haven't already. Registering for the ADN is
free and is probably a good idea if you regularly
analyze Affymetrix data, since the ADN pages contain useful
data files and software as well as forums where Affymetrix
software developers answer questions. You will also
have to accept the license terms for the Affymetrix Fusion
library or the download will be blocked.
|
|
1.
|
|
Obtain a copy of the GenePool source code distribution
file and uncompress it. A command something like:
|
|
gunzip -c GenePool-0.9.1.tar.gz | tar xvf -
should work on most Unix machines.
|
|
2.
|
|
Obtain and compile a copy of the Affymetrix Fusion SDK
and the "C" version of the Apache Xerces XML
parser as detailed in the Compiling the Affymetrix
Fusion SDK section below. The Fusion and Xerces source
code will be used to create a Fusion library file
(libfusion.a). That library, plus the original Fusion and
Xerces header files (.h), is needed during
compilation and linking of the GenePool executable.
|
|
3.
|
|
Download and install the Boost library from
http://www.boost.org.
|
|
4.
|
|
The GenePool source code distribution contains a
configure script, but it was built on an OS X machine, so if
you are compiling on any other platform you may need to
regenerate this script to have it correctly tailored to your
platform. To do this you will need a copy of autoconf
and automake, which are part of the GNU Autotools.
Assuming you have these tools installed, all you need to do
is execute autoreconf, which will read the
configure.ac file and regenerate configure.
|
|
5.
|
|
Execute ./configure to prepare the source tree for
compilation. If you do not have permission to install
programs and man pages into the /usr/local/ directory,
you will need to specify an alternative install location
using a command of the form
./configure --prefix=/your/alternate/path.
|
|
6.
|
|
Execute make to compile and link the source
code.
|
|
7.
|
|
Execute make install to install the executables
and man pages.
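The configure, build and install steps above can be condensed into a small shell function. This is a sketch under stated assumptions: the function name build_genepool, the PREFIX value and the REGENERATE switch are hypothetical, and the Fusion library (libfusion.a) is assumed to have been built already. Call the function from the top of the uncompressed GenePool source directory.

```shell
# Hypothetical install prefix for users without write access to /usr/local.
PREFIX="$HOME/genepool"

build_genepool() {
    # Refuse to run outside the GenePool source directory.
    if [ ! -f configure.ac ]; then
        echo "configure.ac not found; run this from the GenePool source directory" >&2
        return 1
    fi
    # Regenerate configure only when asked (REGENERATE=yes), that is,
    # when the shipped script does not suit this platform.
    if [ "${REGENERATE:-no}" = yes ]; then
        autoreconf || return 1
    fi
    ./configure --prefix="$PREFIX" || return 1
    make || return 1        # compile and link
    make install            # install executables and man pages
}
```

For example, REGENERATE=yes build_genepool rebuilds the configure script first, while a plain build_genepool uses the script shipped with the distribution.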
|
|
You should now be ready to run GenePool. Two caveats: to
run the executables, the directory where you installed them
(/usr/local/bin by default) needs to be in your PATH
environment variable; and to see the man pages, the directory
where you installed them (/usr/local/man by default)
needs to be in your MANPATH. If you can't run the programs or
see the man pages, you may need to have your system
administrator help you set up your PATH and MANPATH
environment variables.
|
|
Compiling the Affymetrix Fusion SDK
|
You should already have completed step 1 of the Source
Code Installation section above, so you should have a
directory containing the uncompressed source code for
GenePool. To create a compiled Affymetrix Fusion SDK library
ready for linking with the GenePool source code:
|
|
1.
|
|
Go to the Affymetrix website
(http://www.affymetrix.com), and click on the
Support tab at the top of the page. On the next page,
click on the Developer Network link from the menu of
links on the left side of the page. This will take you to
the home page of the Affymetrix Developer Network (ADN)
where a link to the Fusion SDK is available. When you reach
the download page, you'll want the "Full SDK".
|
|
2.
|
|
Copy the Fusion SDK distribution file (usually called
something like affy-fusion-release-107.zip)
into the GenePool source code directory and unzip
it. If this creates a directory called affy/, you can
safely skip step 3. If unzipping the
Fusion distribution creates a directory called
cvs-head or any name other than affy/, you
will need to do step 3.
|
|
3.
|
|
Edit the SDK_DIR variable in
Makefile.FusionSDK so that it points to the
"root" of the Affymetrix Fusion code that was
uncompressed in step 2. The "root" directory is
called sdk/ and it should contain many
subdirectories, including calvin_files/,
files/, and file_formats/. You may have to
browse through the Fusion distribution to find the sdk/
directory.
|
|
4.
|
|
Go to the Apache Xerces XML parser website
(http://xerces.apache.org), and click on the
Xerces C link in the menu on the left margin of the
page. On the next page, click on the Download link in
the menu on the left margin of the page. You should now be
on the Download page for the C version of the Xerces XML
parser so scroll down until you find a section titled
Current Source Releases of Xerces-C. You can download
the .zip or .tar.gz file but we’ll
assume you took the .tar.gz version.
|
|
5.
|
|
Place the Xerces distribution file (usually called
xerces-c-current.tar.gz) inside the GenePool
source code directory and uncompress it. A command something
like:
|
|
gunzip -c xerces-c-current.tar.gz | tar xvf -
should work on most Unix machines.
|
|
6.
|
|
Edit the XERCES_ROOT variable in
Makefile.FusionSDK so that it points to the
"root" of the Xerces-C code that was uncompressed
in step 5. The Xerces directory name usually incorporates
the version number (for example xerces-c-src_2_7_0/),
so you will almost certainly have to edit the
default Xerces directory that appears in
Makefile.FusionSDK.
|
|
7.
|
|
Execute make --file=Makefile.FusionSDK, which will
compile and link the Fusion and Xerces code and create a
libfusion.a library that the GenePool executable
can be linked against. This process could take up to
10 minutes depending on the speed of your CPU.
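The two Makefile.FusionSDK edits and the final make call can also be scripted. The sketch below assumes, without having checked the shipped makefile, that SDK_DIR and XERCES_ROOT are each assigned on a single line of the form NAME = value. The helper set_make_var and the path affy/sdk are hypothetical; xerces-c-src_2_7_0 is the example directory name used above.

```shell
# Rewrite a "NAME = value" line in a makefile.
# Assumes the variable is assigned with "=" on one line starting with NAME.
set_make_var() {
    # $1: makefile, $2: variable name, $3: new value
    sed "s|^$2[[:space:]]*=.*|$2 = $3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Hypothetical locations; use the directories created when you unpacked
# the Fusion SDK and Xerces distributions.
if [ -f Makefile.FusionSDK ]; then
    set_make_var Makefile.FusionSDK SDK_DIR affy/sdk
    set_make_var Makefile.FusionSDK XERCES_ROOT xerces-c-src_2_7_0
    make --file=Makefile.FusionSDK
fi
```

If the makefile assigns these variables differently (for example with := or across several lines), edit it by hand instead.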
|
|
Source Code Installation for Windows
environments.
|
The Windows version of genepool can be compiled using
Microsoft Visual C++ 2005. The entire project directory is
included. A copy of the Fusion library is included with the
project and is located under the ext directory. The Boost
library will need to be downloaded and installed from
http://www.boost.org in order to compile the code
on Windows.
|
FILE FORMATS
|
The GenePool binary processes many different plain text
files. This section provides a brief outline of the role and
format of each of these files. Unless specified otherwise,
all plain text files should be tab-delimited and should have
Unix-style line endings (a single "LineFeed"
character).
|
|
This file is read by the Extraction [-e
filename.ini] process and describes the extraction to
be performed. It is separated into blocks that describe
the experiment and the platforms in the analysis.
These blocks are named [EXPERIMENT] and [PLATFORM].
Only one .udf file is generated as a result of an
extraction process.
|
|
The [EXPERIMENT] block contains the name of the
experiment, a description of the experiment, and the
directory where all of the data files are held.
|
Name = Name of the experiment
Description = A brief description of the experiment
Directory = The directory where all of the data files
are kept
|
|
The [PLATFORM] block defines the platform for the
genotype arrays used in the experiment.
|
Illumina and Affymetrix differ in how raw data is
formatted: Affymetrix uses binary CEL files and Illumina
uses text files. The format of the [PLATFORM] block
therefore differs between the two vendors. Examples are
given in this section below. If more than
one platform is used during a data extraction, provide
different tags to logically separate them, such as
[PLATFORM1] and [PLATFORM2].
|
Vendor = Illumina or Affymetrix (Required)
AnnotationFileName = Annotation file; must be tab-delimited
with the columns VendorID, rs_ID, Chromosome, Position
CDFFileName = Mapping file (used by Affymetrix only)
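To illustrate the two blocks described above, the following sketch writes a small extraction configuration using only the keys documented on this page. Every value (experiment name, directory, file names) is a hypothetical placeholder. This page does not show how the raw data files themselves are listed, nor whether file names are resolved relative to Directory, so a real configuration will likely need more than this.

```shell
# All values below are hypothetical placeholders; only the block names
# and keys are taken from this page.
cat > extraction.ini <<'EOF'
[EXPERIMENT]
Name = PoolingPilot
Description = Case and control pools on one Affymetrix platform
Directory = /data/pooling_pilot

[PLATFORM]
Vendor = Affymetrix
AnnotationFileName = annotation.txt
CDFFileName = chip.cdf
EOF
```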
|
|
This file is read by the Analysis [-a
filename.ini] process and contains sections that
describe how the analysis will occur. These blocks are
[DATA], [GROUP], [STAGE], and [ANALYSIS].
|
The [DATA] block contains the full file path of the UDF
file. A UDF is required in order to run an analysis.
Currently an analysis file uses only one *.udf in the
[DATA] section.
The [GROUP] block defines a group of chips contained in
the UDF.
The [STAGE] block defines a single stage of how to analyze
the data.
The [ANALYSIS] block brings together all the defined blocks
for an analysis.
|
The parameters for the [DATA] block are:
UDF = The full path and name of the udf file. (Required)
|
The parameters for the [GROUP] block are:
NAME = User defined name of the group. (Required)
POOL = Average number of individuals in the pool.
(Required for SINGLEMARKER and MULTIMARKER)
CHIP = The name of the chip. (Required)
|
The parameters for the [STAGE] block are:
NAME = User defined name of the stage. (Required)
METHOD = (SILHOUETTE, CENTROID, TTEST, SINGLEMARKER,
MULTIMARKER) (Required)
DISTANCE = (EUCLIDIAN, MANHATTAN, MODIFIED-MANHATTAN)
(Required for SILHOUETTE and CENTROID)
RAS = (NORMAL, K-CORRECTION, ARCTAN) (Required)
FILTER = Number of top scoring SNPs for the next
stage. (User Defined)
LDFILE = Full path to the LDFile. (Required only for
MULTIMARKER)
K-CORRECTIONFILE = Full path to the K-Correction
file. (May be left out if RAS is not K-CORRECTION)
OUTPUTFILE = Full path and filename for stage output.
(Required)
|
The parameters for the [ANALYSIS] block are:
NAME = User defined name of the analysis.
GROUP1 = Name of first user defined group.
GROUP2 = Name of second user defined group.
STAGE1 = Name of user defined stage.
RUN = (ON, OFF) Determines whether the specific
analysis is run.
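As a sketch of how these blocks fit together, the following writes a single-stage analysis configuration that compares two groups with the TTEST method. All names and paths are hypothetical. Declaring the two groups by repeating the [GROUP] block is an assumption, since this page does not show how multiple groups are written.

```shell
# All names and paths are hypothetical. Repeating [GROUP] once per group
# is an assumption; the page does not show how two groups are declared.
cat > analysis.ini <<'EOF'
[DATA]
UDF = /data/pooling_pilot/pilot.udf

[GROUP]
NAME = cases
CHIP = case_chip_1

[GROUP]
NAME = controls
CHIP = control_chip_1

[STAGE]
NAME = ttest_stage
METHOD = TTEST
RAS = NORMAL
OUTPUTFILE = /data/pooling_pilot/ttest_scores.txt

[ANALYSIS]
NAME = pilot_analysis
GROUP1 = cases
GROUP2 = controls
STAGE1 = ttest_stage
RUN = ON
EOF
```

POOL, DISTANCE, LDFILE and K-CORRECTIONFILE are omitted because, according to the parameter lists above, none of them is required for METHOD = TTEST with RAS = NORMAL.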
|
|
The user can optionally supply this file, which contains
annotation information about the SNPs on the arrays. The
file contains 4 columns:
|
SNPId dbSNPId Chromosome Base
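The four-column layout can be illustrated with a short shell sketch that writes two tab-delimited records and checks the column count. The SNP identifiers and positions are invented for illustration, and no header row is written because this page does not say whether one is expected.

```shell
# Hypothetical SNP records; columns are SNPId, dbSNPId, Chromosome, Base.
# printf emits tabs between fields and a single LineFeed per line.
printf 'SNP_A-0000001\trs0000001\t1\t1000000\n' >  annotation.txt
printf 'SNP_A-0000002\trs0000002\tX\t2500000\n' >> annotation.txt

# Sanity check: every line must have exactly four tab-separated fields.
awk -F '\t' 'NF != 4 { bad = 1 } END { exit bad }' annotation.txt &&
    echo "annotation.txt looks well formed"
```

The same awk check can be pointed at a real annotation file to catch rows with missing or extra columns before running an extraction.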
|
|
This file is only useful for Affymetrix analyses. It
contains average allele frequencies for AA, AB and BB calls
for every probe quartet on a given chip and allows for the
calculation of quartet-specific k-correction factors.
Because the number of quartets differs between platforms,
this file will contain a variable number of columns; however,
the general pattern is:
|