genepool

NAME

genepool - analyze genotyping data from pooled genomic DNA

SYNOPSIS

genepool [options]

DESCRIPTION

GenePool is a software package that provides analysis tools for detecting shifts in relative allele frequency between pooled genomic DNA from cases and controls using SNP-based genotyping microarrays. GenePool uses a file format that accommodates and describes data from multiple genotyping chip platforms. The code is structured so that third parties can easily extend it with their own algorithms, making GenePool an expandable, cross-platform code base. GenePool can be compiled on Microsoft Windows, OS X, and UNIX-like operating systems.

GenePool consists of one executable program, genepool. Extraction and analysis of data are controlled through configuration files. To perform an extraction, use the -e option with a configuration file as its argument. Based on the settings in the configuration file, a Universal Data File (.udf) is generated. Analysis is performed in the same way, using the -a option with a configuration file as its argument. Analysis takes intensity values from the .udf file and uses a variety of data analysis methods, chosen by the user, to assign a score to each SNP. The score indicates how significant the observed difference in allele frequency is between the hybridizations for the two DNA pools. Note that the scores are not p-values.
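The two-step workflow described above (extract, then analyze) can be driven from a script. The following is a minimal sketch only: it assumes genepool is on your PATH and that it reports failure through a non-zero exit status, which this page does not confirm. The function names, mode labels and .ini file names are hypothetical.

```python
import subprocess

# Flags documented in this page; the mode names are labels chosen for this sketch.
FLAGS = {"extract": "-e", "analyze": "-a"}


def genepool_command(mode, config_file):
    """Build the argument list for one genepool run."""
    return ["genepool", FLAGS[mode], config_file]


def run_pipeline(extraction_ini, analysis_ini):
    """Run extraction, then analysis; stop if either step fails.

    Assumes genepool exits with a non-zero status on failure.
    """
    for mode, ini in (("extract", extraction_ini), ("analyze", analysis_ini)):
        subprocess.run(genepool_command(mode, ini), check=True)


# Placeholder file name; prints ['genepool', '-e', 'extraction.ini']
print(genepool_command("extract", "extraction.ini"))
```

Calling run_pipeline("extraction.ini", "analysis.ini") would run both steps in order; it is not invoked here because it needs an installed genepool and real configuration files.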

Dependencies

Affymetrix Fusion SDK.

The genepool -e option uses the Affymetrix Fusion SDK library to read the native CEL and CDF files, so users who want to compile GenePool from source will need a copy of the Fusion SDK source code. This adds a dependency to GenePool but saves us from having to create and maintain code to read all of the different types and versions of Affymetrix files. The Fusion SDK is written in C++, which is why genepool is written in C++.

Apache Xerces XML Parser

The Affymetrix Fusion SDK relies on the "C" version of the Apache Xerces XML Parser so this is effectively also a dependency for GenePool. If in future we determine that GenePool is not using any Affymetrix code that uses Xerces, we may be able to remove Xerces as a dependency for GenePool.

Boost Library

The Boost library is used throughout the code to take advantage of its easy-to-use functions.

Standard Template Library (STL)

The STL is used throughout the code for its various collection types.

INSTALLATION

There are several distributions of GenePool, which fall into two kinds: binary and source. The binaries have been compiled for specific operating systems. In general, if a binary distribution is available for your machine architecture and operating system (Intel x86 Linux, PowerPC Mac, SPARC Solaris, Windows, etc.), your easiest option is to try the binary distribution first. If the binaries don't work for you, please contact us with the details so we can try to remedy the problem. If there is no suitable binary distribution for you, or you would like to get the best possible performance from GenePool, you should probably install from source code.

Binary Installation for UNIX-like environments.

1.

Obtain a copy of the GenePool binary distribution file that matches your operating system and architecture and uncompress it. A command something like:

gunzip -c GenePool_bin-x86_32-linux-gnu-0.9.1.tar.gz | tar xvf -
should work on most unix machines.

2.

Execute ./configure to prepare the local installation. If you do not have permission to install programs and man pages into the /usr/local/ directory, you will need to specify an alternative install location using a command of the form ./configure --prefix=/your/alternate/path.

3.

There is nothing to build so you can skip the usual "make" step and go straight to executing make install to install the executables and man pages.

You should now be ready to run GenePool. Two caveats: to run the executables, the directory where you installed them (/usr/local/bin by default) needs to be in your PATH environment variable; and to see the man pages, the directory where you installed them (/usr/local/man by default) needs to be in your MANPATH. If you can't run the programs or see the man pages, you may need to have your system administrator help you set up your PATH and MANPATH environment variables.

Binary Installation for Windows environments.

1.

Obtain a copy of the GenePool binary distribution file that matches your operating system and architecture and then uncompress the file.

2.

Locate the directory where you uncompressed the file and double click on gpgui_Setup.msi.

3.

Select a folder to install GenePool to.

After installation, the folder should contain both GenePool.exe and gpgui.exe. GenePool.exe can be run from the DOS prompt. gpgui.exe is a graphical user interface that allows users to create configuration files and run the extraction, normalization and analysis processes.

Source Code Installation for UNIX-like environments.

To compile GenePool from source, you will need 4 source code distributions: GenePool, Affymetrix Fusion SDK, Boost, and Apache Xerces XML parser (C-version). Full instructions for obtaining each are given below. If you are not compiling on Linux, you may also need a copy of the GNU Autotools (autoconf and automake) so that you can construct a valid configure script (see step 4 below).

At some point during the Affymetrix download process, you will be required to log into the Affymetrix Developer Network (ADN), which means you must register for an ADN account if you haven't already. Registering for the ADN is free and is probably a good idea if you regularly analyze Affymetrix data, since the ADN pages contain useful data files and software as well as forums where Affymetrix software developers will answer questions. You will also have to accept the license terms for the Affymetrix Fusion library or the download will be blocked.

1.

Obtain a copy of the GenePool source code distribution file and uncompress it. A command something like:

gunzip -c GenePool-0.9.1.tar.gz | tar xvf -
should work on most unix machines.

2.

Obtain and compile a copy of the Affymetrix Fusion SDK and the "C" version of the Apache Xerces XML parser as detailed in the Compiling the Affymetrix Fusion SDK section below. The Fusion and Xerces source code will be used to create a Fusion library file (libfusion.a), and we will need that library plus the original Fusion and Xerces header files (.h) during compilation and linking of the GenePool executable.

3.

Download and install the Boost library from http://www.boost.org.

4.

The GenePool source code distribution contains a configure script, but it was built on an OS X machine, so if you are compiling on any other platform you may need to regenerate this script to have it correctly tailored to your platform. To do this you will need a copy of autoconf and automake, which are part of the GNU Autotools. Assuming you have these tools installed, all you need to do is execute autoreconf, which will read the configure.ac file and regenerate configure.

5.

Execute ./configure to prepare the local installation for compilation. If you do not have permission to install programs and man pages into the /usr/local/ directory, you will need to specify an alternative install location using a command of the form ./configure --prefix=/your/alternate/path.

6.

Execute make to compile and link the source code.

7.

Execute make install to install the executables and man pages.

You should now be ready to run GenePool. Two caveats: to run the executables, the directory where you installed them (/usr/local/bin by default) needs to be in your PATH environment variable; and to see the man pages, the directory where you installed them (/usr/local/man by default) needs to be in your MANPATH. If you can't run the programs or see the man pages, you may need to have your system administrator help you set up your PATH and MANPATH environment variables.

Compiling the Affymetrix Fusion SDK

You should already have completed step 1 of the Source Code Installation section above, so you should already have a directory containing the uncompressed source code for GenePool. To create a compiled Affymetrix Fusion SDK library ready for linking with the GenePool source code:

1.

Go to the Affymetrix website (http://www.affymetrix.com), and click on the Support tab at the top of the page. On the next page, click on the Developer Network link from the menu of links on the left side of the page. This will take you to the home page of the Affymetrix Developer Network (ADN) where a link to the Fusion SDK is available. When you reach the download page, you'll want the "Full SDK".

2.

Copy the Fusion SDK distribution file (usually called something like affy-fusion-release-107.zip) into the GenePool source code directory and unzip it. This should create a directory called affy/, in which case you can safely skip step 3. If unzipping the Fusion distribution creates a directory called cvs-head or any name other than affy/, you will need to do step 3.

3.

Edit the SDK_DIR variable in Makefile.FusionSDK so that it points to the "root" of the Affymetrix Fusion code that was uncompressed in step 2. The "root" directory is called sdk/ and it should contain a number of subdirectories including calvin_files/, files/, and file_formats/. You may have to browse through the Fusion distribution to find the sdk/ directory.

4.

Go to the Apache Xerces XML parser website (http://xerces.apache.org), and click on the Xerces C link in the menu on the left margin of the page. On the next page, click on the Download link in the menu on the left margin of the page. You should now be on the Download page for the C version of the Xerces XML parser so scroll down until you find a section titled Current Source Releases of Xerces-C. You can download the .zip or .tar.gz file but we’ll assume you took the .tar.gz version.

5.

Place the Xerces distribution file (usually called xerces-c-current.tar.gz) inside the GenePool source code directory and uncompress it. A command something like:

gunzip -c xerces-c-current.tar.gz | tar xvf -
should work on most unix machines.

6.

Edit the XERCES_ROOT variable in Makefile.FusionSDK so that it points to the "root" of the xerces-c code that was uncompressed in step 5. The xerces directory name usually incorporates the version number (for example xerces-c-src_2_7_0/) so you are almost certainly going to have to edit the default xerces directory that appears in Makefile.FusionSDK.

7.

Execute make --file=Makefile.FusionSDK, which will compile and link the Fusion and Xerces code and create a libfusion.a library that the genepool executable can be linked against. This process could take up to 10 minutes depending upon the power of your CPU.

Source Code Installation for Windows environments.

The Windows version of genepool can be compiled using Microsoft Visual C++ 2005. The entire project directory is included. A copy of the Fusion library is included with the project and is located under the ext directory. The Boost library will need to be downloaded and installed from http://www.boost.org in order to compile the code on Windows.

FILE FORMATS

The GenePool binary processes many different plain text files. This section provides a brief outline of the role and format of each of these files. Unless specified otherwise, all plain text files should be tab-delimited and should have Unix-style line endings (a single "LineFeed" character).

1. Extraction ini file

This file is read by the extraction process (-e filename.ini) and describes the extraction to be performed. It is divided into blocks that describe the experiment and the platforms used in it. These blocks are named [EXPERIMENT] and [PLATFORM]. Only one .udf file is generated as a result of an extraction process.

The [EXPERIMENT] block contains the name of the experiment, a description of the experiment and the directory where all of the data files are held.

       Name        = Name of the experiment
       Description = A brief description of the experiment
       Directory   = The directory where all of the data files are kept

The [PLATFORM] block defines the platform for the genotype arrays used in the experiment.

Illumina and Affymetrix differ in how raw data is formatted: Affymetrix uses binary CEL files and Illumina uses text files. The format of the [PLATFORM] block in the ini file therefore differs between the two vendors. Examples are given in this header below. If more than one platform is used during a data extraction, give each block a different tag to keep them logically separate, such as [PLATFORM1] and [PLATFORM2].

       Vendor             = Illumina or Affymetrix (Required)
       AnnotationFileName = Annotation file; must be tab-delimited as VendorID, rs_ID, Chromosome, Position
       CDFFileName        = Mapping file (used by Affymetrix only)

2. Analysis ini file

This file is read by the analysis process (-a filename.ini) and contains blocks that describe how the analysis will be carried out. These blocks are [DATA], [GROUP], [STAGE], and [ANALYSIS]. Currently an analysis file uses only one .udf file, which is named in the [DATA] block.

The [DATA] block contains the full file path of the UDF file. A UDF file is required in order to run an analysis.
The [GROUP] block defines a group of chips contained in the UDF file.
The [STAGE] block defines a single stage of the data analysis.
The [ANALYSIS] block brings together all the defined blocks for an analysis.

The parameters for the [DATA] block are:

       UDF = The full path and name of the .udf file. (Required)

The parameters for the [GROUP] block are:

       NAME = User-defined name of the group. (Required)
       POOL = Average number of individuals in the pool. (Required for SINGLEMARKER and MULTIMARKER)
       CHIP = The name of the chip. (Required)

The parameters for the [STAGE] block are:

       NAME             = User-defined name of the stage. (Required)
       METHOD           = SILHOUETTE, CENTROID, TTEST, SINGLEMARKER or MULTIMARKER. (Required)
       DISTANCE         = EUCLIDIAN, MANHATTAN or MODIFIED-MANHATTAN. (Required for SILHOUETTE and CENTROID)
       RAS              = NORMAL, K-CORRECTION or ARCTAN. (Required)
       FILTER           = Number of top-scoring SNPs passed to the next stage. (User defined)
       LDFILE           = Full path to the LD file. (Required only for MULTIMARKER)
       K-CORRECTIONFILE = Full path to the K-Correction file. (May be left out if RAS is not K-CORRECTION)
       OUTPUTFILE       = Full path and filename for stage output. (Required)

The parameters for the [ANALYSIS] block are:

       NAME   = User-defined name of the analysis.
       GROUP1 = Name of the first user-defined group.
       GROUP2 = Name of the second user-defined group.
       STAGE1 = Name of a user-defined stage.
       RUN    = ON or OFF. Determines whether this analysis is run.
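The conditional requirements in the [STAGE] parameter list can be checked before running an analysis. The sketch below is not part of GenePool: the function name is hypothetical, the rules are transcribed from the list above, and it assumes the K-Correction file is needed whenever RAS is K-CORRECTION. It takes the stage as a plain dictionary and does not parse the .ini file itself.

```python
# Hypothetical helper: the rules are transcribed from the [STAGE] parameter
# list above; the function and its names are not part of GenePool.
ALWAYS_REQUIRED = ("NAME", "METHOD", "RAS", "OUTPUTFILE")


def missing_stage_keys(stage):
    """Return the required [STAGE] parameters that are absent or empty."""
    required = list(ALWAYS_REQUIRED)
    method = stage.get("METHOD", "").upper()
    ras = stage.get("RAS", "").upper()
    if method in ("SILHOUETTE", "CENTROID"):
        required.append("DISTANCE")
    if method == "MULTIMARKER":
        required.append("LDFILE")
    if ras == "K-CORRECTION":
        # Assumption: the K-Correction file is needed exactly in this case.
        required.append("K-CORRECTIONFILE")
    return [key for key in required if not stage.get(key)]


stage = {
    "NAME": "stage1",
    "METHOD": "MULTIMARKER",
    "RAS": "NORMAL",
    "OUTPUTFILE": "/tmp/stage1.txt",  # placeholder path
}
print(missing_stage_keys(stage))  # prints ['LDFILE']
```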

3. AnnotationFile

The user can optionally supply this file which contains annotation information about the SNPs on the arrays. The file contains 4 columns:

SNPId dbSNPId Chromosome Base
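A file in this four-column layout can be loaded with a few lines of Python. This is a sketch under assumptions not stated on this page: that the file has no header line and that the Base column is an integer position. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def read_annotation(handle):
    """Map each vendor SNP id to (dbSNP id, chromosome, base position)."""
    annotation = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or short lines
        snp_id, dbsnp_id, chromosome, base = row[:4]
        annotation[snp_id] = (dbsnp_id, chromosome, int(base))
    return annotation


# Two made-up rows, for illustration only.
sample = "SNP_0001\trs0000001\t1\t12345\nSNP_0002\trs0000002\tX\t67890\n"
annotation = read_annotation(io.StringIO(sample))
print(annotation["SNP_0002"])  # prints ('rs0000002', 'X', 67890)
```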

4. K-Correction File

This file is only useful for Affymetrix analyses. It contains average allele frequencies for AA, AB and BB calls for every probe quartet on a given chip and allows for the calculation of quartet-specific k-correction factors. Because the number of quartets differs between platforms, this file will contain a variable number of columns; however, the general pattern is:

SNPId NoOfQuartets Q1_AA Q1_AB Q1_BB ... QN_AA QN_AB QN_BB

At the time of this writing, it appears that there is no definitive formula/algorithm for the k-correction factor, and only a single RAS calculation formula that includes k-correction factors has been implemented: k*A/(A+B). This RAS calculation method can be specified within the analysis configuration file, along with the name of the file of k-correction factors. Sometimes it is not possible to provide values for AA, AB and BB for every quartet. The current implementation looks only at the AB correction factor.
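As an illustration of the row layout and the k*A/(A+B) formula, here is a minimal sketch. It assumes every AA, AB and BB value is present and numeric, which the text notes is not always the case. How GenePool derives k from the values in the file is not described here, so k is simply passed in. The function names and the sample row are made up.

```python
def parse_kcorrection_row(line):
    """Split one tab-delimited row into (snp_id, quartets).

    quartets is a list of (AA, AB, BB) tuples, one per probe quartet.
    """
    fields = line.rstrip("\n").split("\t")
    snp_id = fields[0]
    n_quartets = int(fields[1])
    values = fields[2:]
    if len(values) != 3 * n_quartets:
        raise ValueError(
            f"{snp_id}: expected {3 * n_quartets} values, got {len(values)}"
        )
    quartets = []
    for i in range(n_quartets):
        aa, ab, bb = (float(v) for v in values[3 * i:3 * i + 3])
        quartets.append((aa, ab, bb))
    return snp_id, quartets


def ras_k_corrected(a, b, k):
    """Evaluate k*A/(A+B) for intensities A and B (A+B must be non-zero)."""
    return k * a / (a + b)


# Made-up row with two quartets, for illustration only.
row = "SNP_0001\t2\t0.9\t0.5\t0.1\t0.8\t0.4\t0.2\n"
snp_id, quartets = parse_kcorrection_row(row)
print(snp_id, quartets)
print(ras_k_corrected(300.0, 100.0, 0.5))  # prints 0.375
```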

5. Linkage Disequilibrium File

This file is needed when running a multimarker analysis. The file contains 3 columns:

dbSNPId1 dbSNPId2 R-Square

6. Ras Means File

This file, produced by the --RasMeans option (see OPTIONS), contains the mean RAS values for the SNPs on the arrays. The SNP_ID column contains the vendor ID of the SNP. Each File column contains the mean of the RAS values for that SNP_ID in one data file. The number of File columns depends on the .udf file:

SNP_ID File1 File2 File3 File4

7. Centroid and Silhouette Output Files

The Centroid and Silhouette analyses produce output files in the same format. The Cases and Controls columns give the number of data points in each group for that SNP. The file contains 7 columns:

VendorID rs_ID Chromosome Locus Cases Controls Score

8. Multimarker Output File

This output file is sorted in descending order by the Multimarker score. The LDFlag column indicates whether Linkage Disequilibrium data was found: 1 means that LD data existed for that SNP, while 0 means that it did not. The file contains 8 columns:

VendorID rs_ID Chromosome Locus Singlemarker Multimarker LDFlag R-Square

OPTIONS

-e config_file

This option uses the settings in the extraction configuration file to run the extraction and create a .udf file.

-a config_file

This option uses the settings in the analysis configuration file to analyze the data contained within the .udf file.

--TextHeader source_file target_file

This option outputs the contents of a .udf file header into a text file. This option takes two arguments: the name of the .udf file and the name and path of the output text file.

--TextFile source_file target_file

This option outputs the contents of the entire .udf file into a text file. This option takes two arguments: the name of the .udf file and the name and path of the output text file.

-n type source_file target_file

This option performs normalization on a .udf file and outputs the results to another .udf file.

       Normalization Types:
            1 => divide by the mean of the channel
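The arithmetic behind normalization type 1 can be sketched as follows. This shows only the "divide by the mean of the channel" step on a plain list of intensities. How genepool reads a channel from the .udf file and stores the normalized values is not described here, and the function name is hypothetical.

```python
def normalize_by_channel_mean(intensities):
    """Divide every intensity in one channel by that channel's mean."""
    if not intensities:
        raise ValueError("channel has no intensity values")
    mean = sum(intensities) / len(intensities)
    if mean == 0:
        raise ValueError("channel mean is zero; cannot normalize")
    return [value / mean for value in intensities]


# Made-up intensities with a mean of 200.0.
print(normalize_by_channel_mean([100.0, 200.0, 300.0]))  # prints [0.5, 1.0, 1.5]
```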

--RasMeans udf_file

This option outputs the RAS mean for each SNP on a chip. This option takes one argument: the name of the .udf file. The output is a tab-delimited file in which all the SNPs from a platform are listed, with one column for each original data file that contained the probe values.

--trace

Enable diagnostic tracing.

KNOWN ISSUES

Compiler limitations

The target compilation environments are gcc and Microsoft Visual C++. No other compilers were used in the development of this software.

Data clipping

The original intensity data for each chip feature is stored in the Affymetrix CEL files as a 4-byte floating point number. In the .udf binary data files produced by the -e option each of these intensity values has been converted into a 2-byte unsigned integer meaning that the intensity has been rounded to a whole number and that values above 65535 cannot be stored.
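The effect of this conversion can be illustrated with a short sketch. Two details are assumptions, since the text above does not specify them: that values above 65535 saturate at 65535 (as the heading "Data clipping" suggests), and that rounding is to the nearest whole number. Python's round() sends exact halves to the nearest even integer, which may differ from what genepool does.

```python
UINT16_MAX = 65535  # largest value a 2-byte unsigned integer can hold


def to_uint16(intensity):
    """Round a floating point intensity and clamp it to the 0..65535 range.

    Saturation at the limits is an assumption made for this sketch.
    """
    rounded = int(round(intensity))
    return max(0, min(UINT16_MAX, rounded))


for value in (1234.56, 65535.4, 70000.0):
    print(value, "->", to_uint16(value))
```

Under these assumptions, an intensity of 70000.0 and one of 65535.4 both end up stored as 65535, so differences between very bright features are lost.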

TO DO

Process Multiple Unified data files at a time

The analysis process can currently process data from only one .udf file. The next release will enable the user to specify multiple .udf files in the analysis .ini file.

Unit tests

The prototype Perl scripts that GenePool is based on had unit tests to ensure that, as we changed and extended the programs, the underlying calculations were not affected. We have manually checked GenePool results against the output from the prototypes; however, to maintain our sanity, we need to roll the unit testing feature forward into the C/C++ version.

Quality metrics

We need "chips and SNPs" quality scores and a multi-level method to selectively exclude some chips/SNPs/samples from a given analysis. All exclusions effectively just drop SNP scores but at different levels: a SNP-level exclusion would drop all scores for a given SNP across all samples; a Sample-level exclusion would drop all SNPs for all chips for the given sample; and a chip-level exclusion would drop all SNPs on a given chip (i.e. some SNPs for a given sample).

GUI for OS X and Linux platforms

Create binary executables for the OS X and Linux versions of GenePool.

Add additional reporting options

Add functionality to allow the user to specify sorting options for reports and to create sliding window reports.

AUTHORS

Sotiris Mitropanopoulos <smitropa@tgen.org>
John Pearson <jpearson@tgen.org>
David Craig <dcraig@tgen.org>
Nils Homer <nhomer@tgen.org>
Alexis Christoforides <achristoforides@tgen.org>
James Long <jlong@tgen.org>

COPYRIGHT

GenePool is copyright 2006-2008 by The Translational Genomics Research Institute. All rights reserved. This License is limited to, and you may use the Software solely for, your own internal and non-commercial use for academic and research purposes. Without limiting the foregoing, you may not use the Software as part of, or in any way in connection with the production, marketing, sale or support of any commercial product or service. For commercial use, please contact licensing@tgen.org. By installing this Software you are agreeing to the terms of the LICENSE file distributed with this software.

In any work or product derived from the use of this Software, proper attribution of the authors as the source of the software or data must be made. The following URL should be cited:

http://bioinformatics.tgen.org/software/genepool/



  File last modified: Wed Apr 2 11:45:26 2008