===================== INSTRUCTIONS ====================== This document contains instructions for compiling and running SIMPLE (Sequential Imputation for Multi-Point Linkage Estimation). SIMPLE uses the same input files used in GENEHUNTER and takes similar commands. SIMPLE is made up of two programs written in C. The first program is 'simple' which produces importance sampling weights and (optionally) simulated in-phase genotypes (if NPL or QTL analysis) and likelihood (if LOD analysis) ratios. This output written to the screen which may be directed to a file. The second program, 'scan', reads in the output from simple and produces the desired statistics (which are written to the screen). Both programs take the same commands. /********* * Set up * *********/ untar and unzip SIMPLE.tar.gz >gzip -d < SIMPLE.tar.gz | tar xvf - This will produce a directory called /SIMPLE in the current directory. /SIMPLE contains: Makefile file, simple (compiled), scan (compiled), /src directory, /check directory and instruct.txt file (this document). Makefile is used to compile simple and scan. /src directory contains the relevant code. /check directory contains sample pedigree files to run SIMPLE and ensure that it was compiled correctly. /************ * Compiling * ************/ simple and scan were compiled on a Solaris 7. To recompile them type >make simple scan /******************************** * Basic form for running simple * ********************************/ >simple [command file] > 'simple output file' Basic form for running scan >scan [command file] < 'simple output file' > 'scan ouput file' The command file allows for changes from the default setup. As in GENEHUNTER the command file must at least contain the `load' and `scan' commands to specify the input files. The following commands may be used, with the defaults given in []. /*********** * Commands * ***********/ >load [linkloci.dat] Gives the GENEHUNTER compatible marker-locus data (allele frequencies for each genetic marker, frequency and penetrance information for the disease). The format of this file must be identical to the Linkage parameter file (output from the PREPLINK program). >scan [ped.dat] Gives the location of the file with the pedigree, marker, and disease data. The pedigree should be in the Linkage pedigree input format (before running MAKEPED or doing any preprocessing!). Each line of this file must have the following structure: 3 12 8 9 1 2 1 1 2 ... 0 0 1.269 (a) (b) (c) (d) (e) (f) (g) (h -------) (i) (a) pedigree name (b) individual ID # (c) father's ID # (d) mother's ID # (e) sex (1=MALE, 2=FEMALE) (f) affectation status (0=UNKNOWN, 1=UNAFFECTED, 2=AFFECTED) (g) liability class (OPTIONAL) - classes specified in marker data file (h) marker genotypes (i) quantitative trait values A `0' in any of the disease phenotype or marker genotype positions (as in the the genotypes for the last marker above) indicates missing data. (See the example pedigree file in /check.) A non-numeric character for quantitative traits is the default for missing values for quantitative traits. To change this to a numeric character use the command: >missingQT [ # ] where `#' is the numerical character for a missing value in the quantitative traits data. You should only enter one pedigree at a time, though in theory the software could handle more than one. However this will put a strain on the memory. In general, you will be able to handle bigger pedigrees if they are run one at a time. >maxiter [ 5000 ] The number of imputed data sets to be generated by the sequential imputation procedure. The null distribution is estimated with 20*maxiter iterations. >increment step [ 5 ] Acts the same way as increment step does in GENEHUNTER. However, simple does not currently calculate scores outside the marker map. This should be changed in the future. In addition the increment scan command of GENEHUNTER has not been implemented, and may not be in the future versions. >seed [ 123456789 ] As simple is a Monte Carlo procedure, a starting point for the random number generator is required. While a default value is given, this should be set for every run. A valid integer between 0 and 2147483647 is required. SIMPLE automatically drops a random integer in a file called 'seed' after execution. This file may be appended to the command file to accomodate changing the seed. >debug There is a debug mode simple which allows for intermediate results to be examined. It was mainly added originally to allow easier debugging as the code was being developed. The default is not to debug. If you wish to use this procedure, inclue 'debug' in the command file. >process order [1 2 ... nmarkers] The order that the markers are processed can be set. The default is to process the markers in decreasing order based on the number of alleles (Most alleles processed first, least alleles processed last). If two or more markers have the same number of alleles, the lowest numbered marker gets done first. If a disease locus is to be processed, it is always processed last. >interference [ 19 0 ] Currently only one interference model is included in SIMPLE. The chisquare model with intensity m is indicated by >interference 19 m Note that a chisquare model with m=0 is the no interference model. This is the default setting. In the future, other interference models will be added. The only change that will be observed by the user is that additional options to this command will become available. >analysis [ BOTH ] As in GENEHUNTER one may input 'BOTH', 'NPL' or 'LOD'. In addition, one may input 'NONE'. This may be useful if you just wish to conduct a QTL analysis. >score [ PAIRS ] As in GENEHUNTER one may input PAIRS or ALL. >units [ ] One may input 'cM' or 'rec-frac'. The default, as in GENEHUNTER, is to assume that the distances are in recombination fractions unless at least one distance is greater than .5. >peel [ ] Indicate peeling order in file. This is only necessary if the pedigree contains loops. The file should have 2 columns. The first column contains the cut set. The second column contains the peel set. If there is more than one member in any of the sets then separate them with commas. The order of the cut and peel sets specifies the peeling order. The last cut set, the root, should be 0. An example peeling file is given below. >849, 845 853 >844, 848 850 >844, 848 851 >839, 840 843 >839, 840 846 >839, 840 847 >845, 844 839,840 >849, 848 841,842 >849, 845, 844 848 >849, 845 844 >849 845 >0 849 >total stat As in GENEHUNTER, total stat produces the scores for the combined pedigrees. The default is to produce the scores for the pedigrees separately. >dumppairs Print out the expected alleles shared IBD for each relative pair. This is printed after the scores. The rows correspond to the location and the columns correspond to the pair. The header lists the PIDs of the pairs that each column corresponds to. The first element of each row is labeled by the location that it corresponds to. Here is an example of the output from dumppairs: >. >. >. >=========================== >Expected alleles shared IBD >=========================== >pedigree 1 >pair: 1 2 3 ... >mem1: 839 839 839 ... >mem2: 843 844 845 ... > 0.00 1.000000e+00 1.000000e+00 1.000000e+00 ... > 1.87 1.000000e+00 1.000000e+00 1.000000e+00 ... > 3.73 1.000000e+00 1.000000e+00 1.000000e+00 ... >. >. >. The first pair in pedigree 1 contained the members 839 and 843 who had expected alleles shared IBD of 1, etc... >dumpprobs Print out the probability of sharing 0,1 2 alleles IBD for each relative pair. This is printed after the scores. The rows correspond to the location and the columns correspond to the pair. The header lists the PIDs of the pairs that each column corresponds to. The first element of each row is labeled by the location that it corresponds to and in parentheses the number of alleles shared IBD. Here is an example of the output from dumpprobs: >. >. >. >=========================== >Expected alleles shared IBD >=========================== >pedigree 1 >pair: 1 2 3 ... >mem1: 839 839 839 ... >mem2: 843 844 845 ... > 0.00 (0) 0.000000e+00 0.000000e+00 0.000000e+00 ... > 0.00 (1) 1.000000e+00 1.000000e+00 1.000000e+00 ... > 0.00 (2) 0.000000e+00 0.000000e+00 0.000000e+00 ... > 1.87 (0) 0.000000e+00 0.000000e+00 0.000000e+00 ... >. >. >. The first pair in pedigree 1 contained the members 839 and 843 who had probability 1 of sharing 1 allele IBD... >dumpscores Print out the raw scores simulated conditioned on the data. The output is printed out before the estimated scores are printed. The rows correspond to each iteration. The first column is iteration number, the second column is the weight and the rest of the columns correspond to the simulated scores at each location. Here is an example output from dumpscores: >1 4.809104e-14 2.500000e-01 2.500000e-01 5.000000e-01 ... >2 4.809104e-14 2.500000e-01 2.500000e-01 5.000000e-01 ... >3 4.809104e-14 2.500000e-01 5.000000e-01 5.000000e-01 ... >. >. >. The first three iterations are shown. The weight for the first iteration is 4.809104e-14. The score simulated in the first iteration at the first location is 2.500000e-01, etc... /*********************** * Example command file * ***********************/ The following example command file includes all possible commands. In most cases the default values are used. This would be from an example with 5 markers and a disease. >load linkloci.dat >scan ped.dat >analysis BOTH >score PAIRS >maxiter 5000 >increment step 5 >seed 123456789 >process order 1 2 3 4 5 >interference 19 0 >units cM >total stat >dumppairs >dumpscores At the end of the output from simple and scan are written the the commands that were used for the session. (See the example output in /check.) /**************** * Check results * ****************/ /check contains an example data set in ped.dat and linkloci.dat. After you compile SIMPLE you can check the output from simple with out and the output from scan with scores. The command file 'cmd' contains the commands necessary to do the check and also has other commands commented out which one may use. This command file may serve as a template. /**************************** * Interpretation of results * ****************************/ Appended at the end is a sample output file after running SIMPLE on the sample data provided in the 'check' directory. The command file was the following: ------------ Command file ------------ >load linkloci.dat >scan ped.dat We see that the command file simply indicated which locus file to load and which pedigree file to scan. The default statistics are both NPL-PAIRS and LOD scores. (See list of commands used appended at the end of the output file.) The first statistics that are reported (see file below) are the estimated null mean and variance (under the columns 'mean' and 'var', respectively). Next are the NPL scores at the different locations. The first column, `ped', indicates which pedigree these statistics correspond to. The second column, `pos', indicates the position that the statistics were calculated at in cM. The third column, `S', are the raw scores at those positions. The fourth column, `Z', are the standardized scores at those positions. The next to last column, `SE(Z)' are the monte carlo SEs of the standardized scores. The last column are the estimated exact p-values. The LOD scores follow the NPL scores. The first column `pos' are the locations in cM where the scores were calculated. The second column, `LOD', are the estimated LOD scores. And the last column, `SE(LOD)', are the Monte Carlo SEs of the estimated LOD scores. ------------------- Example Output File ------------------- >Null Scores >ped mean var >*** ********* ********* > 1 1.906368 0.597479 > >ped pos S Z +/- SE(Z) p-valu >*** ***** ********* ********* *** ********* ******* > 1 0.00 1.625000 -0.364010 +/- 0.000000 0.6259 > 1 4.00 1.977075 0.091475 +/- 0.007706 0.2502 > 1 8.00 2.316438 0.530513 +/- 0.009376 0.2502 > 1 12.00 2.667263 0.984381 +/- 0.009395 0.2502 > 1 16.00 3.011613 1.429872 +/- 0.007738 0.2502 > 1 20.00 3.375000 1.899992 +/- 0.000000 0.1236 > 1 24.00 3.016950 1.436777 +/- 0.007702 0.2502 > 1 28.00 2.675387 0.994892 +/- 0.009369 0.2502 > 1 32.00 2.321700 0.537321 +/- 0.009394 0.2502 > 1 36.00 1.982575 0.098590 +/- 0.007749 0.2502 > 1 40.00 1.625000 -0.364010 +/- 0.000000 0.6259 > 1 44.00 1.986063 0.103102 +/- 0.007780 0.2502 > 1 48.00 2.325287 0.541962 +/- 0.009394 0.2502 > 1 52.00 2.661738 0.977233 +/- 0.009403 0.2502 > 1 56.00 3.014400 1.433478 +/- 0.007728 0.2502 > 1 60.00 3.375000 1.899992 +/- 0.000000 0.1236 > > >LOD Scores >pos LOD SE(LOD) >***** ***************** > 0.00 -10000.000000 0.0000 > 4.00 -1.600278 0.0000 > 8.00 -1.426529 0.0000 >12.00 -1.427397 0.0000 >16.00 -1.602947 0.0000 >20.00 -10000.000000 0.0000 >24.00 -1.599642 0.0000 >28.00 -1.426412 0.0000 >32.00 -1.427404 0.0000 >36.00 -1.602604 0.0000 >40.00 -10000.000000 0.0000 >44.00 -1.603792 0.0000 >48.00 -1.427640 0.0000 >52.00 -1.426351 0.0000 >56.00 -1.599998 0.0000 >60.00 -10000.000000 0.0000 > >The following commands were used. >An '*' indicates that the command was found in the >command file 'cmd' and may not necessarily be the >default value. >------------------------------------------------- >load linkloci.dat* >scan ped.dat* >analysis BOTH >score ALL* >units (AUTO) cM >maxiter 5000 >increment step 5 >seed 123456789 >NO reweighting used >process order (AUTO) 1 2 3 4 5 >interference 19 m=0