=================================================================
Instructions
=================================================================

This document contains instructions and examples for compiling and running
Starting (the executable in the START package). We have compiled and
test-run the program under both Linux and Solaris 7.

------------
Installation
------------

* Untar and unzip START.tar.gz by (note that ">" is the system prompt and
  the user does not need to type it)

    > gzip -d < START.tar.gz | tar xvf -
    > cd START/

  The above commands produce a directory called START/ in the current
  directory, and enter it. The START/ directory contains the following
  three directories: Program, DATA and Gold. The Program/ directory stores
  the source files; the user does not need to understand any of them. The
  DATA/ directory contains sample inputs, on which the user can test-run
  Starting to make sure it is compiled correctly. The results should be
  compared to those in the Gold/ directory, which contains the sample
  outputs.

* To compile the program under Solaris 7, use

    > (cd Program; cp Makefile.sun Makefile; make all)

  or

* To compile the program under Linux, use

    > (cd Program; cp Makefile.linux Makefile; make all)

At this point, you should have an executable called "Starting" in the
current directory START/.

-------
Command
-------

Starting takes three input files and writes various run-time information
to the standard output (which may be redirected to a log file, as it
contains much debugging information), as well as various output files
according to the user's request. A sample command is

    > Starting locus phenotype parameter > logfile

-----------
Input Files
-----------

In the input files, fields are separated by white space(s) or newlines.

The first file, the locus file, contains the following information, to be
specified by the user line by line (an example is DATA/locus):

  1. Number of markers.
  2. Maximum length of marker names.
  3. Maximum number of alleles.
     * Note that this is requested for convenience. Our program can handle
       up to 15 alleles for a 1,544-member complex pedigree and up to 20
       alleles for a 221-member complex pedigree. It may handle more
       alleles, but the running time increases faster than quadratically.
  4. The lower bound and upper bound of the correct sum of frequencies.
     * Due to rounding errors, the sum is not expected to be exactly 1.
       If the allele frequencies of a marker sum to less than the first
       number or more than the second number, this is counted as an
       error. If there is at least one error, the program quits with
       error messages.
  5. Name of the first marker and the number of alleles for this marker.
  6... Each of these lines gives one allele and the corresponding allele
     frequency.
     * Note that the alleles should be numbered from 0 to
       (#.of.allele-1), consecutively.

  Subsequent lines give the marker name, # of alleles and allele
  frequencies for each of the other markers.

The second file, the phenotype file (e.g. DATA/phenotype), gives the
number of individuals in the pedigree on the first line. Note that the
program does not distinguish between single and multiple pedigrees,
although we recommend processing each pedigree separately. The first four
items on the second line provide names for person, father, mother and
gender (you can call them whatever you want, but they need to be separated
by white space(s)). The rest of the second line gives marker names
separated by white space(s); they need to be the same as those given in
the locus file, and in the same order. Each subsequent line gives
information for one individual, with the following conventions:

  1. The IDs of person, father and mother should be positive integers.
  2. Founders should have 0 as both father and mother.
  3. Alleles are numbered from 0 to (#.of.allele-1), with -1 denoting a
     missing allele.
  4. The two alleles of a marker for an individual are separated by "/".

The third file, the parameter file (e.g.
DATA/tune.par), gives the parameters, options, and input/output choices.
It is described line by line below (for the meaning of each parameter, see
the manuscript):

  1. Maximum # of iterations (should be a multiple of the heating cycle)
  2. The heating cycle and the # of iterations heated per cycle (usually
     1000 and 10)
  3. Temperature (usually set to 5)
  4. Relaxing parameter (usually set to 0.1)
  5. Tolerance
  6. # of intervals (nonnegative integer) and the right end point of each
     interval (nonnegative integers in increasing order)
     * If the first number is 0, the rest of the line is ignored, and the
       program searches for starting points with the tolerance level
       given in line 5. Otherwise, the tuning option is run, with the
       tolerance taken to be the second integer. The percentage of
       iterations with # of incompatible genotype-phenotype pairs falling
       into each interval will be reported, to aid in the choice of a
       more reasonable tolerance level. For example, if this line gives
           3 2 8 10
       then the intervals will be [0,2], [3,8] and [9,10], and forcing
       will be performed if # incompatible is at most 2.
  7. Output detail? (1 for yes and 0 for no)
     * Should be 1 if the tuning option is turned on.
  8. The name of the file to hold the details
     * Should occupy one line even if line 7 gives 0.
  9. Output the starting point as genotype? (1 for yes and 0 for no)
 10. The file name of the genotype output
     * Should occupy one line even if line 9 gives 0.
 11. Output the starting point as inheritance vector? (1 for yes and 0
     for no)
 12. The file name of the inheritance vector output
     * Should occupy one line even if line 11 gives 0.

     If you set lines 9 and 11 to 0 but do not turn on the tuning option,
     the genotypes will be output. If the tuning option is on, lines 9
     through 12 are ignored.
 13. The file name for the seeds
     * This file (e.g. DATA/seed.save) should contain two hexadecimal
       numbers, the first one being odd and the second nonzero. When the
       program is finished, it writes the current seed back into this
       file.
       So, if you want to reproduce results, you need to save the seed to
       another file before running the program.
 14. Number of markers to process
 15... Each subsequent line gives one marker name.
     * The marker names can be in arbitrary order and should be a subset
       of those given in the locus file. This is useful if a second
       attempt is needed for those markers on which the first attempt to
       find starting points failed.

------------
Output Files
------------

The standard output contains various run-time checking information. There
are also four kinds of output, with file names and output choices given in
the parameter file. When the tuning option is turned on, a detail file is
printed to provide various information (including error checking, missing
percentage, running time, etc.), which may aid the choice of an
appropriate tolerance level. When the tuning option is off, you can
request the output of running details, genotype vector, inheritance vector
or any combination of them. The header in each file gives clear directions
on how to read the file.

--------
Examples
--------

Do the following in the current directory START/

    >(cd DATA; cp seed.saved seed.run)
    >(cd DATA; ../Starting locus phenotype tuning.par > log.tune;)

The three input files in DATA/ are configured in the formats described
above. The locus and phenotype files give information on six markers
observed on the HOPS pedigree (see the manuscript for a brief
description):

    d23s001 d23s002 d23s005 d25s001 d23s003 d24s001

The file tuning.par sets the parameters and turns on the tuning option on
line 6 by giving

    6 1 4 8 10 20 112

See the manuscript for the selection of the interval endpoints. The
maximum number of iterations is set to 5,000. All markers are to be
processed. It took about two minutes to run this on an AMD 1800+MP
processor. Two files in DATA/ are produced: detail.tune and log.tune. If
you choose to test-run this example, make sure these files are the same as
those with the same name in the Gold/ directory.
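The comparison against Gold/ can be sketched in Python as follows. This is
only an illustration and not part of the START package: the helper name
differing_files is hypothetical, and the sketch assumes it is run from
START/ so that the produced files are in DATA/ and the reference copies
are in Gold/. Since the detail file reports running time, a reported
difference may be benign and should be inspected by hand.

```python
import filecmp
from pathlib import Path


def differing_files(run_dir, gold_dir, names):
    """Return the names whose copies in run_dir and gold_dir are missing or differ.

    Hypothetical helper for illustration; not part of the START package.
    """
    bad = []
    for name in names:
        produced = Path(run_dir) / name
        reference = Path(gold_dir) / name
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting file size and modification time.
        if not (produced.is_file() and reference.is_file()
                and filecmp.cmp(produced, reference, shallow=False)):
            bad.append(name)
    return bad


if __name__ == "__main__":
    # File names taken from the tuning example above.
    print(differing_files("DATA", "Gold", ["detail.tune", "log.tune"]))
```

An empty list means every listed file matches its Gold/ copy; any name
that is printed is either missing on one side or differs in content.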
Alternatively, you can simply view the output in the Gold/ directory.

The output detail.tune indicates that marker d23s001 contains errors. The
error family is given in log.tune in the following portion (note that
spouse 5's phenotype is inferred from his neighbors: parents, spouse(s)
and kids):

-----------------------------------------------------------------------------
--- Infer missing phenotypes based on neighbors.
  * Inferring missing phenotypes finds errors, see details below:
    NAME          Phenotype   Obs? (-1: Missing, 1:Inferred, 0:Observed)
    Person    8     -1/-1      -1
    Father    7      0/ 6       0
    Mother    6      0/ 6       0
    Spouse    5      4/ 1       1
    Kid      64      4/ 3       0
    Kid       9      1/ 6       0
    Kid     192      4/ 6       0
    Kid      82      1/ 6       0
    Total *4* kids with this spouse
--- Check: phenotype dropping after inferring pheno
=== Done checking: OK
*== Done infer: inferring missing phenotypes finds error
::::: DONE: Processing marker d23s001 :::::
-----------------------------------------------------------------------------

At the end of the run, it seems that a tolerance level of 4 should be
appropriate for markers d23s002 through d23s003. For marker d24s001, the
tolerance may have to go up to 8.

In the next run, the tolerance is set to 4, and the maximum number of
iterations is increased to 50,000. This is given in looking-1.par. We run

    >(cd DATA; ../Starting locus phenotype looking-1.par > log.look-1;)

This took 13 minutes, giving output in detail.look-1, genotype.look-1 and
inheritance-look.1. From detail.look-1, for markers d23s002, d25s001 and
d23s003, starting points were found within half a minute. However, after
exhausting the 50,000 iterations in 13 minutes, the program failed to find
starting points for markers d23s005 and d24s001.

Actually, the observed data on d23s005 contain a deliberately introduced
error, which our error checking program was not able to find. The data on
d23s005 differ from those on d23s002 in the phenotype of individual 262
(from the correct 3/6 to the wrong 2/7).
This change creates an error that can only be discovered by looking at
four generations simultaneously:

------------------------------------------------------------------
Person 252 (phenotype 8/0) and Person 240 (phenotype 2/5) have a
child 253 with missing phenotype. The marriage between 262
(phenotype 2/7) and 253 produces child 264 (missing phenotype).
264 is married to 267 (phenotype 4/8), and this marriage has a
child 268 (phenotype 4/6).
------------------------------------------------------------------

In fact, other attempts on marker d23s005 always come down to a minimum of
1 incompatible pair.

For marker d24s001, the forcing rate is 23.32%, and 8.32% of the forcing
steps result in some genotype(s) being replaced. Thus, we have a good
amount of forcing, and we may keep the tolerance and do a longer run if we
are sure the data are right. We decided to increase the number of
iterations to 500,000 and the tolerance to 8. So, we run the program on
the last marker

    >(cd DATA; ../Starting locus phenotype looking-2.par > log.look-2;)

This takes 9 minutes. A starting point is found at iteration 39,833.

----
Note
----

* Each input line should not exceed (1024-1) characters, spaces included.

---------------
Acknowledgments
---------------

We make use of part of the library in the Morgan package
(http://linkage.rockefeller.edu/soft/list.html).

----------
References
----------

See http://www.stat.ohio-state.edu/~statgen/PAPERS/START for the
manuscript describing the algorithm. Users of START should reference the
following two papers.

Luo YQ, Lin S (2003). Finding starting points for Markov chain Monte Carlo
analysis of genetic data from large and complex pedigrees. Genetic
Epidemiology 25:14-24.

Lin SL, Thompson EA, Wijsman E (1993). Achieving irreducibility of the
Markov chain Monte Carlo method applied to pedigree data. IMA J Math Appl
Med Biol 10:1-17.
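
----------
Supplement
----------

As a supplement to the description of line 6 of the parameter file, the
interval convention can be sketched in Python. This is an illustration
only, not code from the START package: the function name tuning_intervals
is hypothetical, and the reading of the line is inferred from the example
given in that description (3 2 8 10 yields [0,2], [3,8] and [9,10], with a
tolerance of 2).

```python
def tuning_intervals(fields):
    """Interpret line 6 of the parameter file, given as a list of integers.

    Hypothetical helper for illustration; not part of the START package.
    Returns (tuning_on, intervals, tolerance). When tuning is off, the
    interval list is empty and tolerance is None, meaning that the
    tolerance on line 5 of the parameter file applies instead.
    """
    count = fields[0]
    if count < 0:
        raise ValueError("the number of intervals must be nonnegative")
    if count == 0:
        # The rest of the line is ignored when the first number is 0.
        return False, [], None
    endpoints = fields[1:1 + count]
    if len(endpoints) != count:
        raise ValueError("fewer end points than the declared number of intervals")
    if endpoints[0] < 0 or any(a >= b for a, b in zip(endpoints, endpoints[1:])):
        raise ValueError("end points must be nonnegative and increasing")
    intervals = []
    left = 0
    for right in endpoints:
        intervals.append((left, right))
        left = right + 1  # the next interval starts just past this end point
    # With tuning on, the tolerance is the second integer on the line.
    return True, intervals, endpoints[0]


if __name__ == "__main__":
    print(tuning_intervals([3, 2, 8, 10]))
    # prints (True, [(0, 2), (3, 8), (9, 10)], 2)
```

Applied to the line used in tuning.par (6 1 4 8 10 20 112), the same
reading gives six intervals ending at 1, 4, 8, 10, 20 and 112.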