=================================================================
                           Instructions 
=================================================================

This document contains instructions and examples for compiling and
running Starting (the executable in the START package). We have
compiled and test-run the program under both the Linux and Solaris 7
operating systems.


------------
Installation 
------------

* Untar and unzip START.tar.gz as follows (note that ">" is the system
  prompt; the user does not need to type it)

> gzip -d < START.tar.gz | tar xvf -
> cd START/

The above commands create a directory called START/ in the current
directory and enter it. The START/ directory contains the
following three directories: Program, DATA and Gold.

The Program/ directory stores the source files; the user does not need
to understand them. The DATA/ directory contains sample inputs on
which the user can test-run Starting to make sure it is compiled
correctly. The results should be compared to the sample outputs in
the Gold/ directory.

* To compile the program under Solaris 7, use
> (cd Program; cp Makefile.sun Makefile; make all)

or
* To compile the program under Linux, use
> (cd Program; cp Makefile.linux Makefile; make all)

At this point, you should have an executable called "Starting" in the
current directory START/.


-------
Command
-------

Starting takes three input files and writes various run-time
information to the standard output (which may be redirected to a log
file, as it contains a lot of debugging information), as well as
various output files according to the user's request. A sample command is

> Starting locus phenotype parameter > logfile


-----------
Input Files
-----------

In the input files, fields are separated by white space(s) or
newlines.

The first file, the locus file, contains the following information to
be specified by the user line by line (an example is DATA/locus):

1. Number of markers.
2. Maximum length of marker names.
3. Maximum number of alleles.
   * Note that this is requested for convenience only. Our program can
     handle up to 15 alleles for a 1,544-member complex pedigree and
     up to 20 alleles for a 221-member complex pedigree. It may handle
     more alleles, but the running time grows faster than
     quadratically.
4. The lower bound and upper bound for an acceptable sum of allele
   frequencies.
   * Due to rounding errors, the sum is not expected to be exactly 1.
     If the allele frequencies of a marker sum to less than the
     first number or more than the second number, an error is
     recorded. If at least one error is recorded, the program quits
     with error messages.
5. Name of the first marker and the number of alleles for this marker.
6... Each of these lines gives one allele and the corresponding allele
   frequency.
   * Note that the alleles should be numbered consecutively from 0 to
     (#.of.allele-1).
Subsequent lines give the marker name, number of alleles and allele
frequencies for each of the other markers.
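
As an illustration of the layout above, the following Python sketch
reads a locus file and applies the checks described in items 3, 4 and
6. It is not part of START: the function name parse_locus and the
sample marker data are hypothetical, and the sketch assumes that the
fields appear in exactly the order listed above.

```python
def parse_locus(text):
    """Parse locus-file text; return (markers, errors).

    markers maps each marker name to its list of allele frequencies.
    Illustrative only: START performs its own checks internally.
    """
    tokens = iter(text.split())  # fields are split by white space or newlines
    n_markers = int(next(tokens))
    max_name_len = int(next(tokens))
    max_alleles = int(next(tokens))
    lower, upper = float(next(tokens)), float(next(tokens))

    markers, errors = {}, []
    for _ in range(n_markers):
        name = next(tokens)
        n_alleles = int(next(tokens))
        if len(name) > max_name_len:
            errors.append(f"{name}: name longer than {max_name_len} characters")
        if n_alleles > max_alleles:
            errors.append(f"{name}: more than {max_alleles} alleles")
        freqs = []
        for expected in range(n_alleles):
            allele, freq = int(next(tokens)), float(next(tokens))
            if allele != expected:  # alleles must run 0, 1, 2, ...
                errors.append(f"{name}: expected allele {expected}, found {allele}")
            freqs.append(freq)
        total = sum(freqs)
        if not lower <= total <= upper:
            errors.append(f"{name}: frequencies sum to {total:.4f}")
        markers[name] = freqs
    return markers, errors


# Hypothetical two-marker input; the second marker sums to 0.9 on purpose.
sample = """2
8
3
0.99 1.01
d23s001 2
0 0.4
1 0.6
d23s002 3
0 0.5
1 0.3
2 0.1
"""
markers, errors = parse_locus(sample)
```

Here errors holds a single message about d23s002, since its
frequencies fall below the lower bound of 0.99.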

The second file, the phenotype file (e.g. DATA/phenotype), gives the
number of individuals in the pedigree on the first line. Note that the
program does not distinguish between single and multiple pedigrees,
although we recommend processing each pedigree separately. The first
four items on the second line provide names for person, father, mother
and gender (you can call them whatever you want, but they need to be
separated by white space(s)). The rest of the second line gives marker
names separated by white space(s); they need to be the same as those
given in the locus file, and in the same order. Each subsequent line
gives information for one individual, with the following conventions:

1. The IDs of person, father and mother should be positive integers,
   except for the parents of founders (see 2).
2. Founders should have 0 as both father and mother.
3. Alleles are numbered from 0 to (#.of.allele-1), with -1 denoting
   a missing allele.
4. The two alleles of a marker for an individual are separated by "/".
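
The conventions above can be sketched in Python as follows. This is an
illustration only, not code from START. The function name
parse_phenotype, the column names and the gender codes in the sample
are hypothetical. The sketch also assumes that each genotype is
written without spaces (e.g. 0/6) and treats a record with exactly one
parent set to 0 as a mistake, which the text above does not state
explicitly.

```python
def parse_phenotype(text):
    """Parse phenotype-file text; return (marker_names, people, errors)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_people = int(lines[0].split()[0])
    header = lines[1].split()
    marker_names = header[4:]  # first four names: person, father, mother, gender
    records = lines[2:]

    people, errors = {}, []
    if len(records) != n_people:
        errors.append(f"expected {n_people} individuals, found {len(records)}")
    for ln in records:
        fields = ln.split()
        person, father, mother = (int(x) for x in fields[:3])
        gender = fields[3]  # coding is not specified here; kept as a string
        if person <= 0 or father < 0 or mother < 0:
            errors.append(f"{person}: IDs must be positive (0 for founder parents)")
        if (father == 0) != (mother == 0):  # assumption: both or neither
            errors.append(f"{person}: only one parent is 0")
        genotypes = {}
        for marker, field in zip(marker_names, fields[4:]):
            first, second = (int(x) for x in field.split("/"))
            genotypes[marker] = (first, second)  # -1 marks a missing allele
        if len(genotypes) != len(marker_names):
            errors.append(f"{person}: wrong number of marker fields")
        people[person] = {"father": father, "mother": mother,
                          "gender": gender, "genotypes": genotypes}
    return marker_names, people, errors


# Hypothetical three-person pedigree with two markers.
sample = """3
id fa mo sex d23s001 d23s002
1 0 0 1 0/1 -1/-1
2 0 0 2 1/1 2/3
3 1 2 1 0/1 2/-1
"""
marker_names, people, errors = parse_phenotype(sample)
```

The marker names returned here can then be compared with those read
from the locus file to confirm that they agree in name and order.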
4. Two alleles of a marker for an individual are separated by "/". 

The third file, the parameter file (e.g. DATA/tune.par), gives the
parameters, options, and input and output choices. Line by line
(for the meaning of each parameter, see the manuscript):

1. Maximum # of iterations (should be a multiple of the heating cycle)
2. The heating cycle and the # iterations heated per cycle 
   (usually 1000 and 10)
3. Temperature (usually set to 5) 
4. Relaxing parameter (usually set to 0.1) 
5. Tolerance 
6. # intervals (nonnegative integer) and the right end point of each
   interval (nonnegative integers in increasing order)
   * If the first number is 0, the rest of the line is ignored, and the
     program searches for starting points with the tolerance level
     given on line 5.
     Otherwise, the tuning option is run, with the tolerance taken to
     be the second integer on the line. The percentage of iterations
     whose # incompatible genotype-phenotype pairs falls into each
     interval is reported, to aid in the choice of a more reasonable
     tolerance level. For example, if this line gives
		  3 2 8 10 
     then the intervals will be [0,2], [3,8] and [9,10], and forcing
     will be performed if #incompatible is at most 2.
7. Output detail?  (1 for yes and 0 for no) 
   * Should be 1 if the tuning option is turned on.  
8. The name of the file to hold the details
   * Should occupy one line even if line 7 gives 0.  
9. Output the starting point as genotype? (1 for yes and 0 for no)
10. The file name of the genotype output. 
    * Should occupy one line even if line 9 gives 0.  
11. Output the starting point as inheritance vector? (1 for yes and 0 for no)
12. The file name of the inheritance vector output
    * Should occupy one line even if line 11 gives 0.
      If you set lines 9 and 11 to 0 but do not turn on the tuning
      option, the genotypes will be output. If the tuning option is
      on, the previous four lines are ignored.
13. The file name for the seeds
    * This file (e.g. DATA/seed.save) should contain two hexadecimal
      numbers, the first one odd and the second nonzero. When the
      program finishes, it writes the current seed back into this
      file. So, if you want to reproduce results, you need to save the
      seed to another file before running the program.
14. Number of markers to process 
15....Each subsequent line gives one marker name.
    * The marker names can be in arbitrary order and should be a
      subset of those given in the locus file. This is useful if a
      second attempt is needed for those markers on which the first
      attempt to find starting points failed.
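
To make the rule for line 6 of the parameter file concrete, the
Python sketch below turns such a line into the list of intervals and
the tolerance used for forcing. It is an illustration, not code from
START; the function name tuning_intervals is hypothetical, and
ignoring any fields beyond the stated number of intervals is an
assumption.

```python
def tuning_intervals(line):
    """Return (intervals, tolerance) for line 6 of the parameter file.

    intervals is a list of (left, right) pairs with inclusive end points.
    When the first number is 0 the tuning option is off, so the result is
    ([], None) and the tolerance comes from line 5 instead.
    """
    numbers = [int(tok) for tok in line.split()]
    if not numbers:
        raise ValueError("line 6 is empty")
    count, endpoints = numbers[0], numbers[1:]
    if count == 0:
        return [], None  # rest of the line is ignored
    if count < 0 or len(endpoints) < count:
        raise ValueError("fewer end points than intervals")
    endpoints = endpoints[:count]  # assumption: extra fields are ignored
    if endpoints[0] < 0 or any(b <= a for a, b in zip(endpoints, endpoints[1:])):
        raise ValueError("end points must be nonnegative and increasing")

    intervals, left = [], 0
    for right in endpoints:
        intervals.append((left, right))
        left = right + 1  # next interval starts just past this end point
    return intervals, endpoints[0]  # tolerance is the first right end point


intervals, tolerance = tuning_intervals("3 2 8 10")
```

For the line "3 2 8 10" this gives the intervals [0,2], [3,8] and
[9,10] with tolerance 2, matching the example under line 6.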
    

------------ 
Output Files 
------------ 

The standard output contains various run time checking information.
There are also four kinds of output with file names and output choices
given in the parameter file. When the tuning option is turned on, a
detailed file is printed to provide various information (including
error checking, missing percentage, running time, etc), which may aid
the choice of an appropriate tolerance level. When the tuning option
is off, you can request the output of running details, genotype
vector, inheritance vector or any combination of them. The headers in
each file give clear directions on how to read the file.

--------
Examples
--------
Do the following in the current directory START/

>(cd DATA; cp seed.saved seed.run)
>(cd DATA;  ../Starting locus phenotype tuning.par > log.tune;)

The three input files in DATA/ follow the formats described
above. The locus and phenotype files give information on six markers
observed on the HOPS pedigree (see the manuscript for a brief
description):
d23s001 d23s002 d23s005 d25s001 d23s003 d24s001
The file tuning.par sets the parameters and turns on the tuning option
on line 6 by giving
6 1 4 8 10 20 112  
See the manuscript for the selection of the interval endpoints. The
maximum number of iterations is set to 5,000. All markers are to
be processed.

It took about two minutes to run this on an AMD 1800+MP processor. Two
files are produced in DATA/: detail.tune and log.tune.
If you choose to test-run this example, make sure these files are the
same as those with the same names in the Gold/ directory. Otherwise,
you can simply view the output in the Gold/ directory.
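
One way to carry out this comparison is sketched below in Python,
using the standard filecmp module to compare files byte for byte. The
function name compare_with_gold is hypothetical, and the directory
names are taken from the layout described under Installation. Since
the detail file reports running time, a reported mismatch may be
limited to timing lines and is worth checking by eye.

```python
import filecmp
from pathlib import Path


def compare_with_gold(names, data_dir="DATA", gold_dir="Gold"):
    """Return the file names whose copy in data_dir is missing from either
    directory or differs from the reference copy in gold_dir."""
    mismatched = []
    for name in names:
        produced = Path(data_dir) / name
        reference = Path(gold_dir) / name
        if not (produced.is_file() and reference.is_file()):
            mismatched.append(name)
        elif not filecmp.cmp(produced, reference, shallow=False):
            # shallow=False forces a comparison of the file contents
            mismatched.append(name)
    return mismatched
```

Run from the START/ directory, the call
compare_with_gold(["detail.tune", "log.tune"]) returns an empty list
when both outputs match their reference copies.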

The output detail.tune indicates that marker d23s001 contains errors.
The family in error is shown in the following portion of log.tune (note
that spouse 5's phenotype is inferred from his neighbors: parents,
spouse(s) and kids):
-----------------------------------------------------------------------------
--- Infer missing phenotypes based on neighbors.
 * Inferring missing phenotypes finds errors, see details below:
        NAME  Phenotype  Obs? (-1: Missing, 1:Inferred, 0:Observed)
  Person     8    -1/-1     -1
  Father     7     0/ 6     0
  Mother     6     0/ 6     0
  Spouse     5     4/ 1     1
    Kid     64     4/ 3     0
    Kid      9     1/ 6     0
    Kid    192     4/ 6     0
    Kid     82     1/ 6     0
     Total *4* kids with this spouse
--- Check: phenotype dropping after inferring pheno
=== Done checking: OK
*== Done infer: inferring missing phenotypes finds error

   ::::: DONE: Processing marker d23s001 :::::
------------------------------------------------------------------------------

At the end of the run, it seems that a tolerance level of 4 should be
appropriate for markers d23s002 through d23s003. For marker d24s001,
the tolerance may have to go up to 8. In the next run, the tolerance
is set to 4 and the maximum number of iterations is increased to
50,000. These settings are given in looking-1.par. We run

>(cd DATA; ../Starting locus phenotype looking-1.par > log.look-1;)

This took 13 minutes, giving output in detail.look-1, genotype.look-1
and inheritance-look.1. According to detail.look-1, starting points
for markers d23s002, d25s001 and d23s003 were found within half a
minute. However, after exhausting the 50,000 iterations in 13 minutes,
the program failed to find starting points for markers d23s005 and
d24s001. In fact, the observed data on d23s005 contain a deliberately
introduced error that our error checking program was not able
to find. The data on d23s005 differ from those on d23s002 in the
phenotype of individual 262 (changed from the correct 3/6 to the wrong
2/7). This change creates an error that can only be discovered by
looking at four generations simultaneously:
------------------------------------------------------------------
Person 252 (phenotype 8/0) and person 240 (phenotype 2/5) have a child
253 with missing phenotype. The marriage between 262 (phenotype 2/7)
and 253 produces child 264 (missing phenotype). 264 is married to 267
(phenotype 4/8), and this marriage has a child 268 (phenotype 4/6).
-------------------------------------------------------------------
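
The inconsistency in this four-generation chain can be verified with a
short Python sketch that propagates the set of possible genotypes down
the pedigree. This is only a hand check for this particular chain,
which has no loops; it is not the algorithm used by START, and the
function names are hypothetical.

```python
from itertools import product


def observed(a, b):
    """Genotype set of a typed person: a single unordered pair."""
    return {tuple(sorted((a, b)))}


def child_genotypes(parent_a, parent_b):
    """All unordered genotypes a child can receive, given the sets of
    possible genotypes of its two parents (one allele from each)."""
    result = set()
    for geno_a, geno_b in product(parent_a, parent_b):
        for x, y in product(geno_a, geno_b):
            result.add(tuple(sorted((x, y))))
    return result


def chain_is_compatible(phenotype_262):
    """Check whether child 268 (4/6) is possible for a given 262 phenotype."""
    g253 = child_genotypes(observed(8, 0), observed(2, 5))  # 253 is untyped
    g264 = child_genotypes(observed(*phenotype_262), g253)  # 264 is untyped
    g268 = child_genotypes(g264, observed(4, 8))            # spouse 267 is 4/8
    return (4, 6) in g268


wrong = chain_is_compatible((2, 7))    # phenotype used for d23s005
correct = chain_is_compatible((3, 6))  # phenotype used for d23s002
```

With the wrong phenotype 2/7 the check fails, because allele 6 of
person 268 has no possible source; with the correct phenotype 3/6 it
passes.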

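Allele 6 of person 268 cannot come from 267, so it must come from 264;
but none of 262, 252 and 240 carries allele 6. In fact, other attempts
on marker d23s005 always come down to a minimum of 1 incompatible
pair.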

For marker d24s001, the forcing rate is 23.32%, and 8.32% of the
forcing steps result in some genotype(s) being replaced. Thus, we have
a good amount of forcing, and we may keep the tolerance and do a
longer run if we are sure the data are right. We decided to increase
the number of iterations to 500,000 and the tolerance to 8. So, we run
the program on the last marker

>(cd DATA; ../Starting locus phenotype looking-2.par > log.look-2;)

This took 9 minutes. A starting point was found at iteration 39,833.


----
Note
----
* Each input line should not exceed (1024-1) characters, spaces included.


----------------
Acknowledgments
----------------

We use part of the library from the Morgan package
(http://linkage.rockefeller.edu/soft/list.html).


----------
References
----------
See http://www.stat.ohio-state.edu/~statgen/PAPERS/START
for the manuscript describing the algorithm. Users of START should
reference the following two papers.

Luo YQ and Lin S (2003). Finding starting points for Markov chain Monte
Carlo analysis of genetic data from large and complex
pedigrees. Genetic Epidemiology, 25, 14-24.

Lin SL, Thompson EA and Wijsman E (1993). Achieving irreducibility of
the Markov chain Monte Carlo method applied to pedigree data. IMA J
Math Appl Med Biol, 10, 1-17.