===================== INSTRUCTIONS ======================

This document contains instructions for compiling and
running SIMPLE (Sequential Imputation for Multi-Point
Linkage Estimation).  SIMPLE uses the same input files used
in GENEHUNTER and takes similar commands.

SIMPLE is made up of two programs written in C.  The first
program is 'simple' which produces importance sampling
weights and (optionally) simulated in-phase genotypes
(if NPL or QTL analysis) and likelihood (if LOD analysis)
ratios.  This output written to the screen which may be
directed to a file.  The second program, 'scan', reads in
the output from simple and produces the desired statistics
(which are written to the screen).  Both programs take
the same commands.


/*********
* Set up *
*********/

untar and unzip SIMPLE.tar.gz

>gzip -d < SIMPLE.tar.gz | tar xvf -

This will produce a directory called /SIMPLE in the
current directory.  /SIMPLE contains: Makefile file,
simple (compiled), scan (compiled), /src directory, /check
directory and instruct.txt file (this document).  Makefile
is used to compile simple and scan.  /src directory
contains the relevant code.  /check directory contains
sample pedigree files to run SIMPLE and ensure that it
was compiled correctly.


/************
* Compiling *
************/

simple and scan were compiled on a Solaris 7.  To recompile
them type

>make simple scan


/********************************
* Basic form for running simple *
********************************/

>simple [command file] > 'simple output file'

Basic form for running scan 


>scan [command file] < 'simple output file' > 'scan ouput file'

The command file allows for changes from the default setup.
As in GENEHUNTER the command file must at least contain the
`load' and `scan' commands to specify the input files.
The following commands may be used, with the defaults
given in [].


/***********
* Commands *
***********/

>load [linkloci.dat]

Gives the GENEHUNTER compatible marker-locus data (allele
frequencies for each genetic marker, frequency and
penetrance information for the disease).  The format of
this file must be identical to the Linkage parameter file
(output from the PREPLINK program).


>scan [ped.dat]

Gives the location of the file with the pedigree, marker,
and disease data.  The pedigree should be in the Linkage
pedigree input format (before running MAKEPED or doing
any preprocessing!).  Each line of this file must have
the following structure:

     3    12   8   9   1   2   1      1 2 ... 0 0   1.269
    (a)   (b) (c) (d) (e) (f) (g)     (h -------)    (i)

    (a)  pedigree name
    (b)  individual ID #
    (c)  father's ID #
    (d)  mother's ID #
    (e)  sex (1=MALE, 2=FEMALE)
    (f)  affectation status (0=UNKNOWN, 1=UNAFFECTED, 
         2=AFFECTED)
    (g)  liability class (OPTIONAL) - classes specified 
         in marker data file
    (h)  marker genotypes
    (i)  quantitative trait values

A `0' in any of the disease phenotype or marker genotype
positions (as in the the genotypes for the last marker
above) indicates missing data. (See the example pedigree
file in /check.)  A non-numeric character for quantitative
traits is the default for missing values for quantitative
traits.  To change this to a numeric character use the
command:


>missingQT [ # ]

where `#' is the numerical character for a missing value
in the quantitative traits data.

You should only enter one pedigree at a time, though in
theory the software could handle more than one.  However
this will put a strain on the memory.  In general, you
will be able to handle bigger pedigrees if they are run
one at a time.


>maxiter [ 5000 ]

The number of imputed data sets to be generated by the
sequential imputation procedure.  The null distribution
is estimated with 20*maxiter iterations.


>increment step [ 5 ]

Acts the same way as increment step does in GENEHUNTER.
However, simple does not currently calculate scores outside
the marker map.  This should be changed in the future.
In addition the increment scan command of GENEHUNTER
has not been implemented, and may not be in the future
versions.


>seed [ 123456789 ]

As simple is a Monte Carlo procedure, a starting point
for the random number generator is required.  While a
default value is given, this should be set for every run.
A valid integer between 0 and 2147483647 is required.
SIMPLE automatically drops a random integer in a file
called 'seed' after execution.  This file may be appended
to the command file to accomodate changing the seed.


>debug 

There is a debug mode simple which allows for intermediate
results to be examined.  It was mainly added originally to
allow easier debugging as the code was being developed.
The default is not to debug.  If you wish to use this
procedure, inclue 'debug' in the command file.


>process order [1 2 ... nmarkers]

The order that the markers are processed can be set.
The default is to process the markers in decreasing order
based on the number of alleles (Most alleles processed
first, least alleles processed last).  If two or more
markers have the same number of alleles, the lowest
numbered marker gets done first.  If a disease locus is
to be processed, it is always processed last.


>interference [ 19 0 ]

Currently only one interference model is included
in SIMPLE.  The chisquare model with intensity m is
indicated by

>interference 19 m

Note that a chisquare model with m=0 is the no interference
model.  This is the default setting.  In the future, other
interference models will be added.  The only change that
will be observed by the user is that additional options
to this command will become available.


>analysis [ BOTH ]

As in GENEHUNTER one may input 'BOTH', 'NPL' or 'LOD'.
In addition, one may input 'NONE'.  This may be useful if
you just wish to conduct a QTL analysis.


>score [ PAIRS ]

As in GENEHUNTER one may input PAIRS or ALL.


>units [ ]

One may input 'cM' or 'rec-frac'.  The default, as
in GENEHUNTER, is to assume that the distances are in
recombination fractions unless at least one distance is
greater than .5.


>peel [ <file name> ]

Indicate peeling order in file.  This is only necessary
if the pedigree contains loops.  The file should have
2 columns.  The first column contains the cut set.
The second column contains the peel set.  If there is more
than one member in any of the sets then separate them with
commas.  The order of the cut and peel sets specifies the
peeling order.  The last cut set, the root, should be 0.
An example peeling file is given below.

>849, 845        853
>844, 848        850
>844, 848        851
>839, 840        843
>839, 840        846
>839, 840        847
>845, 844        839,840
>849, 848        841,842
>849, 845, 844   848
>849, 845        844
>849             845
>0               849


>total stat

As in GENEHUNTER, total stat produces the scores for the
combined pedigrees.  The default is to produce the scores
for the pedigrees separately.


>dumppairs

Print out the expected alleles shared IBD for each
relative pair.  This is printed after the scores.  The rows
correspond to the location and the columns correspond to
the pair.  The header lists the PIDs of the pairs that
each column corresponds to.  The first element of each
row is labeled by the location that it corresponds to.
Here is an example of the output from dumppairs:

>.
>.
>.
>===========================
>Expected alleles shared IBD
>===========================
>pedigree   1
>pair:    1     2     3 ...
>mem1:   839   839   839 ...
>mem2:   843   844   845 ...
> 0.00 1.000000e+00 1.000000e+00 1.000000e+00 ...
> 1.87 1.000000e+00 1.000000e+00 1.000000e+00 ...
> 3.73 1.000000e+00 1.000000e+00 1.000000e+00 ...
>.
>.
>.

The first pair in pedigree 1 contained the members 839
and 843 who had expected alleles shared IBD of 1, etc...

>dumpprobs

Print out the probability of sharing 0,1 2 alleles IBD for each
relative pair.  This is printed after the scores.  The rows
correspond to the location and the columns correspond to the pair.  The
header lists the PIDs of the pairs that each column corresponds to.
The first element of each row is labeled by the location that it
corresponds to and in parentheses the number of alleles shared IBD.
Here is an example of the output from dumpprobs:

>.
>.
>.
>===========================
>Expected alleles shared IBD
>===========================
>pedigree   1
>pair:    1     2     3 ...
>mem1:   839   839   839 ...
>mem2:   843   844   845 ...
> 0.00 (0) 0.000000e+00 0.000000e+00 0.000000e+00 ...
> 0.00 (1) 1.000000e+00 1.000000e+00 1.000000e+00 ...
> 0.00 (2) 0.000000e+00 0.000000e+00 0.000000e+00 ...
> 1.87 (0) 0.000000e+00 0.000000e+00 0.000000e+00 ...
>.
>.
>.

The first pair in pedigree 1 contained the members 839
and 843 who had probability 1 of sharing 1 allele IBD...

>dumpscores

Print out the raw scores simulated conditioned on the data.
The output is printed out before the estimated scores
are printed.  The rows correspond to each iteration.
The first column is iteration number, the second column is
the weight and the rest of the columns correspond to the
simulated scores at each location.  Here is an example
output from dumpscores:

>1 4.809104e-14 2.500000e-01 2.500000e-01 5.000000e-01 ...
>2 4.809104e-14 2.500000e-01 2.500000e-01 5.000000e-01 ...
>3 4.809104e-14 2.500000e-01 5.000000e-01 5.000000e-01 ...
>.
>.
>.

The first three iterations are shown.  The weight for the
first iteration is 4.809104e-14.  The score simulated in
the first iteration at the first location is 2.500000e-01,
etc...


/***********************
* Example command file *
***********************/

The following example command file includes all possible
commands.  In most cases the default values are used.
This would be from an example with 5 markers and a disease.


>load linkloci.dat
>scan ped.dat
>analysis BOTH
>score PAIRS
>maxiter 5000
>increment step 5
>seed 123456789
>process order 1 2 3 4 5 
>interference 19 0
>units cM
>total stat
>dumppairs
>dumpscores


At the end of the output from simple and scan are written
the the commands that were used for the session.  (See the
example output in /check.)


/****************
* Check results *
****************/

/check contains an example data set in ped.dat and
linkloci.dat.  After you compile SIMPLE you can check
the output from simple with out and the output from scan
with scores.  The command file 'cmd' contains the commands
necessary to do the check and also has other commands
commented out which one may use.  This command file may
serve as a template.


/****************************
* Interpretation of results *
****************************/

Appended at the end is a sample output file after
running SIMPLE on the sample data provided in the 'check'
directory.  The command file was the following:


------------
Command file 
------------

>load linkloci.dat
>scan ped.dat


We see that the command file simply indicated which locus
file to load and which pedigree file to scan.  The default
statistics are both NPL-PAIRS and LOD scores.  (See list
of commands used appended at the end of the output file.)

The first statistics that are reported (see file below)
are the estimated null mean and variance (under the columns
'mean' and 'var', respectively).  Next are the NPL scores
at the different locations.  The first column, `ped',
indicates which pedigree these statistics correspond to.
The second column, `pos', indicates the position that
the statistics were calculated at in cM.  The third
column, `S', are the raw scores at those positions.
The fourth column, `Z', are the standardized scores at
those positions.  The next to last column, `SE(Z)' are
the monte carlo SEs of the standardized scores.  The last
column are the estimated exact p-values.

The LOD scores follow the NPL scores.  The first column
`pos' are the locations in cM where the scores were
calculated.  The second column, `LOD', are the estimated
LOD scores.  And the last column, `SE(LOD)', are the Monte
Carlo SEs of the estimated LOD scores.

-------------------
Example Output File
-------------------

>Null Scores
>ped     mean        var
>*** ********* *********
>  1  1.906368  0.597479
>
>ped   pos         S         Z +/-     SE(Z) p-valu
>*** ***** ********* ********* *** ********* *******
>  1  0.00  1.625000 -0.364010 +/-  0.000000 0.6259
>  1  4.00  1.977075  0.091475 +/-  0.007706 0.2502
>  1  8.00  2.316438  0.530513 +/-  0.009376 0.2502
>  1 12.00  2.667263  0.984381 +/-  0.009395 0.2502
>  1 16.00  3.011613  1.429872 +/-  0.007738 0.2502
>  1 20.00  3.375000  1.899992 +/-  0.000000 0.1236
>  1 24.00  3.016950  1.436777 +/-  0.007702 0.2502
>  1 28.00  2.675387  0.994892 +/-  0.009369 0.2502
>  1 32.00  2.321700  0.537321 +/-  0.009394 0.2502
>  1 36.00  1.982575  0.098590 +/-  0.007749 0.2502
>  1 40.00  1.625000 -0.364010 +/-  0.000000 0.6259
>  1 44.00  1.986063  0.103102 +/-  0.007780 0.2502
>  1 48.00  2.325287  0.541962 +/-  0.009394 0.2502
>  1 52.00  2.661738  0.977233 +/-  0.009403 0.2502
>  1 56.00  3.014400  1.433478 +/-  0.007728 0.2502
>  1 60.00  3.375000  1.899992 +/-  0.000000 0.1236
>
>
>LOD Scores
>pos   LOD       SE(LOD)
>***** *****************
> 0.00 -10000.000000 0.0000
> 4.00  -1.600278 0.0000
> 8.00  -1.426529 0.0000
>12.00  -1.427397 0.0000
>16.00  -1.602947 0.0000
>20.00 -10000.000000 0.0000
>24.00  -1.599642 0.0000
>28.00  -1.426412 0.0000
>32.00  -1.427404 0.0000
>36.00  -1.602604 0.0000
>40.00 -10000.000000 0.0000
>44.00  -1.603792 0.0000
>48.00  -1.427640 0.0000
>52.00  -1.426351 0.0000
>56.00  -1.599998 0.0000
>60.00 -10000.000000 0.0000
>
>The following commands were used.
>An '*' indicates that the command was found in the
>command file 'cmd' and may not necessarily be the
>default value.
>-------------------------------------------------
>load linkloci.dat*
>scan ped.dat*
>analysis BOTH
>score ALL*
>units (AUTO) cM
>maxiter 5000
>increment step 5
>seed 123456789
>NO reweighting used
>process order  (AUTO) 1 2 3 4 5
>interference 19 m=0