# MendelSoft

Mendelsoft is a pedigree genotyping correction frontend to toulbar2. It can read .pre files, identify likely genotyping errors and propose correction from them, asuming the pedigree structure is correct. It can rely either on a parcimonious criteria or a more involved maximum log-likelihood criteria.

## Introduction

**MendelSoft** is an open source software which detects marker genotyping incompatibilities (Mendelian errors only) in complex pedigrees using weighted constraint satisfaction techniques.

The input of the software is a pedigree data with genotyping data at a **single** locus.
The output of the software is a list of individuals for which the removal of their genotyping data restores consistency. This list is of *minimum size* when the program ends.

Another possibility is to find the most probable consistent correction with respect to a Bayesian formulation of the problem. In this case, the output of the software is a list of individuals for which predicted genotypes differ from their genotyping data and such that the corresponding *joint probability for the whole problem is maximum*.

The problem, its formulation as a weighted constraint satisfaction problem, and some experimental results on simulated and real large animal pedigrees (up to 129516 individuals) are described in:

M. Sanchez, S. de Givry, and T. Schiex

Mendelian error detection in complex pedigrees using weighted constraint satisfaction techniques (preprint)

In *Constraints* journal, special issue on bioinformatics, 13(1), 2008.

S. de Givry, I. Palhiere, Z. Vitezica, and T. Schiex

Mendelian error detection in complex pedigree using weighted constraint satisfaction techniques

In *ICLP-05 workshop on Constraint Based Methods for
Bioinformatics*, page 9p., Sitges, Spain, 2005.

## Download and installation

Mendelsoft is based on __toulbar2__ solver. Download the latest C++ source code and release here: toulbar2 for Linux. A precompiled Linux package is available on __debian__ distributions.

**Version 0.9.8** can be found here: version 0.9.8 for Linux and Windows (executables only).

On Linux computers (it should work on other systems also), after downloading, extract the zipped source codes:

```
unzip toulbar2-1.0.1.zip
```

Then, enter in the source code directory and compile using cmake/ccmake (set option MENDELSOFT_ONLY to ON):

```
cd toulbar2-1.0.1
ccmake .
cmake .
make
```

It will generate an executable called __mendelsoft__ in directorty *bin/Linux*. Note that if you already have an executable __toulbar2__, it does the same as __mendelsoft__ plus some extra solving options.

On **Windows** computers, compile all the .cpp files together using Visual C++ with the following macro definitions:

```
DMENDELSOFT -DWCSPFORMATONLY -DNARYCHAR -DINT_COST -DDOUBLE_PROB -DWIN32 -D_DEBUG -D_WINDOWS -DWINDOWS -DNDEBUG
```

*Warning!* On **Windows** computers, the maximum precision for probability encoding should be lower than 4. See below.

## Input File Format

**MendelSoft** input format is based on LINKAGE format (PEDFILE).
The input must contain the following information for each individual:

- a pedigree number (locus)
- an individual identification number (different from zero), or id
- father's id number
- mother's id number
- sex
- first allele number
- second allele number

Father and mother id’s are 0 for founders, or other members of the pedigree for whom information on parents is absent. Partial information is allowed (when only one parent is known).

The sex field is coded 1 for males and 2 for females.

The genotyping data consists of two unordered allele numbers (positive integers). These are codominant alleles at a single locus. An unknown genotype is coded as 0 0.

If you assume there are no errors with a particular genotyping data, then the data can be made mandatory by using negative allele numbers.

## Running MendelSoft with a parsimony criterion

Try the following simple pedigree available in the source code directory and here:

```
mendelsoft simple.pre
```

The software output will be:

```
5 informative individuals found (either genotyped or having a genotyped descendant).
Read pedigree with 5 individuals, 3 founders, 4 alleles, 4 genotypings and 3 generations.
Cost function decomposition time : 0 seconds.
Preprocessing Time : 0.01 seconds.
0 unassigned variables, 0 values in all current domains (med. size:0, max size:10) and 0 non-unary cost functions (med. degree:0)
Initial lower and upper bounds: [1, 2] 50.000%
New solution: 1 (0 backtracks, 0 nodes, depth 1)
Correction: 2 (1)
Optimum: 1 in 0 backtracks and 0 nodes ( 2 removals by DEE) and 0.001 seconds.
end.
```

Check the last solution found in the output. Its *cost* corresponds to the minimum number of genotyping data to be removed in the pedigree in order to restore consistency (equal to 1 in this example). The following Correction output line gives a list of individual ids for which the removal of their genotyping data restores consistency, followed by the number of corrections in parenthesis.

Use option *-w* (*or without the minus sign for previous version 0.8a or older*) to save a consistent pedigree corresponding to the last correction found into a file called *pedigree_corrected.pre* in the current directory.
Try the following command:

```
mendelsoft simple.pre -w
```

*or for previous version 0.8a:*

```
mendelsoft simple.pre w
```

and look at the output file *pedigree_corrected.pre*.

* NEW!* Use option

*--save=filename*to modifiy the output file name.

Executing __mendelsoft__ without any parameter gives a help message with available options.
It is possible to speed up the search by giving an initial strict upper-bound on the maximum number of genotyping data to be removed. If this upper-bound is lower than or equal to the optimum, then __mendelsoft__ will return no solution.

Try the following command:

```
mendelsoft simple.pre -ub=1
```

*or for previous version 0.8a:*

```
mendelsoft simple.pre 1
```

The software output will be:

```
5 individuals found with a genotyped descendant..
No solution found by initial propagation!
end.
```

Other pedigree examples can be found here.

## Running MendelSoft with a Bayesian formulation

Use option *y* to find the most probable consistent correction. Try it on the simple pedigree problem:

```
mendelsoft simple.pre -w -bayes
```

*or for previous version 0.8a:*

```
mendelsoft simple.pre wy
```

The software output will be:

```
5 informative individuals found (either genotyped or having a genotyped descendant).
Read pedigree with 5 individuals, 3 founders, 4 alleles, 4 genotypings and 3 generations.
Bayesian MPE (genotyping error rate: 0.05, genotype prior: 0, precision(1-10^-p): 7, normalization: 1e+07, ub: 274677317)
Cost function decomposition time : 8e-06 seconds.
Preprocessing time: 0.000853 seconds.
0 unassigned variables, 0 values in all current domains (med. size:0, max size:10) and 0 non-unary cost functions (med. degree:0)
Initial lower and upper bounds: [136646016, 136646017] 0.000%
New solution: 136646016 energy: 13.665 prob: 1.163e-06 (0 backtracks, 0 nodes, depth 1)
Correction: 5 (1)
Optimum: 136646016 energy: 13.665 prob: 1.163e-06 in 0 backtracks and 0 nodes ( 3 removals by DEE) and 0.001 seconds.
end.
```

Look at the output file *pedigree_corrected.pre*, individual 5 has been corrected with a predicted genotype of 1/3, having a joint probability of 1.16289e-06. In comparison, the correction on individual 2 found by the previous parsimony approach has a joint probability of 5.81427e-07. Notice that the Bayesian formulation can be much slower to solve than the parsimony approach.

It is possible to select the way of correcting the erroneous genotypings and also predicting the missing genotypes in the output file *pedigree_corrected.pre*, by adding a value (0,1, or 2) just after the option *w*:

*-w=0*(or*w0*for previous version 0.8a): remove erroneous genotypings in the output file (default case).*-w=1*(or*w1*): correct erroneous genotypings to their prediction (in conjunction with option*-bayes*(or*y*) for the Bayesian formulation).*-w=2*(or*w2*): correct erroneous genotypings to their prediction and infer missing genotypes for informative individuals (non-informative individuals are not genotyped and have no genotyped descendant in the pedigree).

By default, we assume 5% of genotyping errors and equivalent allele frequencies. See the help message obtained by executing __mendelsoft__ without any parameter to change these defaults. For instance, try the following command line, having only 1% of genotyping error uniform prior and an allele probability distribution equal to ( allele1: 0.4, allele2: 0.4, allele3: 0.1, allele4: 0.1 ):

```
mendelsoft simple.pre -w -bayes -genoError=0.01 -precision=7 -problist= 4 0.4 0.4 0.1 0.1
```

*or for previous version 0.8a:*

```
mendelsoft simple.pre wy 0.01 7 2 0.4 0.4 0.1 0.1
```

*Warning!* On **Windows** computers, the maximum precision for probability encoding should be lower than 4. Try

```
mendelsoft simple.pre -w -bayes -genoError=0.01 -precision=4 -problist= 4 0.4 0.4 0.1 0.1
```

*or for previous version 0.8a:*

```
mendelsoft simple.pre wy 0.01 4 2 0.4 0.4 0.1 0.1
```

Check genotype priors, deduced from the selected allele probability distribution, using verbose option *-v=1* (or *v* for previous version 0.8a).

You can penalize genotyping removals for individuals which have many genotyped children (only with a Bayesian formulation). Use option *-u=[threshold]* (or *u[threshold]* for previous version 0.8a) with a threshold number of genotyped children. __mendelsoft__ adds a penalty weight (logarithmically proportional to the number of genotyped children) on genotyping removals if the number of genotyped children for the corresponding individuals is strictly greater than the threshold. At each correction output line, the penalized likelihood is decomposed into the original likelihood and the total penalty term (given in parenthesis).

* NEW!* If the problem is large and difficult to solve, you can add one or a combination of the following options:

*-l*(performs several searches with an increasing complexity)*-hbfs*(performs hybrid best-first search with optimality gap information)*-p=3*(applies more problem simplifications in preprocessing)

## Weighted Constraint Satisfaction Problems

**MendelSoft** is directly derived from ToulBar2, an open source weighted constraint satisfaction solver. This is an on-going project and future versions will appear regularly.

The main algorithms used by the solver are described in:

J. Larrosa, and T. Schiex

In the quest of the best form of local consistency for Weighted CSP

In *Proc. of IJCAI-03*, pages 239-244, Acapulco, Mexico, 2003.

S. de Givry, M. Zytnicki, F. Heras, and J. Larrosa

Existential arc consistency: Getting closer to full arc consistency in Weighted CSPs

In *Proc. of IJCAI-05*, pages 84-89, Edinburgh, Scotland, 2005.

J. Larrosa, E. Morancho, and D. Niso

On the practical applicability of Bucket Elimination: Still-life as a case study*Journal of Artificial Intelligence Research*. 23 :421-440, 2005.

C. Lecoutre, L. Sais, S. Tabary, and V. Vidal

Last Conflict based Reasoning

In *Proc. of ECAI-06*, pages 133-137, Trento, Italy, 2006.

For more information, see the old SoftCSP web site and the new Cost Function web site

Please report bugs or problems to: *Simon de Givry*

Email: *simon.de-givry AT inrae.fr*

Last Update:

*May 23nd 2019*