Brief Description
-----------------

BioOptimizer is an algorithm designed to clean up motif-finding output
by finding the configuration of motif start sites that maximizes a
scoring function based on the log-posterior distribution given in
Jensen et al. (2004) and Jensen and Liu (2004). In addition, the
algorithm allows the motif width to vary and attempts to find the best
motif width, again in terms of maximizing the scoring function.

There is a separate version of BioOptimizer for each of the following
motif-finding input formats:

    BioOptimizer.biop.ver2     - BioOptimizer for BioProspector input
    BioOptimizer.aa            - BioOptimizer for AlignACE input
    BioOptimizer.con           - BioOptimizer for Consensus input
    BioOptimizer.meme          - BioOptimizer for MEME input
    BioOptimizer.twoblock.ver2 - BioOptimizer for two-block BioProspector input

It is necessary to have results from one of these motif-finding
programs (BioProspector, Consensus, AlignACE, MEME) before running
BioOptimizer.

Users of old versions of BioProspector may need to use the
BioOptimizer programs "BioOptimizer.biop.ver1" and
"BioOptimizer.twoblock.ver1", since the output format of BioProspector
changed in Spring 2004.

For the rest of the manual, we refer to all of these programs simply
as "BioOptimizer", with the understanding that the actual program
being used will be different for different input formats.

References
----------

JENSEN, S.T., LIU, X.S., ZHOU, Q. and LIU, J.S. (2004). Computational
discovery of gene regulatory binding motifs: a Bayesian perspective.
Accepted for publication in Statistical Science.

JENSEN, S.T. and LIU, J.S. (2004). BioOptimizer: a Bayesian scoring
function approach to motif discovery. Accepted for publication in
Bioinformatics.

Copyright
---------

BioOptimizer, version 1.0, is copyrighted to Shane T. Jensen (2003)
BioOptimizer, version 2.0, is copyrighted to Shane T. Jensen (2004)

Software Requirements
---------------------

It is necessary to have the "Math::SpecFun::Gamma" Perl library
installed prior to using BioOptimizer.

Input Information Needed
------------------------

1. file of DNA sequences: seqfile
   Note that sequences should be in the following format:

       >genename1
       acagctagctagcatcgatctagctgctacgat
       >genename2
       agacgtacgatcgatcgactgcgtcatgactac

2. output file from a motif-finding program: inputfile

3. number of different motifs in motif-finding output: nummotif

4. should the reverse complement of the input sequences also be
   searched? rc = 1 if yes, rc = 0 if no

5. a priori expectation of motif width: w0

Command to Start Program
------------------------

The general command line is:

    ./BioOptimizer seqfile inputfile nummotif rc w0

This command will vary depending on the inputfile used. For example,
if BioProspector output is used as input, the command line is:

    ./BioOptimizer.biop.ver2 seqfile inputfile nummotif rc w0

Example:

    ./BioOptimizer.biop.ver2 example.upstream example.biop 5 1 7

Output:

    inputfile.opt.all  - file with the optimized version of each motif
                         input, along with site predictions
    inputfile.opt.best - file with the optimized motif that had the best
                         final score compared to the other motifs
    inputfile.opt.sum  - file with a summary of each optimization:
                         starting motif width and score compared to
                         final motif width, score and consensus

Notes:

1. Each motif is described in terms of a consensus matrix and the
   corresponding consensus sequence. The consensus sequence is formed
   by taking the dominant nucleotide in each column of the motif
   matrix, with a capital letter indicating that the nucleotide has
   over 75% conservation, and a small letter if conservation is < 75%.

2. Sites are given as the starting position of the site (counted from
   the start of the input sequences given), as well as the strand:
   "f" for forward, "r" for reverse.

3. If the input file contains only one motif, inputfile.opt.all and
   inputfile.opt.best will be the same.

4. The null score is given for each motif, which is the score a motif
   would get if it had the same width and background nucleotide
   frequencies but no sites at all. This is included for rough
   comparison only. If the null score is greater than the final motif
   score, this indicates that the final motif is not strong.

If using the two-block version of BioOptimizer on two-block
BioProspector output, the command line is:

    ./BioOptimizer.twoblock seqfile inputfile nummotif rc w1 w2 g1 g2

where

    w1 = expected width of block 1
    w2 = expected width of block 2
    g1 = minimum length of gap between blocks
    g2 = maximum length of gap between blocks

Fun Program Details
-------------------

1. BioOptimizer.biop versions now allow for sequences where
   nucleotides have been masked out by the character "N" and/or "n".

2. BioOptimizer currently uses a Poisson prior for the motif width and
   a simple multinomial model for the background parameters.
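
As a rough illustration of a Poisson prior on motif width, the sketch
below evaluates the log of a plain Poisson probability at a candidate
width w. It assumes the prior mean equals the a priori expected width
w0 given on the command line; that parameterization, the function name
and the lack of any truncation are assumptions made for illustration,
not details taken from the BioOptimizer source.

```python
import math


def log_poisson_width_prior(w, w0):
    """Log of a Poisson(w0) probability evaluated at motif width w.

    Assumption: the prior mean is the expected width w0. This is a
    hypothetical helper, not BioOptimizer's actual implementation.
    """
    if w < 0 or w0 <= 0:
        raise ValueError("need w >= 0 and w0 > 0")
    # log( w0^w * exp(-w0) / w! ), using lgamma(w + 1) = log(w!)
    return w * math.log(w0) - w0 - math.lgamma(w + 1)


if __name__ == "__main__":
    # Compare a few candidate widths under an expected width of 7.
    for width in (5, 7, 9, 12):
        print(width, log_poisson_width_prior(width, 7))
```

Under this sketch, widths far from w0 receive a lower log-prior, so a
change of width has to be paid for by a better fit to the sites.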
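
The consensus rule described in Note 1 of the Output section can be
sketched as follows. The function name and the input layout (one
dictionary of nucleotide counts per motif column) are hypothetical and
chosen only for illustration. The manual does not say how exactly 75%
conservation is written, so this sketch assumes lower case in that
situation; ties between nucleotides are broken by dictionary order,
which is also an assumption.

```python
def consensus_from_counts(columns, threshold=0.75):
    """Build a consensus string from per-column nucleotide counts.

    columns: list of dicts mapping "A", "C", "G", "T" to counts.
    Upper case marks conservation strictly above the threshold;
    everything else (including exactly 75%, by assumption) is lower case.
    """
    letters = []
    for column in columns:
        total = sum(column.values())
        if total <= 0:
            raise ValueError("each column needs at least one observed site")
        # Dominant nucleotide; ties go to the first key in the dict.
        base, count = max(column.items(), key=lambda item: item[1])
        conserved = count / total > threshold
        letters.append(base.upper() if conserved else base.lower())
    return "".join(letters)


if __name__ == "__main__":
    example = [
        {"A": 9, "C": 1, "G": 0, "T": 0},   # 90% A
        {"A": 2, "C": 2, "G": 5, "T": 1},   # 50% G
        {"A": 0, "C": 0, "G": 0, "T": 10},  # 100% T
        {"A": 3, "C": 6, "G": 1, "T": 0},   # 60% C
    ]
    print(consensus_from_counts(example))  # prints AgTc
```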