Finding short, recurring sequences (motifs) in DNA, RNA, and protein sequence data is a common task in computational biology. It is widely believed, and often validated, that motifs found in sequences sharing common expression features represent binding sites of regulatory elements. For example, searching for motifs in the promoter regions of genes whose expression is adversely affected by knocking down a transcription factor should reveal that transcription factor's binding site. Despite being first published in 1994, MEME (Multiple EM for Motif Elicitation) is still the most widely used and cited tool for probabilistic motif finding. However, MEME relies on a combination of statistics, heuristics, and machine learning to find motifs. By contrast, I will present a fully Bayesian model that searches for motifs in biological sequence data. The MCMC algorithm considers a wide variety of motif alignments and lengths, as well as an unbounded number of motifs, in order to avoid poor local maxima in the probability landscape and to sample accurately and effectively from the posterior.
