Research interests | Publications | Software | Teaching | Short CV

Efficient algorithms for structured motif extraction in DNA sequences

Alexandra M. Carvalho

Supervised by Arlindo L. Oliveira and Ana T. Freitas

Master's Thesis, IST, Universidade Técnica de Lisboa, 2004

In this thesis we propose a new algorithm for the extraction of repeated motifs that may represent binding-site consensi in genomic sequences. In particular, the algorithm extracts structured motifs, which are defined as a collection of highly conserved segments of DNA with pre-specified sizes and spacings between them. This type of motifs is highly relevant in the search for gene regulatory mechanisms since promoter regions can be effectively modeled by structured motifs.

The new algorithm uses factor trees, a variation of suffix trees, and a new data structure, called box-links, to store the information about conserved regions that repeat often in the dataset sequences. The time and space complexity, in the worst case analyses, shows a gain over previous algorithms that is exponential on the spacings between the segments of the structured motifs. The application of a prototype implementation of this algorithm to biologically relevant datasets shows the ability of the method to extract relevant consensi. The experimental results also show that this algorithm is much faster than existing ones, sometimes by more than two orders of magnitude.

To deal with the enormous amount of data coming from this large-scale genome sequencing era, a parallel method is also proposed for the efficient extraction of structured motifs. By partitioning the structured motif searching space we divide the most demanding part of the algorithm by a number of processors that can be loosely coupled. In this way we obtain, under conditions that are easily met, a speedup that is linear on the number of available processing units. This speedup is verified by both theoretical and experimental analyses.

**Keywords:** Promoter prediction, Suffix tree, Factor tree, Structured motif, Box-link, Parallel algorithm.

**Get a preprint:**
[ps]
[pdf]
**Get a bibentry:**
[bib]
[bbl]

Research interests | Publications | Software | Teaching | Short CV