Date of Award

2020

First Advisor

Marina Barsky

Second Advisor

Tim Susse

Abstract

We investigate the problem of building full-text substring indexes for inputs significantly larger than main memory. This problem is especially important in the context of biological sequence analysis, where biological polymers can be thought of as very large contiguous strings. The goal is to index every possible substring in order to support efficient queries on very long strings, that is, strings that cannot be loaded entirely into main memory. We propose a new, simple, and scalable algorithm for constructing such an index for these out-of-core inputs. Our algorithm, Suffix Rank, scales to arbitrarily large inputs by using disk as a memory extension. It solves the problem in just O(log n) scans over the disk-resident data. We evaluate the practical performance of our algorithm and show that, for very large inputs, it outperforms current state-of-the-art solutions such as eSAIS [7] and SAscan [13].
