Date of Submission

Spring 2013

Academic Program

Computer Science

Project Advisor 1

Rebecca Thomas

Abstract/Artist's Statement

Measures of English sentence similarity are used in a vast number of applications, such as web page information retrieval systems, online advertising, question-answering dialogue systems, text summarization, and text mining. Over the years, a number of algorithms have been proposed for this difficult problem, but none of them gives sufficiently good answers.

In this project, we explore three different algorithms for computing English sentence similarity. The first algorithm, which is well-explored in the literature [Salton and Buckley, 1988, Wu and Salton, 1981], weights words in each sentence according to term frequency and inverse document frequency (tf-idf) and uses no semantic information. The second algorithm uses measures of the semantic distance between words belonging to the same part of speech. The third algorithm combines the tf-idf scores and the semantic distance scores between words.
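To make the first algorithm concrete, here is a minimal sketch of tf-idf weighting followed by cosine similarity. It is an illustration under stated assumptions, not the project's implementation: the function names `tfidf_vectors` and `cosine`, the whitespace tokenizer, the smoothed idf formula, and the use of cosine similarity to compare the weighted vectors are all choices made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one {word: tf-idf weight} dict per sentence.

    Assumption: each sentence is treated as its own "document" and
    tokens are produced by lowercasing and splitting on whitespace.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption) keeps every weight strictly positive.
        vectors.append({
            word: (count / len(tokens)) * (math.log((1 + n) / (1 + df[word])) + 1)
            for word, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because this measure relies only on shared surface words, two sentences with no words in common score 0 even when they mean the same thing, which is the gap the semantic distance scores in the second and third algorithms are meant to address.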

We evaluate the performance of the second and third algorithms on two data sets: O’Shea’s set of sentence pairs with human similarity judgments [Li et al., Aug, Rubenstein and Goodenough, 1965], and Microsoft Research’s sentence-level paraphrase data set [Rus et al., 2012]. On O’Shea’s data set, the third algorithm matches human judgments more accurately than the second does. On the Microsoft data set, there is no significant difference between the two algorithms.

Distribution Options

Access restricted to On-Campus only

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.