Measures of English sentence similarity are used in a wide range of applications, including web page information retrieval, online advertising, question-answering dialogue systems, text summarization, and text mining. Over the years, a number of algorithms have been proposed for this difficult problem, but none of them gives a sufficiently good answer.
In this project, we explore three different algorithms for computing English sentence similarity. The first algorithm, which is well explored in the literature [Salton and Buckley, 1988, Wu and Salton, 1981], weights words in each sentence according to term frequency and inverse document frequency (tf-idf) and uses no semantic information. The second algorithm uses measures of the semantic distance between words belonging to the same part of speech. The third algorithm combines the tf-idf scores and the semantic distance scores between words.
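As a rough illustration of the first, purely statistical approach, the following Python sketch weights each word by tf-idf and compares two sentences by the cosine of their weighted vectors. The whitespace tokenizer, the smoothed idf formula, the use of cosine similarity, and all function names are assumptions made for this example; they are not taken from the project's implementation.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return sentence.lower().split()


def idf_weights(corpus):
    # Document frequency: number of sentences in which each word occurs.
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    # Assumption: smoothed idf, so words in every sentence keep a nonzero weight.
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    # Weight used for words that never occur in the corpus.
    default = math.log(1 + n) + 1.0
    return idf, default


def tfidf_vector(sentence, idf, default):
    # Sparse vector: word -> term frequency times inverse document frequency.
    tf = Counter(tokenize(sentence))
    return {w: count * idf.get(w, default) for w, count in tf.items()}


def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tfidf_similarity(a, b, corpus):
    idf, default = idf_weights(corpus)
    return cosine(tfidf_vector(a, idf, default), tfidf_vector(b, idf, default))


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
]
print(tfidf_similarity(corpus[0], corpus[1], corpus))
print(tfidf_similarity(corpus[0], corpus[2], corpus))
```

Because this measure relies only on shared surface words, two sentences with no words in common score zero even when they mean the same thing, which is the gap the semantic-distance algorithms aim to close.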
We evaluate the performance of the second and third algorithms on two data sets: O’Shea’s set of sentence pairs with human similarity judgments [Li et al., Aug, Rubenstein and Goodenough, 1965], and Microsoft Research’s sentence-level paraphrase data set [Rus et al., 2012]. On O’Shea’s data set, the third algorithm matches human judgments more accurately than the second. On the Microsoft data set, we find no significant difference between the two algorithms.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
Zaman, Anis, "A Statistical and Semantic Approach to Sentence Similarity" (2013). Senior Projects Spring 2013. 339.