Exploiting Wikipedia Semantics for Computing Word Associations

Jabeen, Shahida

doi:10.26686/wgtn.20387697

thesis.pdf (2.57 MB)

Exploiting Wikipedia Semantics for Computing Word Associations

thesis

posted on 2022-07-28, 00:05 authored by Jabeen, Shahida

Semantic association computation is the process of automatically quantifying the strength of a semantic connection between two textual units based on various lexical and semantic relations such as hyponymy (car and vehicle) and functional associations (bank and manager). Humans have can infer implicit relationships between two textual units based on their knowledge about the world and their ability to reason about that knowledge. Automatically imitating this behavior is limited by restricted knowledge and poor ability to infer hidden relations.

Various factors affect the performance of automated approaches to computing semantic association strength. One critical factor is the selection of a suitable knowledge source for extracting knowledge about the implicit semantic relations. In the past few years, semantic association computation approaches have started to exploit web-originated resources as substitutes for conventional lexical semantic resources such as thesauri, machine readable dictionaries and lexical databases. These conventional knowledge sources suffer from limitations such as coverage issues, high construction and maintenance costs and limited availability. To overcome these issues one solution is to use the wisdom of crowds in the form of collaboratively constructed knowledge sources. An excellent example of such knowledge sources is Wikipedia which stores detailed information not only about the concepts themselves but also about various aspects of the relations among concepts.

The overall goal of this thesis is to demonstrate that using Wikipedia for computing word association strength yields better estimates of humans' associations than the approaches based on other structured and unstructured knowledge sources. There are two key challenges to achieve this goal: first, to exploit various semantic association models based on different aspects of Wikipedia in developing new measures of semantic associations; and second, to evaluate these measures compared to human performance in a range of tasks. The focus of the thesis is on exploring two aspects of Wikipedia: as a formal knowledge source, and as an informal text corpus.

The first contribution of the work included in the thesis is that it effectively exploited the knowledge source aspect of Wikipedia by developing new measures of semantic associations based on Wikipedia hyperlink structure, informative-content of articles and combinations of both elements. It was found that Wikipedia can be effectively used for computing noun-noun similarity. It was also found that a model based on hybrid combinations of Wikipedia structure and informative-content based features performs better than those based on individual features. It was also found that the structure based measures outperformed the informative content based measures on both semantic similarity and semantic relatedness computation tasks.

The second contribution of the research work in the thesis is that it effectively exploited the corpus aspect of Wikipedia by developing a new measure of semantic association based on asymmetric word associations. The thesis introduced the concept of asymmetric associations based measure using the idea of directional context inspired by the free word association task. The underlying assumption was that the association strength can change with the changing context. It was found that the asymmetric association based measure performed better than the symmetric measures on semantic association computation, relatedness based word choice and causality detection tasks. However, asymmetric-associations based measures have no advantage for synonymy-based word choice tasks. It was also found that Wikipedia is not a good knowledge source for capturing verb-relations due to its focus on encyclopedic concepts specially nouns.

It is hoped that future research will build on the experiments and discussions presented in this thesis to explore new avenues using Wikipedia for finding deeper and semantically more meaningful associations in a wide range of application areas based on humans' estimates of word associations.

History

Copyright Date

2014-01-01

Date of Award

2014-01-01

Publisher

Te Herenga Waka—Victoria University of Wellington

Rights License

Author Retains Copyright

Degree Discipline

Computer Science

Degree Grantor

Te Herenga Waka—Victoria University of Wellington

Degree Level

Doctoral

Degree Name

Doctor of Philosophy

ANZSRC Socio-Economic Outcome code

890299 Computer Software and Services not elsewhere classified; 970108 Expanding Knowledhe in the Information and Computing Sciences

Victoria University of Wellington Item Type

Awarded Doctoral Thesis

Language

en_NZ

Victoria University of Wellington School

School of Engineering and Computer Science

Advisors

Gao, Xiaoying; Andreae, Peter

Usage metrics

Keywords

Wikipedia Semantic associations Knowledge sources School: School of Engineering and Computer Science 080107 Natural Language Processing 080704 Information Retrieval and Web Search 080705 Informetrics 080707 Organisation of Information and Knowledge Resources 890299 Computer Software and Services not elsewhere classified 970108 Expanding Knowledhe in the Information and Computing Sciences Degree Discipline: Computer Science Degree Level: Doctoral Degree Name: Doctor of Philosophy Natural Language Processing

Licence

Author Retains Copyright

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Exploiting Wikipedia Semantics for Computing Word Associations

History

Copyright Date

Date of Award

Publisher

Rights License

Degree Discipline

Degree Grantor

Degree Level

Degree Name

ANZSRC Socio-Economic Outcome code

Victoria University of Wellington Item Type

Language

Victoria University of Wellington School

Advisors

Usage metrics

Categories

Keywords

Licence

Exports