Word Sense Disambiguation Corpus for Kashmiri

Mir, Tawseef Ahmad; Lawaye, Aadil Ahmad

Please use this identifier to cite or link to this item: https://gnanaganga.inflibnet.ac.in:8443/jspui/handle/123456789/16900

Full metadata record

DC Field	Value	Language
dc.contributor.author	Mir, Tawseef Ahmad	-
dc.contributor.author	Lawaye, Aadil Ahmad	-
dc.date.accessioned	2024-12-12T09:38:21Z	-
dc.date.available	2024-12-12T09:38:21Z	-
dc.date.issued	2024	-
dc.identifier.issn	2977-0424	-
dc.identifier.uri	https://doi.org/10.1017/nlp.2024.31	-
dc.identifier.uri	https://gnanaganga.inflibnet.ac.in:8443/jspui/handle/123456789/16900	-
dc.description.abstract	Ambiguity is considered an inherent attribute of all natural languages. Word sense disambiguation (WSD) is the process of assigning the correct interpretation to an ambiguous word based on the context in which it occurs. Supervised approaches to WSD generally perform better than their counterparts. These approaches, however, require a sense-annotated corpus to carry out the disambiguation process. This paper presents the first-ever standard WSD dataset for the Kashmiri language. The raw corpus used to develop the sense-annotated dataset is collected from different resources and contains about 1 million tokens. From this raw corpus, a sense-annotated corpus is created for 124 commonly used ambiguous Kashmiri words. Kashmiri WordNet, an important lexical resource for the Kashmiri language, supplies the senses used in the annotation process. The resulting sense-tagged corpus is diverse in nature and contains 19,854 sentences. Based on this annotated corpus, the Lexical Sample WSD task for Kashmiri is carried out using different machine-learning algorithms (J48, IBk, Naive Bayes, Dl4jMlpClassifier, SVM). These models are trained on two feature representations: bag-of-words (BoW) and word embeddings obtained using the Word2Vec model. Performance is evaluated with standard measures, viz. accuracy, precision, recall, and F1-measure. The values of these measures varied with both the algorithm and the features used. With the BoW model, SVM reported better results than the other algorithms, whereas Dl4jMlpClassifier performed better with word embeddings.	en_US
dc.language.iso	en	en_US
dc.publisher	Natural Language Processing	en_US
dc.publisher	Cambridge Univ Press	en_US
dc.subject	Information Extraction	en_US
dc.subject	Machine Learning	en_US
dc.subject	Sense Annotation	en_US
dc.subject	Word Sense Disambiguation	en_US
dc.title	Word Sense Disambiguation Corpus for Kashmiri	en_US
dc.type	Article	en_US
Appears in Collections:	Journal Articles

Files in This Item:

File	Size	Format
word-sense-disambiguation-corpus-for-kashmiri.pdf	10.68 MB	Adobe PDF

Alliance University, Bengaluru

Institutional Repository