relation: https://khub.utp.edu.my/scholars/12637/
title: Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset
creator: Baseer, F.
creator: Jaafar, J.
creator: Aziz, I.B.A.
creator: Habib, A.
description: Urdu is among the most widely used languages in the world for spoken and written communication. Because optimized, user-friendly support for the native Urdu script is lacking on many platforms, Urdu is mostly written in Romanized script in digital form. In this research, we developed a refined Urdu lexicon from the tokens with the highest frequency of occurrence in the data set, a raw corpus of colloquial Urdu written in Romanized script. The corpus was collected from volunteer participants who used the language for communication on the Internet and in text messaging. The raw corpus passes through a series of steps, including preprocessing, tokenization, and annotation, before the computationally intensive later steps. Edit distance and K-means clustering are used to identify candidate tokens and to select them for inclusion in the refined lexicon. We also identified the most commonly used tokens, candidate tokens, and other linguistic attributes in the collected data. Based on this analysis, we propose a computational model for developing a refined colloquial Romanized Urdu lexicon. © 2020 IEEE.
publisher: Institute of Electrical and Electronics Engineers Inc.
date: 2020
type: Conference or Workshop Item
type: PeerReviewed
identifier: Baseer, F. and Jaafar, J. and Aziz, I.B.A. and Habib, A. (2020) Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset. In: UNSPECIFIED.
relation: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85097536620&doi=10.1109%2fICCI51257.2020.9247814&partnerID=40&md5=1b1f615b9f333e079497762ef059e259
relation: 10.1109/ICCI51257.2020.9247814
identifier: 10.1109/ICCI51257.2020.9247814