%0 Conference Paper
%A Baseer, F.
%A Jaafar, J.
%A Aziz, I.B.A.
%A Habib, A.
%D 2020
%F scholars:12637
%I Institute of Electrical and Electronics Engineers Inc.
%K Computation theory; Computational methods; Intelligent computing, Computational model; Edit distance; K-means clustering techniques; Potential selection; Tokenization; Urdu lexicon; User friendly; Written communications, K-means clustering
%P 57-62
%R 10.1109/ICCI51257.2020.9247814
%T Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset
%U https://khub.utp.edu.my/scholars/12637/
%X Urdu is among the most widely used languages in the world for verbal and written communication. Due to the lack of optimized and user-friendly native Urdu-script support on various platforms, it is mostly written in Romanized script in soft form. In our research, we have developed a refined Urdu lexicon using the tokens with the highest frequency of occurrence in the data set. This data set is a raw corpus of colloquial Urdu written in Romanized script. The corpus was collected from volunteer participants who used this language as a mode of communication on the Internet and in text messaging. The raw corpus is passed through a series of steps, namely Preprocessing, Tokenization and Annotation, before being passed to the computationally intensive subsequent steps. Edit Distance and K-means Clustering techniques are used to identify candidate tokens and to select them for potential inclusion in the refined lexicon. We have also identified the most commonly used tokens, candidate tokens and other lingual attributes from the data collected. Based on this analysis, we have proposed a computational model for refined colloquial Romanized Urdu lexicon development. © 2020 IEEE.
%Z cited By 0; Conference of 2020 International Conference on Computational Intelligence, ICCI 2020 ; Conference Date: 8 October 2020 Through 9 October 2020; Conference Code:164916