eprintid: 2781
rev_number: 2
eprint_status: archive
userid: 1
dir: disk0/00/00/27/81
datestamp: 2023-11-09 15:51:01
lastmod: 2023-11-09 15:51:01
status_changed: 2023-11-09 15:44:15
type: conference_item
metadata_visibility: show
creators_name: Zamin, N.
creators_name: Oxley, A.
creators_name: Abu Bakar, Z.
creators_name: Farhan, S.A.
title: A statistical dictionary-based word alignment algorithm: An unsupervised approach
ispublished: pub
keywords: Automated process; bigram; Corpus linguistics; Dice coefficient; Labour-intensive; Malay language; Part of speech tagging; Part-of-speech tags; PoS tagging; Recall rate; Resource-Rich; Training data; Unsupervised approaches; Word alignment; Automation; Information science; Natural language processing systems; Technology, Research
note: Cited by 9; Conference of 2012 International Conference on Computer and Information Science, ICCIS 2012 - A Conference of World Engineering, Science and Technology Congress, ESTCON 2012; Conference Date: 12 June 2012 Through 14 June 2012; Conference Code: 93334
abstract: Malay is categorized as a resource-poor language. Thus, there is limited research on corpus linguistics for Malay. This paper discusses an automated process of applying part-of-speech (POS) tags to Malay words. Conventional tagging works well on static grammatical classes with little ambiguity, as performed in most research on resource-rich languages. However, the grammatical classes of Malay are dynamic, where adjectives can be verbs or adverbs and vice versa. This makes automatic POS tagging of Malay a chaotic and challenging process. There is no labelled data publicly available for Malay, while hand-crafted corpora are labour-intensive and time-consuming. Hence, this paper introduces an unsupervised technique to tag Malay terrorism texts as a case study. This is a solution to partially overcome the shortage of annotated resources for Malay and the labour-intensity of a hand-tagged corpus. This approach does not require any labelled training data but involves translation of texts into a resource-rich language, i.e. English, and a dictionary look-up. After comparing the results with human annotators, it is found that the unsupervised technique reaches 76% precision and a 67% recall rate. © 2012 IEEE.
date: 2012
official_url: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84867918947&doi=10.1109%2fICCISci.2012.6297278&partnerID=40&md5=279c0ce91b9138812b30197910d1567d
id_number: 10.1109/ICCISci.2012.6297278
full_text_status: none
publication: 2012 International Conference on Computer and Information Science, ICCIS 2012 - A Conference of World Engineering, Science and Technology Congress, ESTCON 2012 - Conference Proceedings
volume: 1
place_of_pub: Kuala Lumpur
pagerange: 396-402
refereed: TRUE
isbn: 9781467319386
citation:   Zamin, N. and Oxley, A. and Abu Bakar, Z. and Farhan, S.A.  (2012) A statistical dictionary-based word alignment algorithm: An unsupervised approach.  In: UNSPECIFIED.