A morphological-syntactical analysis approach for Arabic textual tagging
Part-of-Speech (POS) tagging is the process of labeling or classifying each word in written text with its grammatical category or part-of-speech, i.e. noun, verb, preposition, adjective, etc. It is the most common disambiguation process in the field of Natural Language Processing (NLP). POS tagging systems are often preprocessors in many NLP applications. The Arabic language has a valuable and an important feature, called diacritics, which are marks placed over and below the letters of the word. An Arabic text is partiallyvocalisedl when the diacritical mark is assigned to one or maximum two letters in the word. Diacritics in Arabic texts are extremely important especially at the end of the word. They help determining not only the correct POS tag for each word in the sentence, but also in providing full information regarding the inflectional features, such as tense, number, gender, etc. for the sentence words. They add semantic information to words which helps with resolving ambiguity in the meaning of words. Furthermore, diacritics ascribe grammatical functions to the words, differentiating the word from other words, and determining the syntactic position of the word in the sentence. 1. Vocalisation (also referred as diacritisation or vowelisation). This thesis presents a rule-based Part-of-Speech tagging system called AMT - short for Arabic Morphosyntactic Tagger. The main function of the AMT system is to assign the correct tag to each word in an untagged raw partially-vocalised Arabic corpus, and to produce a POS tagged corpus without using a manually tagged or untagged lexicon (dictionary) for training. Two different techniques were used in this work, the pattem-based technique and the lexical and contextual technique. The rules in the pattem-based technique technique are based on the pattern of the testing word. A novel algorithm, Pattern-Matching Algorithm (PMA), has been designed and introduced in this work. The aim of this algorithm is to match the testing word with its correct pattern in pattern lexicon. The lexical and contextual technique on the other hand is used to assist the pattembased technique technique to assign the correct tag to those words not have a pattern to follow. The rules in the lexical and contextual technique are based on the character(s), the last diacritical mark, the word itself, and the tags of the surrounding words. The importance of utilizing the diacritic feature of the Arabic language to reduce the lexical ambiguity in POS tagging has been addressed. In addition, a new Arabic tag set and a new partially-vocalised Arabic corpus to test AMT have been compiled and presented in this work. The AMT system has achieved an average accuracy of 91 %.
- PhD