|dc.description.abstract||Part-of-Speech (POS) tagging is the process of labeling or classifying each word in
written text with its grammatical category or part-of-speech, i.e. noun, verb, preposition,
adjective, etc. It is the most common disambiguation process in the field of
Natural Language Processing (NLP). POS tagging systems are often preprocessors in
many NLP applications.
The Arabic language has a valuable and important feature, called diacritics, which
are marks placed above or below the letters of a word. An Arabic text is partially-vocalised¹
when a diacritical mark is assigned to one or at most two letters in the word.
Diacritics in Arabic texts are extremely important, especially at the end of the word.
They help not only in determining the correct POS tag for each word in the sentence,
but also in providing full information about inflectional features, such as tense,
number and gender, for the words of the sentence. They add semantic information to words,
which helps resolve ambiguity in their meaning. Furthermore, diacritics
ascribe grammatical functions to words, differentiating each word from other words
and determining its syntactic position in the sentence.
1. Vocalisation (also referred to as diacritisation or vowelisation).
This thesis presents a rule-based Part-of-Speech tagging system called AMT, short
for Arabic Morphosyntactic Tagger. The main function of the AMT system is to assign
the correct tag to each word in an untagged, raw, partially-vocalised Arabic corpus,
and to produce a POS-tagged corpus without using a manually tagged or untagged
lexicon (dictionary) for training. Two different techniques were used in this work: the
pattern-based technique and the lexical and contextual technique.
The rules in the pattern-based technique are based on the pattern of the
word being tested. A novel algorithm, the Pattern-Matching Algorithm (PMA), has been designed
and introduced in this work. The aim of this algorithm is to match the tested
word with its correct pattern in the pattern lexicon.
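
The abstract does not give the details of the PMA, so the following is only a minimal sketch of the general idea of matching a word against a pattern lexicon, not the thesis's algorithm. It assumes patterns are written on the conventional placeholder root letters ف ع ل; the lexicon entries, the tag names and the functions `strip_diacritics`, `matches` and `match_pattern` are all hypothetical.

```python
import unicodedata

# Hypothetical pattern lexicon (assumption, not the thesis's lexicon or tag set):
# templates written on the placeholder root letters, each with an illustrative tag.
PATTERN_LEXICON = {
    "فاعل": "NOUN",   # active participle pattern, e.g. كاتب
    "مفعول": "NOUN",  # passive participle pattern, e.g. مكتوب
    "يفعل": "VERB",   # imperfect verb pattern, e.g. يكتب
}
ROOT_SLOTS = set("فعل")  # letters that stand for any root consonant


def strip_diacritics(word):
    """Drop combining marks (Unicode category Mn), which include Arabic diacritics."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def matches(word, pattern):
    """A word fits a pattern if lengths agree and every non-slot letter is identical."""
    if len(word) != len(pattern):
        return False
    return all(p in ROOT_SLOTS or p == w for w, p in zip(word, pattern))


def match_pattern(word):
    """Return (pattern, tag) for the first pattern the word fits, or None."""
    bare = strip_diacritics(word)
    for pattern, tag in PATTERN_LEXICON.items():
        if matches(bare, pattern):
            return pattern, tag
    return None
```

This sketch ignores diacritics during matching and treats every ف, ع or ل in a pattern as a free slot, so it cannot express patterns in which one of those letters is fixed, and it returns the first fit rather than ranking competing patterns.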
The lexical and contextual technique, on the other hand, is used to assist the pattern-based
technique in assigning the correct tag to words that do not have a pattern to
follow. The rules in the lexical and contextual technique are based on the character(s),
the last diacritical mark, the word itself, and the tags of the surrounding words.
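
As a rough illustration of the four kinds of evidence named above (characters, last diacritical mark, the word itself, and neighbouring tags), the sketch below applies one toy rule of each kind. The rules, the tag names `PREP`, `NOUN` and `UNKNOWN`, and the functions `tag_word` and `tag_sentence` are assumptions made for this example; they are not AMT's rule set or tag set.

```python
# Tanwin (nunation) marks: fathatan, dammatan, kasratan.
TANWIN = {"\u064b", "\u064c", "\u064d"}

# Small illustrative closed-class list, written without diacritics (assumption).
PREPOSITIONS = {"في", "من", "على", "إلى", "عن"}


def tag_word(word, prev_tag=None):
    """Apply toy rules in a fixed order and return an illustrative tag."""
    if word in PREPOSITIONS:             # rule on the word itself
        return "PREP"
    if word.startswith("ال"):            # rule on leading characters (rough heuristic:
        return "NOUN"                    # the definite article usually marks a nominal)
    if word and word[-1] in TANWIN:      # rule on the last diacritical mark
        return "NOUN"
    if prev_tag == "PREP":               # contextual rule on the preceding tag
        return "NOUN"
    return "UNKNOWN"


def tag_sentence(words):
    """Tag words left to right, feeding each rule the previous word's tag."""
    tags = []
    for word in words:
        tags.append(tag_word(word, tags[-1] if tags else None))
    return tags
```

For instance, `tag_sentence(["في", "بيت"])` tags the second word from context alone, because it follows a preposition. A real rule set would need many more rules and an explicit order of precedence among them.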
The importance of utilizing the diacritic feature of the Arabic language to reduce the
lexical ambiguity in POS tagging has been addressed. In addition, a new Arabic tag
set and a new partially-vocalised Arabic corpus to test AMT have been compiled and
presented in this work. The AMT system achieved an average accuracy of 91%.||en