10.4124/LANCASTER/THESIS/26
Hardie, Andrew
0000-0002-2952-2545
Linguistics and English Language
ESRC Centre for Corpus Approaches to Social Science
The computational analysis of morphosyntactic categories in Urdu
Lancaster University
2004
part-of-speech tagging
morphosyntactic tagging
Urdu
Unicode
rule-based tagging
disambiguation
EAGLES guidelines
tagset
lexicon
English
Text
2861074 B
477 pages
application/pdf
License unspecified
Urdu is a language of the Indo-Aryan family, widely spoken in India and
Pakistan, and an important minority language in Europe, North America, and
elsewhere. This thesis describes the development of a computer-based
system for part-of-speech tagging of Urdu texts, consisting of a tagset, a
set of tagging guidelines for manual tagging or post-editing, and the
tagger itself. The tagset is defined in accordance with a set of design
principles, derived from a survey of good practice in the field of tagset
design, including compliance with the EAGLES guidelines on morphosyntactic
annotation. These are shown to be extensible to languages, such as Urdu,
that are closely related to those languages for which the guidelines were
originally devised. The description of Urdu grammar given by Schmidt
(1999) is used as a model of the language for the purpose of tagset
design. Manual tagging is undertaken using this tagset, by which process a
set of tagging guidelines is created, and a set of manually tagged texts
to serve as training data is obtained. A rule-based methodology is used
here to perform tagging in Urdu. The justification for this choice is
discussed. A suite of programs that function together within the Unitag
architecture is described. In addition to a tokeniser, this system includes
an analyser (Urdutag) based on lexical look-up and word-form analysis, and
a disambiguator (Unirule) which removes contextually inappropriate tags
using a set of 274 rules. While the system's final performance is not
particularly impressive, this is largely due to a paucity of training data
leading to a small lexicon, rather than any substantial flaw in the
system.
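
The two-stage design described above (an analyser that proposes candidate
tags, followed by a disambiguator that removes contextually inappropriate
ones) can be illustrated with a minimal sketch. This is not the Unitag,
Urdutag, or Unirule code: the lexicon, suffix table, tag names, and the
single rule are hypothetical, and English toy words are used only to keep
the example readable.

```python
# Toy sketch of a rule-based tagging pipeline. All data below is
# hypothetical and is NOT taken from the thesis or its tagset.

# Hypothetical lexicon: word form -> set of candidate tags.
LEXICON = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"NOUN", "VERB"},
}

# Hypothetical word-form heuristics, tried in order for unknown words.
SUFFIX_TAGS = [
    ("ing", {"VERB", "NOUN"}),
    ("ly", {"ADV"}),
    ("s", {"NOUN", "VERB"}),
]


def analyse(token):
    """Stage 1: propose candidate tags by lexical look-up, then by suffix."""
    word = token.lower()
    if word in LEXICON:
        return set(LEXICON[word])
    for suffix, tags in SUFFIX_TAGS:
        if word.endswith(suffix):
            return set(tags)
    return {"NOUN"}  # assumed fallback for words with no evidence


def rule_no_verb_after_det(prev_tags, tags):
    """Hypothetical rule: directly after an unambiguous DET, drop verbal tags."""
    if prev_tags == {"DET"}:
        return tags - {"VERB", "AUX"}
    return tags


RULES = [rule_no_verb_after_det]


def disambiguate(candidates, rules):
    """Stage 2: apply each rule left to right, never removing the last tag."""
    result = []
    for tags in candidates:
        prev = result[-1] if result else None
        current = set(tags)
        for rule in rules:
            reduced = rule(prev, current)
            if reduced:  # keep the old set if a rule would empty it
                current = reduced
        result.append(current)
    return result


def tag(tokens):
    """Run the analyser and then the disambiguator over a token list."""
    return disambiguate([analyse(t) for t in tokens], RULES)
```

In this sketch a word may remain ambiguous after disambiguation if no rule
applies to it, and a rule is never allowed to remove a word's last remaining
tag. Both choices are assumptions of the sketch rather than documented
properties of the system described in the thesis.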