Native speakers of English know intuitively when to use articles ("a/an", "the" or zero-article), but defining when to use these common words for non-native speakers or machine translation systems is anything but straightforward. This thesis explores machine learning approaches to determining when to use an article for a given noun phrase, focusing on the effect of different genres and features on a model's performance. We start with a theoretical overview of what articles are and how they're used, followed by a summary of previous rule-based and machine learning approaches. We then evaluate a neural network model on six different genres of text (EU legislation, fiction, news, parallel web pages, subtitles and technical documentation) and find that the model performs best on genres with smaller vocabularies and more exact repetition. Four feature families are used (lexical, morphological, syntactic and discourse), combining features from previous rule-based and machine learning approaches. Using feature importance and feature ablation tests, we find that lexical and morphological features are the most salient, while syntactic and discourse features contribute little to a model's performance.