Chapter 1. Graphematical analysis
Publisher
St Petersburg State University
Abstract
Graphematical analysis is the first stage of text processing. It is preceded, however, by basic text structuring, which identifies paragraphs and their
types, e.g. title, subtitle, author name(s), chapter and section titles, footnotes, endnotes,
figures, appendices, and epigraphs. Graphematical analysis proper begins only after that. Its
aim is to decompose the flow of letter and non-letter graphemes into character strings
such as individual words, abbreviations, numbers, and hybrid strings (e.g. mathematical
formulae). The procedure is an iterative assembly of units: individual
characters are combined into so-called atoms, atoms into tokens (roughly equivalent to word occurrences), tokens into sentence parts, and sentence parts into a whole sentence. At every stage, each unit is assigned
its type. Assembly relies on rules based solely on a thorough structural analysis of
context. No formal models or statistical methods are applied; this is a central principle of the linguistic analyzer, inherent in all its algorithms. At this stage, complications
arise primarily from the ambiguity of punctuation marks, which is discussed at length
throughout the chapter.
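
The bottom-up assembly described in the abstract can be illustrated with a minimal Python sketch. It is not the analyzer's actual implementation: the names `Unit`, `to_atoms`, `to_tokens`, `to_sentences` and `period_belongs_to_word`, the type labels, and the three contextual rules for the period are all assumptions made for illustration, and the intermediate "sentence parts" stage is omitted for brevity. The sketch only shows how typed units might be assembled by context rules alone, without statistical methods, and how an ambiguous period (decimal point, abbreviation mark, or sentence end) could be resolved from its neighbours.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    text: str
    kind: str  # hypothetical type label assigned at each stage


def char_class(ch):
    # Hypothetical character classes; the real analyzer's inventory may differ.
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "punct"


def to_atoms(text):
    """Stage 1: merge adjacent characters of the same class into atoms.
    Each punctuation character stays a separate atom."""
    atoms = []
    for ch in text:
        kind = char_class(ch)
        if atoms and atoms[-1].kind == kind and kind != "punct":
            atoms[-1] = Unit(atoms[-1].text + ch, kind)
        else:
            atoms.append(Unit(ch, kind))
    return atoms


def period_belongs_to_word(atoms, p):
    """Illustrative context rule: the period at index p is part of an
    abbreviation if a letter follows immediately, or if a lowercase
    word follows after whitespace."""
    after = atoms[p + 1] if p + 1 < len(atoms) else None
    if after is None:
        return False
    if after.kind == "letter":
        return True
    if after.kind == "space":
        nxt = atoms[p + 2] if p + 2 < len(atoms) else None
        return nxt is not None and nxt.kind == "letter" and nxt.text[0].islower()
    return False


def to_tokens(atoms):
    """Stage 2: assemble atoms into typed tokens using context only."""
    tokens, i, n = [], 0, len(atoms)
    while i < n:
        a = atoms[i]
        if a.kind == "space":
            i += 1
        elif (a.kind == "digit" and i + 2 < n
              and atoms[i + 1].text == "." and atoms[i + 2].kind == "digit"):
            # digit "." digit: the period is a decimal point
            tokens.append(Unit(a.text + "." + atoms[i + 2].text, "number"))
            i += 3
        elif a.kind == "letter":
            text, kind, j = a.text, "word", i + 1
            while j < n and atoms[j].text == "." and period_belongs_to_word(atoms, j):
                text, kind, j = text + ".", "abbreviation", j + 1
                if j < n and atoms[j].kind == "letter":
                    text, j = text + atoms[j].text, j + 1
            tokens.append(Unit(text, kind))
            i = j
        else:
            kind = "number" if a.kind == "digit" else "punctuation"
            tokens.append(Unit(a.text, kind))
            i += 1
    return tokens


def to_sentences(tokens):
    """Stage 3: close a sentence at each remaining terminal mark."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.kind == "punctuation" and tok.text in ".!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = to_tokens(to_atoms("It costs 3.5 rubles, e.g. today. Done."))
sentences = to_sentences(tokens)
```

In this toy input the period occurs in three roles: inside `3.5` it is absorbed into a number, inside `e.g.` it is absorbed into an abbreviation, and after `today` and `Done` it remains a separate punctuation token that closes a sentence. Real text requires far more rules than these three, which is why punctuation ambiguity is the main source of complications at this stage.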