Chapter 1. Graphematical analysis

Publisher

St Petersburg State University

Abstract

Graphematical analysis is the first stage of text processing. It is preceded by basic text structuring, which identifies paragraphs and their types, e.g. title, subtitle, author name(s), chapter and section titles, footnotes, endnotes, figures, appendices, and epigraphs. Graphematical analysis proper then begins. Its aim is to decompose the flow of letter and non-letter graphemes into character strings such as individual words, abbreviations, numbers, and hybrid strings (e.g. mathematical formulae). The procedure is an iterative process of unit assembly: individual characters are combined into so-called atoms, atoms into tokens (roughly equivalent to word occurrences), tokens into sentence parts, and these finally into a whole sentence. At every stage, each unit is assigned a type. Assembly relies on rules based solely on a thorough structural analysis of context. No formal models or statistical methods are applied; this is a central principle of the linguistic analyzer, inherent in all its algorithms. At this stage, complications arise primarily from the ambiguity of punctuation marks, which is discussed at length throughout the chapter.
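
The staged assembly described in the abstract can be illustrated with a minimal Python sketch. It is not the analyzer's actual implementation: the function names, the type labels (LETTER, WORD, ABBREVIATION, and so on) and the two context rules for the period are assumptions made for illustration, and the "sentence parts" stage is omitted for brevity. The sketch only shows the general idea of typed units being assembled bottom-up by purely structural rules, with no statistics involved.

```python
def char_type(ch):
    """Assign a (hypothetical) type to a single character."""
    if ch.isalpha():
        return "LETTER"
    if ch.isdigit():
        return "DIGIT"
    if ch.isspace():
        return "SPACE"
    return "PUNCT"


def atoms(text):
    """Stage 1: merge runs of same-type characters into typed atoms."""
    result = []
    for ch in text:
        kind = char_type(ch)
        # Letters, digits and spaces form runs; every punctuation mark
        # stays a separate atom so that later rules can inspect it.
        if result and kind != "PUNCT" and result[-1][0] == kind:
            result[-1] = (kind, result[-1][1] + ch)
        else:
            result.append((kind, ch))
    return result


def period_is_internal(atom_list, k):
    """Context rule (an assumption): the period at index k does not end a
    sentence if a letter follows it directly, or a lowercase word follows
    after a space."""
    n = len(atom_list)
    if k >= n or atom_list[k] != ("PUNCT", ".") or k + 1 >= n:
        return False
    next_kind, _ = atom_list[k + 1]
    if next_kind == "LETTER":
        return True
    if next_kind == "SPACE" and k + 2 < n:
        kind2, text2 = atom_list[k + 2]
        return kind2 == "LETTER" and text2[0].islower()
    return False


def tokens(atom_list):
    """Stage 2: assemble atoms into typed tokens using context rules only."""
    out, i, n = [], 0, len(atom_list)
    while i < n:
        kind, text = atom_list[i]
        if kind == "SPACE":
            i += 1
        elif (kind == "DIGIT" and i + 2 < n
              and atom_list[i + 1] == ("PUNCT", ".")
              and atom_list[i + 2][0] == "DIGIT"):
            # DIGIT . DIGIT is read as one decimal number.
            out.append(("NUMBER", text + "." + atom_list[i + 2][1]))
            i += 3
        elif kind == "LETTER" and period_is_internal(atom_list, i + 1):
            # Chain of LETTER + internal period, e.g. "e." + "g." -> "e.g."
            abbr = ""
            while (i < n and atom_list[i][0] == "LETTER"
                   and period_is_internal(atom_list, i + 1)):
                abbr += atom_list[i][1] + "."
                i += 2
            out.append(("ABBREVIATION", abbr))
        else:
            label = {"LETTER": "WORD", "DIGIT": "NUMBER", "PUNCT": "PUNCT"}[kind]
            out.append((label, text))
            i += 1
    return out


def sentences(token_list):
    """Stage 3: close a sentence at every remaining terminal mark."""
    result, current = [], []
    for tok in token_list:
        current.append(tok)
        if tok[0] == "PUNCT" and tok[1] in ".!?":
            result.append(current)
            current = []
    if current:
        result.append(current)
    return result
```

Applied to the string "Units, e.g. words, cost 3.5 here. Next one.", this sketch keeps "e.g." and "3.5" as single tokens and splits the text into two sentences, because only the periods after "here" and "one" survive as sentence-final punctuation. Real text needs far more rules than these two, which is the ambiguity of punctuation marks that the abstract refers to.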
