Please use this identifier to cite or link to this item: http://hdl.handle.net/11701/44183
Full metadata record
DC Field	Value	Language
dc.contributor.author	Gudkov, Vadim V.	-
dc.contributor.author	Mitrenina, Olga V.	-
dc.contributor.author	Sokolov, Evgenii G.	-
dc.contributor.author	Koval, Angelina A.	-
dc.date.accessioned	2023-10-10T11:25:46Z	-
dc.date.available	2023-10-10T11:25:46Z	-
dc.date.issued	2023-06	-
dc.identifier.citation	Gudkov V. V., Mitrenina O. V., Sokolov E. G., Koval A. A. Language-based transfer learning approaches for part-of-speech tagging on Saint Petersburg Corpus of Hagiographic texts (SKAT). Vestnik of Saint Petersburg University. Language and Literature. 2023, 20 (2): 268–282. https://doi.org/10.21638/spbu09.2023.205 (In Russian)	en_GB
dc.identifier.other	https://doi.org/10.21638/spbu09.2023.205	-
dc.identifier.uri	http://hdl.handle.net/11701/44183	-
dc.description.abstract	The article describes an experiment in training a part-of-speech tagger with artificial neural networks on the St. Petersburg Corpus of Hagiographic Texts (SKAT), which is being developed at the Department of Mathematical Linguistics of St. Petersburg State University. The corpus includes the texts of 23 manuscripts dating from the 15th–18th centuries, with about 190,000 word usages; four of the manuscripts were labelled manually. The bi-LSTM, distilled RuBERTtiny2 and RuBERT models were used to train the POS tagger. All of them were trained on modern Russian corpora and then fine-tuned to label Old Russian texts using a technique called language transfer. To fine-tune the transformer-based language models, it was necessary to tokenize the texts using byte pair encoding and to map tokens from the original Russian-language tokenizer to the new one based on indices. The model was then fine-tuned for the token classification task. For fine-tuning, a tagged subcorpus of three hagiographical texts was used, comprising 35,603 tokens and 2,885 sentences. The experiment took into account only part-of-speech tags: the classification included seventeen tags, thirteen of which corresponded to parts of speech, while the remaining four marked punctuation. The standard metrics F1 and Accuracy were used to evaluate the models. According to the automatic evaluation metrics, the RuBERT model showed the best result. Most of the errors were related to incorrect generalization of linear position patterns or to the similarity of word forms in both the extreme left and extreme right positions.	en_GB
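The abstract mentions mapping tokens from the original Russian-language tokenizer to the new one based on indices. A minimal sketch of that step, using toy vocabularies (the real SKAT vocabularies and tokenizers are assumptions here, not shown in the record): each index of a new BPE tokenizer trained on the hagiographic texts is mapped to the index the same subword has in the original tokenizer, falling back to the unknown-token index, so a model pretrained on modern Russian can consume the re-indexed input.

```python
def build_index_map(new_vocab: dict, orig_vocab: dict, unk_id: int) -> dict:
    """Map each index of the new tokenizer to the index the same token
    has in the original tokenizer, falling back to the UNK index."""
    return {new_id: orig_vocab.get(token, unk_id)
            for token, new_id in new_vocab.items()}

# Toy vocabularies (hypothetical): the original tokenizer knows modern
# Russian subwords; the new one was trained on hagiographic spelling.
orig_vocab = {"[UNK]": 0, "свят": 1, "##ой": 2, "жит": 3, "##ие": 4}
new_vocab = {"[UNK]": 0, "свят": 1, "##аго": 2, "жит": 3, "##ие": 4}

index_map = build_index_map(new_vocab, orig_vocab,
                            unk_id=orig_vocab["[UNK]"])

# A sentence tokenized with the new tokenizer is re-indexed for the model;
# "##аго" is unknown to the original vocabulary and maps to UNK.
new_ids = [1, 2, 3, 4]                 # свят ##аго жит ##ие
mapped = [index_map[i] for i in new_ids]
print(mapped)                          # [1, 0, 3, 4]
```

This is only an illustration of index-based vocabulary alignment; the article's actual mapping procedure may differ in detail.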
dc.language.iso	ru	en_GB
dc.publisher	St Petersburg State University	en_GB
dc.relation.ispartofseries	Vestnik of St Petersburg University. Language and Literature; Volume 20; Issue 2	-
dc.subject	hagiography	en_GB
dc.subject	corpus of Old Russian texts	en_GB
dc.subject	neural network tagging	en_GB
dc.subject	language-based transfer learning	en_GB
dc.subject	part-of-speech tagging	en_GB
dc.title	Language-based transfer learning approaches for part-of-speech tagging on Saint Petersburg Corpus of Hagiographic texts (SKAT)	en_GB
dc.type	Article	en_GB
Appears in Collections: Issue 2

Files in This Item:
File	Description	Size	Format
05.pdf		742,78 kB	Adobe PDF


All items in the electronic archive are protected by copyright, with all rights reserved.