Please use this identifier to cite or link to this item: http://hdl.handle.net/11701/44183
Full metadata record
DC Field	Value	Language
dc.contributor.author	Gudkov, Vadim V.	-
dc.contributor.author	Mitrenina, Olga V.	-
dc.contributor.author	Sokolov, Evgenii G.	-
dc.contributor.author	Koval, Angelina A.	-
dc.date.accessioned	2023-10-10T11:25:46Z	-
dc.date.available	2023-10-10T11:25:46Z	-
dc.date.issued	2023-06	-
dc.identifier.citation	Gudkov V. V., Mitrenina O. V., Sokolov E. G., Koval A. A. Language-based transfer learning approaches for part-of-speech tagging on Saint Petersburg Corpus of Hagiographic texts (SKAT). Vestnik of Saint Petersburg University. Language and Literature. 2023, 20 (2): 268–282. https://doi.org/10.21638/spbu09.2023.205 (In Russian)	en_GB
dc.identifier.other	https://doi.org/10.21638/spbu09.2023.205	-
dc.identifier.uri	http://hdl.handle.net/11701/44183	-
dc.description.abstract	The article describes an experiment in training a part-of-speech tagger with artificial neural networks on the St. Petersburg Corpus of Hagiographic Texts (SKAT), which is being developed at the Department of Mathematical Linguistics of St. Petersburg State University. The corpus includes the texts of 23 manuscripts dating from the 15th–18th centuries, with about 190,000 word usages; four of the manuscripts were labelled manually. The bi-LSTM, distilled RuBERTtiny2 and RuBERT models were used to train the POS tagger. All of them were trained on modern Russian corpora and then fine-tuned to label Old Russian texts using a technique called language transfer. To fine-tune the transformer-based language models, it was necessary to tokenize the texts using byte pair encoding and to map tokens from the original Russian-language tokenizer to the new one based on indices. The model was then fine-tuned for the token classification task. For fine-tuning, a tagged subcorpus of three hagiographical texts was used, comprising 35,603 tokens and 2,885 sentences. The experiment took into account only part-of-speech tags: the classification included seventeen tags, thirteen of which corresponded to parts of speech, while the remaining four marked punctuation. The standard metrics F1 and Accuracy were used to evaluate the models. According to the automatic evaluation metrics, the RuBERT model showed the best result. Most of the errors were related to incorrect generalization of linear position patterns or to the similarity of word forms in both the extreme left and extreme right positions.	en_GB
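The abstract mentions mapping tokens from the original Russian-language tokenizer to the new one based on indices. A minimal sketch of that step, using toy vocabularies (the real SKAT vocabularies and tokenizers are assumptions here, not shown in the record): each index of a new BPE tokenizer trained on the hagiographic texts is mapped to the index the same subword has in the original tokenizer, falling back to the unknown-token index, so a model pretrained on modern Russian can consume the re-indexed input.

```python
def build_index_map(new_vocab: dict, orig_vocab: dict, unk_id: int) -> dict:
    """Map each index of the new tokenizer to the index the same token
    has in the original tokenizer, falling back to the UNK index."""
    return {new_id: orig_vocab.get(token, unk_id)
            for token, new_id in new_vocab.items()}

# Toy vocabularies (hypothetical): the original tokenizer knows modern
# Russian subwords; the new one was trained on hagiographic spelling.
orig_vocab = {"[UNK]": 0, "свят": 1, "##ой": 2, "жит": 3, "##ие": 4}
new_vocab = {"[UNK]": 0, "свят": 1, "##аго": 2, "жит": 3, "##ие": 4}

index_map = build_index_map(new_vocab, orig_vocab,
                            unk_id=orig_vocab["[UNK]"])

# A sentence tokenized with the new tokenizer is re-indexed for the model;
# "##аго" is unknown to the original vocabulary and maps to UNK.
new_ids = [1, 2, 3, 4]                 # свят ##аго жит ##ие
mapped = [index_map[i] for i in new_ids]
print(mapped)                          # [1, 0, 3, 4]
```

This is only an illustration of index-based vocabulary alignment; the article's actual mapping procedure may differ in detail.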
dc.language.iso	ru	en_GB
dc.publisher	St Petersburg State University	en_GB
dc.relation.ispartofseries	Vestnik of St Petersburg University. Language and Literature; Volume 20; Issue 2	-
dc.subject	hagiography	en_GB
dc.subject	corpus of Old Russian texts	en_GB
dc.subject	neural network tagging	en_GB
dc.subject	language-based transfer learning	en_GB
dc.subject	part-of-speech tagging	en_GB
dc.title	Language-based transfer learning approaches for part-of-speech tagging on Saint Petersburg Corpus of Hagiographic texts (SKAT)	en_GB
dc.type	Article	en_GB
Appears in Collections: Issue 2

Files in This Item:
File	Description	Size	Format
05.pdf		742,78 kB	Adobe PDF


All items in the electronic archive are protected by copyright, with all rights reserved.