Please use this identifier to cite or link to this item:
http://hdl.handle.net/11701/44183
Full metadata record
DC field | Value | Language |
---|---|---|
dc.contributor.author | Gudkov, Vadim V. | - |
dc.contributor.author | Mitrenina, Olga V. | - |
dc.contributor.author | Sokolov, Evgenii G. | - |
dc.contributor.author | Koval, Angelina A. | - |
dc.date.accessioned | 2023-10-10T11:25:46Z | - |
dc.date.available | 2023-10-10T11:25:46Z | - |
dc.date.issued | 2023-06 | - |
dc.identifier.citation | Gudkov V. V., Mitrenina O. V., Sokolov E. G., Koval A. A. Language-based transfer learning approaches for part-of-speech tagging on Saint Petersburg Corpus of Hagiographic texts (SKAT). Vestnik of Saint Petersburg University. Language and Literature. 2023, 20 (2): 268–282. https://doi.org/10.21638/spbu09.2023.205 (In Russian) | en_GB |
dc.identifier.other | https://doi.org/10.21638/spbu09.2023.205 | - |
dc.identifier.uri | http://hdl.handle.net/11701/44183 | - |
dc.description.abstract | The article describes an experiment in training a part-of-speech tagger with artificial neural networks on the St. Petersburg Corpus of Hagiographic Texts (SKAT), which is being developed at the Department of Mathematical Linguistics of St. Petersburg State University. The corpus includes the texts of 23 manuscripts dating from the 15th–18th centuries, with about 190,000 word usages; four of the manuscripts were labelled manually. BiLSTM, distilled RuBERT-tiny2, and RuBERT models were used to train the POS tagger. All of them were trained on modern Russian corpora and then fine-tuned to label Old Russian texts using a technique called language transfer. To fine-tune the transformer-based language models, it was necessary to tokenize the texts using byte pair encoding and to map tokens from the original Russian-language tokenizer to the new one by index. The model was then fine-tuned for the token classification task. For fine-tuning, a tagged subcorpus of three hagiographic texts was used, comprising 35,603 tokens and 2,885 sentences. The experiment took into account only part-of-speech tags: the classification included seventeen tags, thirteen of which corresponded to parts of speech, while the remaining four marked punctuation. The quality of the models was evaluated with the standard F1 and Accuracy metrics. According to the automatic evaluation metrics, the RuBERT model showed the best result. Most errors were related either to incorrect generalization of linear position patterns or to the similarity of word forms in both the extreme left and extreme right positions. | en_GB |
dc.language.iso | ru | en_GB |
dc.publisher | St Petersburg State University | en_GB |
dc.relation.ispartofseries | Vestnik of St Petersburg University. Language and Literature;Volume 20; Issue 2 | - |
dc.subject | hagiography | en_GB |
dc.subject | corpus of Old Russian texts | en_GB |
dc.subject | neural network tagging | en_GB |
dc.subject | language-based transfer learning | en_GB |
dc.subject | part-of-speech tagging | en_GB |
dc.title | Language-based transfer learning approaches for part-of-speech tagging on Saint Petersburg Corpus of Hagiographic texts (SKAT) | en_GB |
dc.type | Article | en_GB |
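The token-mapping step described in the abstract (aligning word-level tags to byte-pair-encoded subtokens before fine-tuning for token classification) can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the function name and tag ids are hypothetical, and the `-100` ignore index follows common practice in transformer token-classification fine-tuning, where the loss function skips positions carrying that label.

```python
# Hypothetical sketch: when a BPE tokenizer splits a word into several
# pieces, only the first piece keeps the word's POS tag; continuation
# pieces and special tokens get the ignore index -100 so the loss
# function skips them during fine-tuning.

IGNORE_INDEX = -100

def align_labels(word_tags, word_ids):
    """word_tags: one tag id per word.
    word_ids: for each subtoken, the index of the word it came from
    (None for special tokens such as [CLS]/[SEP])."""
    labels = []
    previous = None
    for wid in word_ids:
        if wid is None:            # special token
            labels.append(IGNORE_INDEX)
        elif wid != previous:      # first subtoken of a word
            labels.append(word_tags[wid])
        else:                      # continuation subtoken
            labels.append(IGNORE_INDEX)
        previous = wid
    return labels

# Toy example: three words, the second split into two subtokens,
# with special tokens at both ends (as a tokenizer would report them).
word_ids = [None, 0, 1, 1, 2, None]
word_tags = [5, 3, 7]              # illustrative tag ids
print(align_labels(word_tags, word_ids))
# → [-100, 5, 3, -100, 7, -100]
```

Only positions with a real tag id then contribute to the cross-entropy loss of the token-classification head.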
Appears in collections: | Issue 2
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
05.pdf | | 742.78 kB | Adobe PDF | View/Open |
All items in the digital repository are protected by copyright; all rights reserved.