Methods of Automatic Russian Text Summarization
Abstract
Relevance of the research topic. Today we live in an era of "information explosion", in which the amount of information arriving daily exceeds a person's objective capacity to process it. This sharp growth in the volume of text data has confronted computational linguistics with the task of automatic text summarization, whose purpose is to extract the useful information from a text and compose a concise, fluent summary. Recently, models based on neural architectures have made notable advances in many areas of natural language processing. Although research in this field has been going on for more than half a century, the vast majority of work has focused on summarizing English texts. The question of applying existing neural summarization systems to Russian texts is therefore of particular relevance and calls for in-depth study.
The object of the research is methods of automatic text summarization.
The subject of the research is the application of neural network methods to the problem of automatic summarization of Russian texts.
The aim of the research is to study methods of automatic text summarization, to implement the neural network models SummaRuNNer and Pointer-Generator Network, to fine-tune the pre-trained language models ruBERT and mBART, and to apply all of these to the problem of Russian text summarization.
To achieve this aim, the following tasks were formulated: review the research on methods of automatic text summarization; study the architecture of the neural network models used for the task; build several systems for automatic summarization of Russian-language texts based on these models; evaluate the application of these systems to Russian-language texts; and identify errors in the summaries these methods produce and propose possible explanations for their origin.
In the course of solving these tasks, the following main conclusions were reached:
1. Extractive summarization models, both those trained from scratch (SummaRuNNer) and those built on pre-trained models (ruBERT), are able to identify the sentences that carry the main message of the original text. These models also have no difficulty observing the grammatical norms of Russian, since the approach involves no text generation as such (the first sketch after the conclusions illustrates this sentence-scoring setup). The drawbacks of the approach are the need for labelled source texts and the comparative redundancy of the summaries, since the extracted sentences are not modified in any way.
2. The summaries produced by abstractive models, in particular the Pointer-Generator Network (PGN), contain various grammatical errors. For example, the outputs of the PGN trained at the word level showed errors at the syntactic and lexical levels, while the summaries of the PGN trained at the subword level contained word-formation errors. In addition, PGN models tend to generate repeated chains of tokens, which we successfully resolved by implementing a coverage attention mechanism (see the second sketch below).
3. The abstractive summarization model based on the pre-trained mBART language model showed the best results of all the models we considered (the third sketch below shows a typical inference setup). These results can be explained by the transfer of features the system extracted during training on its original task, as well as by its large training vocabulary.
4. Possible causes of the extractive models' errors include reference-style source texts (where every sentence carries equal informational value), an insufficiently large training vocabulary, and the specificity of the training data. In turn, the probable causes of the abstractive models' errors are likewise an insufficient amount of training data, a small training vocabulary, and a comparatively simplified architecture dictated by limited computational resources.
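To make the extractive approach of conclusion 1 concrete, here is a minimal, hypothetical sketch of a SummaRuNNer-style sentence scorer; it is not the thesis code, and the dimensions and helper names are assumptions. Each sentence embedding passes through a document-level bidirectional GRU, and a linear layer scores it for inclusion in the summary (the actual SummaRuNNer additionally models salience, novelty, and position).

```python
import torch
import torch.nn as nn

class ExtractiveScorer(nn.Module):
    """Simplified SummaRuNNer-style sentence classifier (illustrative only).

    Each sentence arrives as a fixed-size embedding; a bidirectional GRU
    builds document-level context, and a linear layer scores each sentence
    for inclusion in the summary. Salience, novelty, and position terms of
    the original model are omitted for brevity.
    """

    def __init__(self, sent_dim: int = 300, hidden: int = 200):
        super().__init__()
        self.doc_rnn = nn.GRU(sent_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, sents: torch.Tensor) -> torch.Tensor:
        # sents: (batch, n_sentences, sent_dim)
        ctx, _ = self.doc_rnn(sents)           # (batch, n_sents, 2*hidden)
        logits = self.score(ctx).squeeze(-1)   # (batch, n_sents)
        return torch.sigmoid(logits)           # inclusion probabilities

# Usage: take the top-k sentences, in source order, as the summary.
model = ExtractiveScorer()
doc = torch.randn(1, 12, 300)                  # 12 sentence embeddings
probs = model(doc)
summary_idx = probs[0].topk(3).indices.sort().values
```

Because selection only copies whole sentences, grammaticality is inherited from the source, which matches conclusion 1's observation about grammatical norms.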
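The coverage mechanism named in conclusion 2 follows the formulation of See et al. (2017): a coverage vector accumulates the attention distributions of previous decoder steps, and a loss term sum_i min(a_i^t, c_i^t) penalizes re-attending to already-covered source tokens, which discourages repetition. A minimal illustrative sketch of that loss, not the thesis implementation:

```python
import torch

def coverage_loss(attn_steps: torch.Tensor) -> torch.Tensor:
    """Coverage loss of See et al. (2017): covloss_t = sum_i min(a_i^t, c_i^t).

    attn_steps: (T, src_len) attention distribution at each decoder step.
    The coverage vector c^t is the sum of attention over steps < t, so
    attending again to an already-covered source token is penalized.
    """
    coverage = torch.zeros_like(attn_steps[0])
    loss = 0.0
    for attn in attn_steps:                      # one decoder step at a time
        loss = loss + torch.minimum(attn, coverage).sum()
        coverage = coverage + attn               # update c^{t+1}
    return loss / attn_steps.shape[0]

# Repetitive attention incurs a higher loss than spread-out attention.
repetitive = torch.tensor([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])
spread     = torch.tensor([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]])
print(coverage_loss(repetitive), coverage_loss(spread))
```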
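Conclusion 3's best-performing system is a fine-tuned mBART. As a hedged sketch, this is how such a model can be loaded and run for Russian summarization with the Hugging Face transformers library; the checkpoint name, language code, and generation settings are assumptions, not the thesis configuration:

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

# Hypothetical setup: the base multilingual checkpoint; the thesis
# fine-tuned such a model on a Russian summarization corpus.
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer.src_lang = "ru_RU"                     # mark the source language

article = "Текст новости на русском языке ..."   # source document
inputs = tokenizer(article, return_tensors="pt",
                   truncation=True, max_length=1024)

# Force the decoder to generate Russian and decode the beam-search output.
summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    decoder_start_token_id=tokenizer.lang_code_to_id["ru_RU"],
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```

The large multilingual pre-training vocabulary noted in conclusion 3 is exactly what the shared tokenizer above provides.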
As for the practical applicability of the thesis research, the implemented algorithms, the texts they produced, the classification of errors, and the conclusions drawn can all be used for further research in this area.
Keywords
Automatic Text Summarization, Russian language, Recurrent Neural Networks, Transformer, SummaRuNNer, Pointer-Generator Network, Transfer Learning, ruBERT, mBART