Automatic detection of hidden communities in the texts of the Russian social network corpus

Мамаев Иван Дмитриевич; Mamaev Ivan Dmitrievic

Please use this identifier to cite or link to this item: http://hdl.handle.net/11701/31089

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Лукашевич Наталья Валентиновна	ru_RU
dc.contributor.advisor	Lukasevic Natala Valentinovna	en_GB
dc.contributor.author	Мамаев Иван Дмитриевич	ru_RU
dc.contributor.author	Mamaev Ivan Dmitrievic	en_GB
dc.contributor.editor	Митрофанова Ольга Александровна	ru_RU
dc.contributor.editor	Mitrofanova Olga Aleksandrovna	en_GB
dc.date.accessioned	2021-07-31T18:39:33Z	-
dc.date.available	2021-07-31T18:39:33Z	-
dc.date.issued	2021
dc.identifier.other	079541	en_GB
dc.identifier.uri	http://hdl.handle.net/11701/31089	-
dc.description.abstract	Актуальность работы обусловлена необходимостью сочетания методов лингвистического и количественного анализа данных для построения семантической сети пользователей на основе тематической близости их постов. Объект исследования – скрытые сообщества, выявленные с помощью алгоритмов тематического моделирования среди пользователей русскоязычного сегмента социальной сети ВКонтакте. Предмет исследования – количественные и качественные параметры скрытых сообществ, которые позволяют судить об актуальном состоянии групп. Цель исследования состоит в том, чтобы выявить скрытые связи между пользовательскими постами социальных сетей, основанные на общности лингвистических параметров текстов и их тематики. В перечень решаемых задач были включены такие задачи, как составление корпуса и его предобработка, построение автор-тематических моделей, разработка процедуры назначения меток тем и построение итогового графа скрытых сообществ. Теоретической базой исследования послужили работы отечественных и зарубежных лингвистов, а именно работы В.И. Карасика, А.А. Матусевич, О.Н. Морозовой по исследованию лингвистических особенностей текстов в Интернете, работы В.Ф. Хорошевского, П.А. Мейлахса, К. Хе, С. Фортунато по выявлению скрытых сообществ, работа О.А. Митрофановой, А.В. Крюковой, А.С. Смоляровой, К. Мей по тематическому моделированию. Материалом исследования послужил корпус из 25768 русскоязычных постов 714 пользователей социальной сети ВКонтакте объёмом 8679402 токенов. Работа состоит из четырёх глав, введения, заключения, списка использованной литературы, списка электронных ресурсов и одного приложения. Основные положения, выносимые на защиту. 1. Алгоритмы кластерного анализа и графовые методы выделяют рассредоточенные скрытые сообщества, что объясняется неполноценным учётом семантики исследуемых текстовых коллекций. 2. Отличительной особенностью текстов постов является использование ненормативных правил написания, что приводит к трудностям при автоматической обработке текстов, поэтому алгоритмы токенизации и нормализации приходится постоянно дорабатывать. 3. Учёт тематической составляющей при выявлении скрытых сообществ позволяет включить пользователя в несколько скрытых сообществ. Полученные результаты свидетельствуют о пригодности алгоритма для выявления скрытых сообществ, в заключении приводятся основные положения для последующих экспериментов. Работа была апробирована на конференциях AINL 2020 и TEL 2020, основные результаты представлены в следующих статьях (Scopus). 1. Mamaev I. Automatic Detection of Hidden Communities in the Texts of Russian Social Network Corpus / I. Mamaev, O. Mitrofanova // Artificial Intelligence and Natural Language. AINL 2020. Communications in Computer and Information Science. Vol. 1292. Springer, Cham, 2020. P. 17–33. 2. Mamaev I. Hidden Communities in the Russian Social Network Corpus: a Comparative Study of Detection Methods / I. Mamaev, O. Mitrofanova // Proceedings of the Computational Models in Language and Speech Workshop (CMLS 2020) co-located with 16th International Conference on Computational and Cognitive Linguistics (TEL 2020). 2020. P. 69–78.	ru_RU
dc.description.abstract	The relevance of the research stems from the need to combine linguistic and quantitative methods of data analysis in order to build a semantic network of users based on the topical similarity of their posts. The object of the research is hidden communities detected by topic modeling algorithms among users of the Russian-language segment of the VKontakte social network. The subject of the research is the quantitative and qualitative parameters of hidden communities, which make it possible to assess the current state of the groups. The aim of the study is to detect hidden connections among users' posts on social networks based on the similarity of the linguistic parameters of the texts and their topics. The tasks addressed include collecting and preprocessing a corpus, building author-topic models, developing a procedure for topic labelling, and building the resulting graph of hidden communities. The theoretical basis of the thesis is the work of Russian and foreign linguists, namely V.I. Karasik, A.A. Matusevich, and O.N. Morozova on the linguistic features of Internet texts; V.F. Khoroshevsky, P.A. Meilakhs, K. He, and S. Fortunato on detecting hidden communities; and O.A. Mitrofanova, A.V. Kriukova, A.S. Smolyarova, and Q. Mei on topic modeling. The research material is a corpus of 25768 Russian-language posts by 714 users of the VKontakte social network, totalling 8679402 tokens. The thesis consists of four chapters, an introduction, a conclusion, a bibliography, a list of electronic resources, and one appendix. The main statements submitted for defense: 1. Cluster analysis algorithms and graph methods yield dispersed hidden communities, which is explained by their inadequate account of the semantics of the text collections under study. 2. A distinctive feature of post texts is non-standard spelling, which complicates automatic text processing, so tokenization and normalization algorithms have to be refined continually. 3. Taking the topical component into account when detecting hidden communities makes it possible to assign a user to several hidden communities. The results obtained indicate that the algorithm is suitable for detecting hidden communities; the conclusion outlines the main points for subsequent experiments. The thesis was presented at the AINL 2020 and TEL 2020 conferences, and the main results are described in the following papers (indexed in Scopus). 1. Mamaev I. Automatic Detection of Hidden Communities in the Texts of Russian Social Network Corpus / I. Mamaev, O. Mitrofanova // Artificial Intelligence and Natural Language. AINL 2020. Communications in Computer and Information Science. Vol. 1292. Springer, Cham, 2020. P. 17–33. 2. Mamaev I. Hidden Communities in the Russian Social Network Corpus: a Comparative Study of Detection Methods / I. Mamaev, O. Mitrofanova // Proceedings of the Computational Models in Language and Speech Workshop (CMLS 2020) co-located with 16th International Conference on Computational and Cognitive Linguistics (TEL 2020). 2020. P. 69–78.	en_GB
dc.language.iso	ru
dc.subject	корпусная лингвистика	ru_RU
dc.subject	автор-тематическое моделирование	ru_RU
dc.subject	скрытые сообщества	ru_RU
dc.subject	социальные сети	ru_RU
dc.subject	автоматическое назначение меток тем	ru_RU
dc.subject	corpus linguistics	en_GB
dc.subject	author-topic modeling	en_GB
dc.subject	hidden communities	en_GB
dc.subject	social networks	en_GB
dc.subject	automatic topic labeling	en_GB
dc.title	Automatic detection of hidden communities in the texts of the Russian social network corpus	en_GB
dc.title.alternative	Автоматическое выявление скрытых сообществ в текстах русскоязычного корпуса социальных сетей	ru_RU
Appears in Collections:	MASTER'S STUDIES

Files in This Item:

File	Description	Size	Format
VKR_Mamaev_finalnaa.docx	Article	2.55 MB	Microsoft Word XML
reviewSV_Otzyv_I_D_Mamaev.pdf	ReviewSV	246.3 kB	Adobe PDF

All items in the electronic resource archive are protected by copyright, with all rights reserved.

Open Access Archive of Saint Petersburg State University