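Professor Lü Xiaoling (吕晓玲) and Associate Professor Wang Feifei (王菲菲) of our school have published a paper in Data Mining and Knowledge Discovery. The study addresses topic mining in dynamic text data and proposes a new model, Topic-CD, which combines traditional topic modeling with change-point detection. The model can be used to detect change points in text topics, helping users better understand how topics in dynamic text change over time.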
Paper Title
Topic Change-point Detection Using a Mixed Bayesian Model
Abstract
Dynamic text documents, including news articles, user reviews, and blogs, are now commonly encountered in many fields. Accordingly, the topics underlying text streams also change over time. To grasp the topic changes in the increasing accumulation of text documents, there is a great need to develop automatic text analysis models to find the key changes in topics. To this end, this study proposes a topic change-point detection (Topic-CD) model. Different from previous studies, we define the change point of topics from the perspective of hyperparameters associated with topic-word distributions. This allows the model to detect change points underlying the whole topic set. Under this definition, the topic modeling and change point detection are combined in a unified framework and then performed simultaneously using a Markov chain Monte Carlo algorithm. In addition, the Topic-CD model is free from setting the number of change points in advance, which makes it more convenient for practical use. We investigate the performance of the Topic-CD model numerically using synthetic data and three real datasets. The results show that the Topic-CD model identifies the change points in topics well when compared with several state-of-the-art methods.
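About the Authors
Lü Xiaoling (吕晓玲) is a professor at the School of Statistics, Renmin University of China, and a researcher at the Center for Applied Statistics. Research interests: statistical learning, consumer behavior analysis, and text analysis.
Guo Yuxuan (郭昱璇) is a PhD student at the School of Statistics, Renmin University of China, whose main research interests include review text mining.
Chen Jiayi (陈嘉怡) completed both undergraduate and master's studies at the School of Statistics, Renmin University of China, and now works as an algorithm engineer at Alibaba's Intelligent Information Platform, focusing on advertising mechanisms and anti-fraud.
Wang Feifei (王菲菲) is an associate professor at the School of Statistics, Renmin University of China, with research interests in text mining and its business applications, social network analysis, and big data modeling.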
Publication Page