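Professor Xiaoling Lu (吕晓玲) and Associate Professor Feifei Wang (王菲菲) of our school have published a paper in Data Mining and Knowledge Discovery. The study addresses topic modeling for two text corpora that influence each other. It proposes a new Joint Dynamic Topic Model (JDTM), which extends the traditional dynamic topic model by linking the two corpora across time so that the lead-lag relationship between them can be uncovered. The model also helps predict future topic changes in the lagging corpus.
Paper Title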
Joint Dynamic Topic Model for Recognition of Lead-lag Relationship in Two Text Corpora
Abstract
Topic evolution modeling has received significant attentions in recent decades. Although various topic evolution models have been proposed, most studies focus on the single document corpus. However in practice, we can easily access data from multiple sources and also observe relationships between them. Then it is of great interest to recognize the relationship between multiple text corpora and further utilize this relationship to improve topic modeling. In this work, we focus on a special type of relationship between two text corpora, which we define as the "lead-lag relationship". This relationship characterizes the phenomenon that one text corpus would influence the topics to be discussed in the other text corpus in the future. To discover the lead-lag relationship, we propose a joint dynamic topic model and also develop an embedding extension to address the modeling problem of large-scale text corpus. With the recognized lead-lag relationship, the similarities of the two text corpora can be figured out and the quality of topic learning in both corpora can be improved. We numerically investigate the performance of the joint dynamic topic modeling approach using synthetic data. Finally, we apply the proposed model on two text corpora consisting of statistical papers and the graduation theses. Results show the proposed model can well recognize the lead-lag relationship between the two corpora, and the specific and shared topic patterns in the two corpora are also discovered.
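About the Authors
Yandi Zhu (朱彦頔) is a doctoral student at the School of Economics, Peking University.
Xiaoling Lu (吕晓玲) is a professor at the School of Statistics, Renmin University of China, and a researcher at its Center for Applied Statistics. Research interests: statistical learning, consumer behavior analysis, and text analysis.
Jingya Hong (洪婧娅) is a master's student at the School of Statistics, Renmin University of China.
Feifei Wang (王菲菲) is an associate professor at the School of Statistics, Renmin University of China. Her research focuses on text mining and its business applications, social network analysis, and big data modeling. Her papers have appeared in leading Chinese and international journals, including Journal of Econometrics, Journal of Business & Economic Statistics, Journal of Machine Learning Research, and 中国科学(数学) (Scientia Sinica Mathematica). She has led or participated in a number of research projects, including projects funded by the National Natural Science Foundation of China, major social science projects of the Ministry of Education, and National Key R&D Program projects.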
Screenshot of the Published Paper