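Professor Xiaoling Lu (吕晓玲) and Associate Professor Feifei Wang (王菲菲) of our school have published a paper in Information Sciences. Building on the joint dynamic topic model JDTM (Zhu et al., 2022), the study makes two improvements: (1) it introduces a sparsity mechanism that reduces the number of parameters to be estimated, which improves computational efficiency and yields more interpretable topics; and (2) it treats the lag order as a parameter to be estimated, which makes the model more applicable in practice.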
Paper title
Bayesian sparse joint dynamic topic model with flexible lead-lag order
Abstract
Currently, text documents from multiple sources have become available in many fields. It is of great interest to study the relationship between documents from different sources and uncover the underlying causality. Zhu et al. (2021) proposed a joint dynamic topic model (JDTM). They classified all topics into three groups and used the “shared topics” with a fixed time lag order to characterize the shared information between two corpora. Although JDTM is a powerful tool for discovering the lead-lag relationship, there are two potential shortcomings. First, different shared topics should have distinct meanings, which should lead to different time lag orders between the two corpora. Second, for dynamic documents, not all topics are represented in each time slice, and thus topic sparsity should be considered. To address these two problems, we propose a sparse joint dynamic topic model (SJDTM) with a flexible lead-lag order. We assume a birth-and-death mechanism for all topics and a flexible lead-lag order for different shared topics. The performance of SJDTM is evaluated using both synthetic data and two real text corpora consisting of conference papers and journal papers.
About the authors
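Feifei Wang (王菲菲) is an associate professor at the School of Statistics, Renmin University of China. Research interests include text mining and its business applications, social network analysis, and big data modeling. Wang's papers have appeared in leading domestic and international journals, including Journal of Econometrics, Journal of Business & Economic Statistics, Journal of Machine Learning Research, and 中国科学(数学) (Scientia Sinica Mathematica). Wang has led and participated in a number of research projects, including National Natural Science Foundation of China projects, major social science projects of the Ministry of Education, and national key R&D projects.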
Rui Zhou (周睿) is a PhD student at the School of Statistics, Renmin University of China.
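Yichao Feng (冯艺超) holds a master's degree from the School of Statistics, Renmin University of China, and now works at JD.com (京东).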
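Xiaoling Lu (吕晓玲) is a professor at the School of Statistics, Renmin University of China, and a researcher at the Center for Applied Statistics. Research interests: statistical learning, consumer behavior analysis, and text analysis.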