20200605 Feifei Wang: Sequential Text Term Selection in Vector Space Models
Time: 2020/06/05 (Friday) 15:00
Format: Tencent Meeting
Speaker: Feifei Wang
Title: Sequential Text Term Selection in Vector Space Models
Abstract:
Text mining has recently attracted a great deal of attention with the accumulation of text documents in all fields. In this paper, we focus on the use of textual information to explain continuous variables in the framework of linear regression. To handle unstructured texts, one common practice is to structure the text documents via vector space models. However, whether words or phrases should serve as the basic analysis terms in vector space models remains heavily debated. In addition, vector space models often lead to an extremely large term set and suffer from the curse of dimensionality, which makes term selection both important and necessary. Toward this end, we propose a novel term screening method for vector space models under a linear regression setup. We first split the entire term space into subspaces according to the length of the terms and then conduct term screening sequentially. We prove the screening consistency of the method and assess its empirical performance with simulations based on a dataset of online consumer reviews for cellphones. Then, we analyze the associated real data. The results show that the sequential term selection technique can effectively detect the relevant terms in a few steps.
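Speaker bio:
Feifei Wang is an assistant professor at the School of Statistics, Renmin University of China, and holds a Ph.D. in statistics from the Guanghua School of Management, Peking University. Research interests include text mining and its business applications, big data modeling, spatial statistics, and social network analysis, with publications in journals including the Journal of Econometrics, the Journal of Business and Economic Statistics, and Statistics in Medicine.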
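Host bio:
Yifan Sun is an associate professor and doctoral supervisor at the School of Statistics, Renmin University of China, director of the Probability Theory and Mathematical Statistics teaching and research section, and a member of the 9th Council of the National Industrial Statistics Teaching Research Association (全国工业统计学教学研究会). Main research interests include complex data analysis, network analysis, and optimization methods, with more than 20 papers published in academic journals such as Statistics in Medicine and Statistical Research (统计研究). Has led 6 research projects at the national and provincial/ministerial levels.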