
Bayesian Text Classification and Summarization via A Class-Specified Topic Model

Publication Time: 2021.01.01

Lead Author: Feifei Wang

Corresponding Author: Feifei Wang

Journal: JOURNAL OF MACHINE LEARNING RESEARCH


Abstract

We propose the class-specified topic model (CSTM) to deal with the tasks of text classification and class-specific text summarization. The model assumes that, in addition to a set of latent topics shared across classes, there is a set of class-specific latent topics for each class. Each document is a probabilistic mixture of the class-specific topics associated with its class and the shared topics. Each class-specific or shared topic has its own probability distribution over a given dictionary. We develop a Bayesian inference procedure for CSTM in the semisupervised scenario, with the supervised scenario as a special case. We analyze in detail the 20 Newsgroups dataset, a benchmark dataset for text classification, and demonstrate that CSTM performs better than a two-stage approach based on latent Dirichlet allocation (LDA), several existing supervised extensions of LDA, and an L1-penalized logistic regression. The favorable performance of CSTM is also demonstrated through Monte Carlo simulations and an analysis of the Reuters dataset.
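To make the generative structure described above concrete, the following is a minimal illustrative sketch of how a document could be generated under a CSTM-style model. The Dirichlet priors, hyperparameter values (alpha, beta), and topic counts are assumptions made for illustration only, not the authors' exact specification or inference procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    V = 1000          # dictionary (vocabulary) size
    K_shared = 5      # topics shared across all classes
    K_class = 3       # class-specific topics per class
    n_classes = 4
    alpha, beta = 0.1, 0.01   # assumed Dirichlet hyperparameters

    # Each shared or class-specific topic is a probability distribution over the dictionary.
    shared_topics = rng.dirichlet(beta * np.ones(V), size=K_shared)
    class_topics = {
        c: rng.dirichlet(beta * np.ones(V), size=K_class) for c in range(n_classes)
    }

    def generate_document(c, doc_len=100):
        """Generate one document of class c as a mixture of the shared topics
        and the class-specific topics associated with class c."""
        topics = np.vstack([shared_topics, class_topics[c]])   # (K_shared + K_class, V)
        theta = rng.dirichlet(alpha * np.ones(len(topics)))    # document-level topic mixture
        z = rng.choice(len(topics), size=doc_len, p=theta)     # topic assignment per word
        words = np.array([rng.choice(V, p=topics[k]) for k in z])
        return words

    doc = generate_document(c=2)

In this sketch, classification would amount to inferring which class's topic set best explains a document's words; the paper develops this inference in a Bayesian, semisupervised framework rather than the forward simulation shown here.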


Keywords

Text Mining, Latent Topic, Semisupervised Classification, L1 Penalization