Clustering Based on Kolmogorov-Smirnov Statistic with Application to Bank Card Transaction Data
[Publication Time] 2021-03-01
[Lead Author] Yingqiu Zhu (朱映秋)
[Corresponding Author] Danyang Huang (黄丹阳); Bo Zhang (张波)
[Journal] Journal of the Royal Statistical Society: Series C (Applied Statistics)
[Abstract]
Rapid developments in third-party online payment platforms now make it possible to record massive bank card transaction data. Clustering on such transaction data is of great importance for the analysis of merchant behaviours. However, traditional methods based on generated features inevitably lead to much loss of information. To make better use of bank card transaction data, this study investigates the possibility of using the empirical cumulative distribution of transaction amounts. As the distance between two merchants can be measured using the two-sample Kolmogorov–Smirnov test statistic, we propose the Kolmogorov–Smirnov K-means clustering approach based on this distance measure. An approximation step is conducted to ensure the feasibility of the proposed method even for large-scale transaction data, and the associated theoretical properties are investigated. Both simulations and an empirical study demonstrate that our method outperforms feature-based methods and is computationally efficient for large-scale data sets.
[Keywords]
Empirical cumulative distribution function; K-means clustering; Kolmogorov–Smirnov test; Sampling
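
[Illustrative sketch]
The abstract describes measuring the distance between two merchants with the two-sample Kolmogorov–Smirnov statistic and running K-means on that distance. The Python sketch below is one possible reading of that idea, not the authors' implementation. The function names (`ks_statistic`, `ecdf_on_grid`, `ks_kmeans`), the fixed evaluation grid, and the choice of the pointwise mean of member ECDFs as the cluster centre are all assumptions made for illustration. In particular, the grid is a simplification introduced here and should not be taken as the paper's approximation step.

```python
import numpy as np


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: sup_t |F_x(t) - F_y(t)|."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    # The supremum is attained at one of the pooled sample points.
    pooled = np.concatenate([x, y])
    fx = np.searchsorted(x, pooled, side="right") / x.size
    fy = np.searchsorted(y, pooled, side="right") / y.size
    return float(np.max(np.abs(fx - fy)))


def ecdf_on_grid(sample, grid):
    """Empirical CDF of `sample` evaluated at every point of `grid`."""
    s = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(s, grid, side="right") / s.size


def ks_kmeans(samples, k, grid, n_iter=20, seed=0):
    """K-means-style clustering of samples under a sup-norm ECDF distance.

    Assumption: each merchant is represented by its ECDF on a shared grid,
    and a cluster centre is the pointwise mean of its members' ECDFs.
    """
    rng = np.random.default_rng(seed)
    curves = np.vstack([ecdf_on_grid(s, grid) for s in samples])
    # Initialise centres with k distinct merchants chosen at random.
    centers = curves[rng.choice(len(samples), size=k, replace=False)]
    labels = np.full(len(samples), -1)
    for _ in range(n_iter):
        # dist[i, j] = max over the grid of |ECDF_i - centre_j|.
        dist = np.abs(curves[:, None, :] - centers[None, :, :]).max(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = curves[labels == j]
            if len(members):  # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers
```

Here `samples` would be a list of arrays of transaction amounts, one array per merchant, and `grid` a set of amounts covering their range, for example `np.linspace(low, high, 200)`. Using the whole ECDF rather than a handful of summary features is what lets the distance reflect differences anywhere in the distribution of amounts.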