A preprint of our paper “S-RASTER: Contraction Clustering for Evolving Data Streams” is now available on arXiv. It describes an adaptation of RASTER, a very fast algorithm for detecting dense clusters in big data, to the stream processing paradigm. RASTER was designed for batch processing. In contrast, S-RASTER is able to detect clusters within sliding windows of a data stream, which is particularly relevant for real-time data processing. The abstract is reproduced below.
Title: S-RASTER: Contraction Clustering for Evolving Data Streams Authors: Gregor Ulm, Simon Smith, Adrian Nilsson, Emil Gustavsson, Mats Jirstrand Abstract: Contraction Clustering (RASTER) is a very fast algorithm for density-based clustering, which requires only a single pass. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. Yet, RASTER is limited to batch processing. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step in the sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. It is therefore very well suited to real-world scenarios where clustering does not happen continually but only periodically. We describe the algorithm, including a discussion of implementation details.