My first big project at the Fraunhofer Chalmers Research Centre for Industrial Mathematics centered around processing huge geospatial data sets. The industrial partner was Scania. The main part of my contribution consisted of generating labelled evolving network graphs that make it possible to trace the routes of vehicles over time. During that work, I discovered a linear-time clustering algorithm (Contraction Clustering, which is itself contracted to RASTER) for detecting clusters with dense centers. Those centers are nodes in the aforementioned graphs. The key novelty of my work is the exploitation of domain knowledge, which facilitates clustering with just one pass through the entire input, neighborhood lookup in O(1), and constant memory requirements.
My paper on RASTER got accepted to MOD 2017, which will take place from 14 to 17 September in Volterra, Italy. The abstract of my paper is below.
Contraction Clustering (RASTER): A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time
Authors: Ulm, Gregor; Emil Gustavsson, Mats Jirstrand
Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering methods are infeasible due to their memory requirements or runtime complexity. Contraction Clustering (RASTER) is a linear-time algorithm for identifying density-based clusters. Its coefficient is negligible as it depends neither on input size nor the number of clusters. Its memory requirements are constant. Consequently, RASTER is suitable for big data applications where the size of the data may be huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. Our algorithm is extremely fast. In single-threaded execution on a contemporary workstation, it clusters ten million points in less than 20 seconds when using a slow interpreted programming language like Python. Furthermore, RASTER is easily parallelizable.