New Preprint: S-RASTER: Contraction Clustering for Evolving Data Streams

A preprint of our paper “S-RASTER: Contraction Clustering for Evolving Data Streams” is now available on arXiv. It describes an adaptation of RASTER, a very fast algorithm for detecting dense clusters in big data, to the stream processing paradigm. RASTER was designed for batch processing. In contrast, S-RASTER is able to detect clusters within sliding windows of a data stream, which is particularly relevant for real-time data processing. The abstract is reproduced below.

Title: S-RASTER: Contraction Clustering for Evolving Data Streams 

Authors: Gregor Ulm, Simon Smith, Adrian Nilsson,
Emil Gustavsson, Mats Jirstrand 

Abstract:  Contraction Clustering (RASTER) is a very fast algorithm
for density-based clustering, which requires only a single pass. It
can process arbitrary amounts of data in linear time and in constant
memory, quickly identifying approximate clusters. It also exhibits
good scalability in the presence of multiple CPU cores. Yet, RASTER
is limited to batch processing. In contrast, S-RASTER is an
adaptation of RASTER to the stream processing paradigm that is able
to identify clusters in evolving data streams. This algorithm
retains the main benefits of its parent algorithm, i.e. single-pass
linear time cost and constant memory requirements for each discrete
time step in the sliding window. The sliding window is efficiently
pruned, and clustering is still performed in linear time. Like
RASTER, S-RASTER trades off an often negligible amount of precision
for speed. It is therefore very well suited to real-world scenarios
where clustering does not happen continually but only periodically.
We describe the algorithm, including a discussion of implementation
details. 
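
To make the sliding-window idea concrete, here is a minimal Python sketch. It is not the implementation from the paper. The class name, the parameters, and the per-step bookkeeping are assumptions chosen for illustration. It assumes 2D points that arrive with non-decreasing integer time steps, keeps one tile counter per time step, and prunes by subtracting the counter of every step that has left the window.

```python
import math
from collections import Counter, deque


class SlidingTileWindow:
    """Hypothetical sketch: tile counts over the last `window_size` time steps."""

    def __init__(self, window_size, precision, threshold):
        self.window_size = window_size
        self.scale = 10 ** precision   # tiles are points truncated to `precision` decimals
        self.threshold = threshold     # minimum points for a tile to count as significant
        self.steps = deque()           # (time_step, Counter of tiles seen in that step)
        self.totals = Counter()        # tile counts summed over the current window

    def add(self, time_step, x, y):
        # Assumes time steps arrive in non-decreasing order.
        if not self.steps or self.steps[-1][0] != time_step:
            self.steps.append((time_step, Counter()))
            self._prune(time_step)
        tile = (math.floor(x * self.scale), math.floor(y * self.scale))
        self.steps[-1][1][tile] += 1
        self.totals[tile] += 1

    def _prune(self, current_step):
        # Drop every step that has fallen out of the window and undo its counts.
        while self.steps and self.steps[0][0] <= current_step - self.window_size:
            _, expired = self.steps.popleft()
            self.totals.subtract(expired)
            for tile in expired:
                if self.totals[tile] <= 0:
                    del self.totals[tile]

    def significant_tiles(self):
        return {tile for tile, n in self.totals.items() if n >= self.threshold}
```

In this sketch, pruning touches only the tiles of expired steps, and the set returned by significant_tiles() can be clustered whenever a result is requested rather than after every point, which matches the periodic clustering scenario mentioned in the abstract.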

Upcoming Poster Presentation at Euro-Par 2019

Our paper “Active-Code Replacement in the OODIDA Data Analytics Platform” has been accepted to the poster session at Euro-Par 2019, which will take place from August 26 to 30 in Goettingen, Germany. The abstract is reproduced below. A preprint of the paper the poster is based on is available on arXiv, where you can also find a preprint of the paper on OODIDA.

Title: Active-Code Replacement in the OODIDA Data Analytics Platform
Authors: Gregor Ulm, Emil Gustavsson, Mats Jirstrand 

OODIDA (On-board/Off-board Distributed Data Analytics) is a
platform for distributing and executing concurrent data analytics
tasks. It targets fleets of reference vehicles in the automotive
industry and has a particular focus on rapid prototyping. Its
underlying message-passing infrastructure has been implemented in
Erlang/OTP. External Python applications perform data analytics
tasks. Most work is performed by clients (on-board). A central
server performs supplementary tasks (off-board). OODIDA can be
automatically packaged and deployed, which necessitates restarting
parts of the system, or all of it. This is potentially disruptive. 
To address this issue, we added the ability to execute
user-defined Python modules on clients as well as the server. These
modules can be replaced without restarting any part of the system 
and they can even be replaced between iterations of an ongoing 
assignment. This facilitates use cases such as iterative A/B
testing of machine learning algorithms or deploying experimental
algorithms on-the-fly.

New Preprint: Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass

A preprint of our paper “Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass” is now available on arXiv. The abstract is reproduced below:

Title: Contraction Clustering (RASTER): A Very Fast Big Data
Algorithm for Sequential and Parallel Density-Based Clustering
in Linear Time, Constant Memory, and a Single Pass

Authors: Gregor Ulm, Simon Smith, Adrian Nilsson,
Emil Gustavsson, Mats Jirstrand

Abstract: Clustering is an essential data mining tool for
analyzing and grouping similar objects. In big data
applications, however, many clustering algorithms are
infeasible due to their high memory requirements and/or
unfavorable runtime complexity. In contrast, Contraction
Clustering (RASTER) is a single-pass algorithm for identifying
density-based clusters with linear time complexity. Due to its
favorable runtime and the fact that its memory requirements
are constant, this algorithm is highly suitable for big data
applications where the amount of data to be processed is huge.
It consists of two steps: (1) a contraction step which
projects objects onto tiles and (2) an agglomeration step
which groups tiles into clusters. This algorithm is extremely
fast in both sequential and parallel execution. In
single-threaded execution on a contemporary workstation, an
implementation in Rust processes a batch of 500 million points 
with 1 million clusters in less than 50 seconds. The speedup 
due to parallelization is significant, amounting to a factor 
of around 4 on an 8-core machine. 
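
The two steps named in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Rust implementation the abstract refers to. The function names, the parameters (precision, threshold, min_size), and the choice of an 8-tile neighborhood are all hypothetical. Points are assumed to be 2D, and a tile is the point with its coordinates truncated to a fixed number of decimals.

```python
import math
from collections import Counter, deque


def contract(points, precision, threshold):
    """Contraction: project points onto tiles, keep tiles with >= threshold points."""
    scale = 10 ** precision
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    return {tile for tile, n in counts.items() if n >= threshold}


def agglomerate(tiles, min_size):
    """Agglomeration: group adjacent tiles into clusters of at least min_size tiles."""
    remaining = set(tiles)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster = {start}
        queue = deque([start])
        while queue:
            tx, ty = queue.popleft()
            # Assumption: tiles are adjacent if they touch, including diagonally.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (tx + dx, ty + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        queue.append(neighbor)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


points = [(0.11, 0.12), (0.13, 0.14), (0.21, 0.12), (0.25, 0.15), (5.0, 5.0)]
clusters = agglomerate(contract(points, precision=1, threshold=2), min_size=2)
```

The contraction step reads each point once and stores only a counter per tile, which is where the single-pass behavior in this sketch comes from. The isolated point at (5.0, 5.0) falls into a tile below the threshold and is discarded before agglomeration.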

New Preprint: Active-Code Replacement in the OODIDA Data Analytics Platform

A preprint of our paper “Active-Code Replacement in the OODIDA Data Analytics Platform” is now available on arXiv. It describes a key feature of OODIDA, a distributed data analytics system for the automotive industry that targets a fleet of reference vehicles. With active-code replacement, code for custom computations can be replaced without taking any component of the system down. The abstract is reproduced below:

Active-Code Replacement in the OODIDA Data Analytics Platform
Gregor Ulm, Emil Gustavsson, Mats Jirstrand

OODIDA (On-board/Off-board Distributed Data Analytics) is a
platform for distributing and executing concurrent data analysis
tasks. It targets a fleet of reference vehicles in the
automotive industry and has a particular focus on rapid
prototyping. Its underlying message-passing infrastructure has
been implemented in Erlang/OTP, but the external applications
for user interaction and carrying out data analysis tasks use
a language-independent JSON interface. These applications are
primarily implemented in Python. A data analyst interacting with
OODIDA uses a Python library. The bulk of the data analytics
tasks are performed by clients (on-board), while a central server
performs supplementary tasks (off-board). OODIDA can be
automatically packaged and deployed, which necessitates restarting
parts of the system, or all of it. This is potentially disruptive.
To address this issue, we added the ability to execute
user-defined Python modules on both the client and the server,
which can be replaced without restarting any part of the system.
Modules can even be swapped between iterations of an ongoing
assignment. This facilitates use cases such as iterative A/B
testing of machine learning algorithms or deploying experimental
algorithms on-the-fly. Active-code replacement is a key feature
of our system as well as an example of interoperability between
a functional and a non-functional programming language.
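
As a rough illustration of swapping a user-defined Python module between iterations, consider the sketch below. It does not show how OODIDA does this. The Worker class, the module name custom_code, and the compute entry point are hypothetical. The only mechanism shown is building a fresh module object from source text with the standard library, so that the next iteration picks up the new code without a restart.

```python
import types


def load_module(name, source):
    """Build a fresh module object from Python source text."""
    module = types.ModuleType(name)
    exec(compile(source, f"<{name}>", "exec"), module.__dict__)
    return module


class Worker:
    """Hypothetical long-running worker that calls a replaceable `compute` function."""

    def __init__(self):
        self.custom = None

    def replace_code(self, source):
        # Swap the module reference; the next iteration uses the new code.
        self.custom = load_module("custom_code", source)

    def run_iteration(self, data):
        return self.custom.compute(data)


worker = Worker()
worker.replace_code("def compute(xs):\n    return sum(xs)\n")
first = worker.run_iteration([1, 2, 3])   # 6

# Replace the code between two iterations of the same assignment.
worker.replace_code("def compute(xs):\n    return max(xs)\n")
second = worker.run_iteration([1, 2, 3])  # 3
```

A real deployment would also need to validate or sandbox the submitted source, since exec runs it with the full privileges of the worker process.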