
New Paper Published: “Active-Code Replacement in the OODIDA Data Analytics Platform”

My paper “Active-Code Replacement in the OODIDA Data Analytics Platform” has recently been published in the conference proceedings volume “Euro-Par 2019: Parallel Processing Workshops”, LNCS 11997. With institutional access, you can read the paper on Springer Link; otherwise, a preprint is available on arXiv. A summary is below.

Title: Active-Code Replacement in the OODIDA Data Analytics Platform

Authors: Gregor Ulm, Emil Gustavsson, Mats Jirstrand

Abstract:
OODIDA (On-board/Off-board Distributed Data Analytics) is a platform for distributing and executing concurrent data analytics tasks. It targets fleets of reference vehicles in the automotive industry and has a particular focus on rapid prototyping. Its underlying message-passing infrastructure has been implemented in Erlang/OTP. External Python applications perform data analytics tasks. Most work is performed by clients (on-board). A central cloud server performs supplementary tasks (off-board). OODIDA can be automatically packaged and deployed, which necessitates restarting parts of the system, or all of it. This is potentially disruptive. To address this issue, we added the ability to execute user-defined Python modules on clients as well as the server. These modules can be replaced without restarting any part of the system and they can even be replaced between iterations of an ongoing assignment. This facilitates use cases such as iterative A/B testing of machine learning algorithms or modifying experimental algorithms on-the-fly.
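
The idea of swapping a user-defined module between iterations can be illustrated with a minimal Python sketch. This is not OODIDA's implementation: the `Worker` class, the `load_module` helper, and the entry-point name `custom_code` are hypothetical names chosen for illustration, and the Erlang/OTP message passing that delivers the code in the real system is left out entirely. The sketch only shows that a long-running process can rebuild a module from source text and use the new version on its next iteration without being restarted.

```python
import types


def load_module(name: str, source: str) -> types.ModuleType:
    """Build a fresh module object from source text (hypothetical helper)."""
    module = types.ModuleType(name)
    exec(compile(source, f"<{name}>", "exec"), module.__dict__)
    return module


class Worker:
    """Stands in for a long-running client or server process (assumption)."""

    def __init__(self) -> None:
        self.custom = None

    def deploy(self, source: str) -> None:
        # Replace the module reference; the worker object keeps running.
        self.custom = load_module("custom_code", source)

    def run_iteration(self, data):
        # Each iteration calls whatever version is currently deployed.
        return self.custom.custom_code(data)


# Two versions of a user-defined module, as they might arrive over the wire.
VERSION_A = "def custom_code(xs):\n    return sum(xs) / len(xs)\n"
VERSION_B = "def custom_code(xs):\n    return max(xs)\n"

worker = Worker()
worker.deploy(VERSION_A)
first = worker.run_iteration([1, 2, 3, 6])   # mean of the values

worker.deploy(VERSION_B)                      # swapped between iterations
second = worker.run_iteration([1, 2, 3, 6])  # maximum of the values
```

Running arbitrary source with `exec` is only acceptable when the code comes from a trusted analyst; a real deployment would need validation of incoming modules, which this sketch does not attempt.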

New Preprint: S-RASTER: Contraction Clustering for Evolving Data Streams

A preprint of our paper “S-RASTER: Contraction Clustering for Evolving Data Streams” is now available on arXiv. It describes an adaptation of RASTER, a very fast algorithm for detecting dense clusters in big data, to the stream processing paradigm. RASTER was designed for batch processing. In contrast, S-RASTER is able to detect clusters within sliding windows of a data stream, which is particularly relevant for real-time data processing. The abstract is reproduced below.

Title: S-RASTER: Contraction Clustering for Evolving Data Streams 

Authors: Gregor Ulm, Simon Smith, Adrian Nilsson,
Emil Gustavsson, Mats Jirstrand 

Abstract: Contraction Clustering (RASTER) is a very fast algorithm
for density-based clustering, which requires only a single pass. It
can process arbitrary amounts of data in linear time and in constant
memory, quickly identifying approximate clusters. It also exhibits
good scalability in the presence of multiple CPU cores. Yet, RASTER
is limited to batch processing. In contrast, S-RASTER is an
adaptation of RASTER to the stream processing paradigm that is able
to identify clusters in evolving data streams. This algorithm
retains the main benefits of its parent algorithm, i.e. single-pass
linear time cost and constant memory requirements for each discrete
time step in the sliding window. The sliding window is efficiently
pruned, and clustering is still performed in linear time. Like
RASTER, S-RASTER trades off an often negligible amount of precision
for speed. It is therefore very well suited to real-world scenarios
where clustering does not happen continually but only periodically.
We describe the algorithm, including a discussion of implementation
details. 
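
To make the sliding-window idea more concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the algorithm from the paper: it assumes 2D points, maps each point to a grid cell ("tile") by scaling and flooring its coordinates, keeps one tile counter per discrete time step, and prunes the window by subtracting the counts of the oldest step. The class name `SlidingTileWindow` and the parameters `window_size`, `precision`, and `threshold` are hypothetical.

```python
import math
from collections import Counter, deque


class SlidingTileWindow:
    """Per-step tile counts over a fixed-length window (illustrative only)."""

    def __init__(self, window_size: int, precision: int, threshold: int) -> None:
        self.window_size = window_size
        self.scale = 10 ** precision
        self.threshold = threshold
        self.steps = deque()     # one Counter per time step in the window
        self.totals = Counter()  # tile counts summed over the window

    def add_step(self, points) -> None:
        # Single pass over the new points: project each one onto its tile.
        step = Counter(
            (math.floor(x * self.scale), math.floor(y * self.scale))
            for x, y in points
        )
        self.steps.append(step)
        self.totals.update(step)

        # Prune: once the window is full, drop the oldest step's counts.
        if len(self.steps) > self.window_size:
            expired = self.steps.popleft()
            self.totals.subtract(expired)
            for tile in expired:
                if self.totals[tile] <= 0:
                    del self.totals[tile]

    def significant_tiles(self) -> set:
        # Tiles dense enough to be handed to a periodic clustering step.
        return {t for t, n in self.totals.items() if n >= self.threshold}


window = SlidingTileWindow(window_size=2, precision=1, threshold=2)
window.add_step([(0.11, 0.12), (0.15, 0.18)])
after_first = window.significant_tiles()
window.add_step([(5.01, 5.02)])
window.add_step([(5.05, 5.07)])  # the first step now falls out of the window
after_third = window.significant_tiles()
```

In this sketch, clustering would be run on `significant_tiles()` only when a result is needed, which matches the periodic use case the abstract mentions.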

New Preprint: Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass

A preprint of our paper “Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass” is now available on arXiv. The abstract is reproduced below:

Title: Contraction Clustering (RASTER): A Very Fast Big Data
Algorithm for Sequential and Parallel Density-Based Clustering
in Linear Time, Constant Memory, and a Single Pass

Authors: Gregor Ulm, Simon Smith, Adrian Nilsson,
Emil Gustavsson, Mats Jirstrand

Abstract: Clustering is an essential data mining tool for
analyzing and grouping similar objects. In big data
applications, however, many clustering algorithms are
infeasible due to their high memory requirements and/or
unfavorable runtime complexity. In contrast, Contraction
Clustering (RASTER) is a single-pass algorithm for identifying
density-based clusters with linear time complexity. Due to its
favorable runtime and the fact that its memory requirements
are constant, this algorithm is highly suitable for big data
applications where the amount of data to be processed is huge.
It consists of two steps: (1) a contraction step which
projects objects onto tiles and (2) an agglomeration step
which groups tiles into clusters. This algorithm is extremely
fast in both sequential and parallel execution. In single-
threaded execution on a contemporary workstation, an 
implementation in Rust processes a batch of 500 million points 
with 1 million clusters in less than 50 seconds. The speedup 
due to parallelization is significant, amounting to a factor 
of around 4 on an 8-core machine. 
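
The two steps named in the abstract can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' Rust implementation, and it makes several assumptions not stated above: points are 2D, a tile is obtained by scaling and flooring the coordinates (`precision`), a tile counts as significant once it holds at least `threshold` points, and tiles are joined when they touch in an 8-neighbourhood. The function names `contract` and `agglomerate` are hypothetical.

```python
import math
from collections import Counter


def contract(points, precision: int, threshold: int) -> set:
    """Contraction: project points onto tiles, keep sufficiently dense tiles."""
    scale = 10 ** precision
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    return {tile for tile, n in counts.items() if n >= threshold}


def agglomerate(tiles: set, min_size: int) -> list:
    """Agglomeration: group adjacent tiles (8-neighbourhood) into clusters."""
    remaining = set(tiles)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster, frontier = {start}, [start]
        while frontier:
            x, y = frontier.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (x + dx, y + dy)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        cluster.add(neighbour)
                        frontier.append(neighbour)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


points = [
    (0.11, 0.12), (0.15, 0.18), (0.21, 0.13), (0.25, 0.16),  # dense region
    (5.01, 5.02), (5.05, 5.07), (5.11, 5.03), (5.15, 5.08),  # dense region
    (9.05, 9.05),                                            # isolated point
]
tiles = contract(points, precision=1, threshold=2)
clusters = agglomerate(tiles, min_size=2)
```

Each point is touched once during contraction, and agglomeration works only on the retained tiles, which is where the trade of a little precision for speed comes from. The sketch does not try to reproduce the performance figures quoted in the abstract.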

New Preprint: Active-Code Replacement in the OODIDA Data Analytics Platform

A preprint of our paper “Active-Code Replacement in the OODIDA Data Analytics Platform” is now available on arXiv. It describes a key feature of OODIDA, a distributed data analytics system for the automotive industry that targets a fleet of reference vehicles. With active-code replacement, code for custom computations can be replaced without taking down any component of the system. The abstract is reproduced below:

Active-Code Replacement in the OODIDA Data Analytics Platform
Gregor Ulm, Emil Gustavsson, Mats Jirstrand

OODIDA (On-board/Off-board Distributed Data Analytics) is a
platform for distributing and executing concurrent data analysis
tasks. It targets a fleet of reference vehicles in the
automotive industry and has a particular focus on rapid
prototyping. Its underlying message-passing infrastructure has
been implemented in Erlang/OTP, but the external applications
for user interaction and carrying out data analysis tasks use
a language-independent JSON interface. These applications are
primarily implemented in Python. A data analyst interacting with
OODIDA uses a Python library. The bulk of the data analytics
tasks are performed by clients (on-board), while a central server
performs supplementary tasks (off-board). OODIDA can be
automatically packaged and deployed, which necessitates restarting
parts of the system, or all of it. This is potentially disruptive.
To address this issue, we added the ability to execute
user-defined Python modules on both the client and the server,
which can be replaced without restarting any part of the system.
Modules can even be swapped between iterations of an ongoing
assignment. This facilitates use cases such as iterative A/B
testing of machine learning algorithms or deploying experimental
algorithms on-the-fly. Active-code replacement is a key feature
of our system as well as an example of interoperability between
a functional and a non-functional programming language.

New Preprint: OODIDA: On-board/Off-board Distributed Data Analytics for Connected Vehicles

A preprint of our paper “OODIDA: On-board/Off-board Distributed Data Analytics for Connected Vehicles” is now available on arXiv. It describes a distributed data analytics system for the automotive industry that targets a fleet of reference vehicles. The abstract is reproduced below:

OODIDA: On-board/Off-board Distributed Data Analytics
for Connected Vehicles
Gregor Ulm, Emil Gustavsson, and Mats Jirstrand

Connected vehicles may produce gigabytes of data per
hour, which makes centralized data processing
impractical at the fleet level. In addition, there
are the problems of distributing tasks to edge
devices and processing them efficiently. Our solution
to this problem is OODIDA (On-board/Off-board
Distributed Data Analytics), which is a platform that
tackles both task distribution to connected vehicles
as well as concurrent execution of large-scale tasks
on arbitrary subsets of clients. Its message-passing
infrastructure has been implemented in Erlang/OTP,
while the end points are language-agnostic. OODIDA is
highly scalable and able to process a significant
volume of data on resource-constrained clients.