New Publication: OODIDA: On-Board/Off-Board Distributed Real-Time Data Analytics for Connected Vehicles

My paper “OODIDA: On-Board/Off-Board Distributed Real-Time Data Analytics for Connected Vehicles” was recently published in the Springer journal Data Science and Engineering. The article has been made freely available via Open Access. The abstract is below.

A fleet of connected vehicles easily produces many gigabytes of data
per hour, making centralized (off-board) data processing impractical.
In addition, there is the issue of distributing tasks to on-board units
in vehicles and processing them efficiently. Our solution to this
problem is On-board/Off-board Distributed Data Analytics (OODIDA),
which is a platform that tackles both task distribution to connected
vehicles as well as concurrent execution of tasks on arbitrary subsets
of edge clients. Its message-passing infrastructure has been
implemented in Erlang/OTP, while the end points use a language-
independent JSON interface. Computations can be carried out in
arbitrary programming languages. The message-passing infrastructure of
OODIDA is highly scalable, facilitating the execution of large numbers
of concurrent tasks.
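
To make the idea of a language-independent JSON interface and of addressing arbitrary subsets of clients a bit more concrete, here is a minimal Python sketch. It is not taken from OODIDA: the field names (`task_id`, `clients`, `algorithm`, `iterations`) and the helper functions are assumptions made purely for illustration.

```python
import json


def make_assignment(task_id, client_ids, algorithm, iterations):
    """Build a hypothetical assignment message (all field names are assumptions)."""
    return {
        "task_id": task_id,
        "clients": sorted(client_ids),
        "algorithm": algorithm,
        "iterations": iterations,
    }


def fan_out(assignment):
    """Turn one assignment into one JSON-encoded task per addressed client."""
    tasks = {}
    for client in assignment["clients"]:
        # Every client receives the same task body plus its own identifier.
        task = {k: v for k, v in assignment.items() if k != "clients"}
        task["client"] = client
        tasks[client] = json.dumps(task, sort_keys=True)
    return tasks


# Address a subset of two (hypothetical) vehicles.
assignment = make_assignment(1, {"vehicle_7", "vehicle_3"}, "mean_speed", 5)
tasks = fan_out(assignment)
```

Because each task is plain JSON, the receiving end point could presumably be written in any language that can parse it.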

New Publication: Facilitating Rapid Prototyping in the OODIDA Data Analytics Platform via Active-Code Replacement

My paper “Facilitating Rapid Prototyping in the OODIDA Data Analytics Platform via Active-Code Replacement” was recently published in the journal Array. The full paper is publicly available at no cost via the journal homepage: https://doi.org/10.1016/j.array.2020.100043. You can also read the preprint on arXiv: https://arxiv.org/abs/1903.09477.

Here is the abstract:

OODIDA (On-board/Off-board Distributed Data Analytics) is a platform for
distributed real-time analytics, targeting fleets of reference vehicles in the
automotive industry. Its users are data analysts. The bulk of the data
analytics tasks are performed by clients (on-board), while a central cloud
server performs supplementary tasks (off-board). OODIDA can be automatically
packaged and deployed, which necessitates restarting parts of the system, or
all of it. As this is potentially disruptive, we added the ability to execute
user-defined Python modules on clients as well as the server. These modules can
be replaced without restarting any part of the system; they can even be
replaced between iterations of an ongoing assignment. This feature is referred
to as active-code replacement. It facilitates use cases such as iterative A/B
testing of machine learning algorithms or modifying experimental algorithms
on-the-fly. Consistency of results is achieved by majority vote, which prevents
tainted state. Active-code replacement can be done in less than a second in an
idealized setting whereas a standard deployment takes many orders of magnitude
more time. The main contribution of this paper is the description of a
relatively straightforward approach to active-code replacement that is very
user-friendly. It enables a data analyst to quickly execute custom code on the
cloud server as well as on client devices. Sensible safeguards and design
decisions ensure that this feature can be used by non-specialists who are not
familiar with the implementation of OODIDA in general or this feature in
particular. As a consequence of adding the active-code replacement feature,
OODIDA is now very well-suited for rapid prototyping.
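
The general idea of swapping user-defined Python code without a restart can be sketched in a few lines. What follows is a simplified illustration and not the OODIDA implementation: the entry point name `run`, the helper names, and the use of a SHA-256 digest to identify code versions are assumptions, and the majority vote shown here is only one plausible reading of the consistency check mentioned in the abstract.

```python
import hashlib
import types
from collections import Counter


def load_custom_module(source, name="custom_code"):
    """Compile user-supplied source into a fresh module object.

    The required entry point `run` is an assumption of this sketch.
    """
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    if not callable(getattr(module, "run", None)):
        raise ValueError("custom module must define a callable run(data)")
    return module


def code_digest(source):
    """Identify a version of the custom code by a digest of its source."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def majority_results(results):
    """Keep only results produced by the code version that a strict
    majority of clients used; return None if there is no such version.

    `results` is a list of (digest, value) pairs, one per client.
    """
    if not results:
        return None
    counts = Counter(digest for digest, _ in results)
    digest, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        return None
    return [value for d, value in results if d == digest]


v1 = "def run(data):\n    return sum(data)\n"
v2 = "def run(data):\n    return max(data)\n"

# Iteration 1 uses v1; the code is replaced before iteration 2.
module = load_custom_module(v1)
first = module.run([1, 2, 3])
module = load_custom_module(v2)
second = module.run([1, 2, 3])

# One client still ran the old version, so its result is discarded.
collected = [(code_digest(v2), 3), (code_digest(v2), 3), (code_digest(v1), 6)]
consistent = majority_results(collected)
```

Loading the source into a fresh module object on every replacement avoids stale definitions from an earlier version lingering in the namespace.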

New Publication: S-RASTER: Contraction Clustering for Evolving Data Streams

My paper “S-RASTER: Contraction Clustering for Evolving Data Streams” was recently published in the Journal of Big Data. It is available via open access. (I prefer the formatting of the arXiv preprint, however.) Here is the abstract:

Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. RASTER exhibits very competitive performance compared to standard clustering algorithms, but at the cost of decreased precision. Yet, RASTER is limited to batch processing and unable to identify clusters that only exist temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results, based on standard metrics. It is very well suited to real-world scenarios where clustering does not happen continually but only periodically.
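
For readers who want a feel for single-pass, grid-based clustering of 2D points, here is a rough Python sketch in the spirit of the abstract. It is not the reference implementation from the paper: the parameter names (`precision`, `threshold`, `min_size`), the truncation of coordinates to grid tiles, and the 8-neighborhood used for joining tiles are assumptions of this sketch.

```python
import math
from collections import Counter


def significant_tiles(points, precision, threshold):
    """Single pass over the points: map each point to a grid tile and
    keep the tiles that received at least `threshold` points."""
    scale = 10 ** precision
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    return {tile for tile, n in counts.items() if n >= threshold}


def cluster_tiles(tiles, min_size):
    """Join adjacent tiles (8-neighborhood) into clusters and drop
    clusters with fewer than `min_size` tiles."""
    remaining = set(tiles)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster = {seed}
        frontier = [seed]
        while frontier:
            tx, ty = frontier.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (tx + dx, ty + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        frontier.append(neighbor)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


points = [
    (0.11, 0.11), (0.15, 0.12), (0.18, 0.19),  # dense tile (1, 1)
    (0.21, 0.11), (0.25, 0.14), (0.28, 0.13),  # dense tile (2, 1)
    (5.05, 5.05),                              # isolated point
]
tiles = significant_tiles(points, precision=1, threshold=3)
clusters = cluster_tiles(tiles, min_size=2)
```

In this sketch, memory use depends on the number of distinct tiles rather than on the number of points, and the result is approximate because clusters are reported as sets of tiles instead of sets of individual points.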

New Paper Published: “Active-Code Replacement in the OODIDA Data Analytics Platform”

My paper “Active-Code Replacement in the OODIDA Data Analytics Platform” has recently been published in the conference proceedings volume “Euro-Par 2019: Parallel Processing Workshops”, LNCS 11997. If you have institutional access, you can read the paper on Springer Link. Otherwise, you will find a preprint on arXiv. A summary is below.

Title: Active-Code Replacement in the OODIDA Data Analytics Platform

Authors: Gregor Ulm, Emil Gustavsson, Mats Jirstrand

Abstract:
OODIDA (On-board/Off-board Distributed Data Analytics) is a platform for distributing and executing concurrent data analytics tasks. It targets fleets of reference vehicles in the automotive industry and has a particular focus on rapid prototyping. Its underlying message-passing infrastructure has been implemented in Erlang/OTP. External Python applications perform data analytics tasks. Most work is performed by clients (on-board). A central cloud server performs supplementary tasks (off-board). OODIDA can be automatically packaged and deployed, which necessitates restarting parts of the system, or all of it. This is potentially disruptive. To address this issue, we added the ability to execute user-defined Python modules on clients as well as the server. These modules can be replaced without restarting any part of the system and they can even be replaced between iterations of an ongoing assignment. This facilitates use cases such as iterative A/B testing of machine learning algorithms or modifying experimental algorithms on-the-fly.

New Preprint: S-RASTER: Contraction Clustering for Evolving Data Streams

A preprint of our paper “S-RASTER: Contraction Clustering for Evolving Data Streams” is now available on arXiv. It describes an adaptation of RASTER, a very fast algorithm for detecting dense clusters in big data, to the stream processing paradigm. RASTER was designed for batch processing. In contrast, S-RASTER is able to detect clusters within sliding windows of a data stream, which is particularly relevant for real-time data processing. The abstract is reproduced below.

Title: S-RASTER: Contraction Clustering for Evolving Data Streams 

Authors: Gregor Ulm, Simon Smith, Adrian Nilsson, Emil Gustavsson, Mats Jirstrand

Abstract:
Contraction Clustering (RASTER) is a very fast algorithm
for density-based clustering, which requires only a single pass. It
can process arbitrary amounts of data in linear time and in constant
memory, quickly identifying approximate clusters. It also exhibits
good scalability in the presence of multiple CPU cores. Yet, RASTER
is limited to batch processing. In contrast, S-RASTER is an
adaptation of RASTER to the stream processing paradigm that is able
to identify clusters in evolving data streams. This algorithm
retains the main benefits of its parent algorithm, i.e. single-pass
linear time cost and constant memory requirements for each discrete
time step in the sliding window. The sliding window is efficiently
pruned, and clustering is still performed in linear time. Like
RASTER, S-RASTER trades off an often negligible amount of precision
for speed. It is therefore very well suited to real-world scenarios
where clustering does not happen continually but only periodically.
We describe the algorithm, including a discussion of implementation
details.
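
The sliding-window aspect can be illustrated with a short Python sketch. This is a simplification under stated assumptions and not the algorithm as specified in the paper: the class name `SlidingTileCounts`, its parameters, and the pruning strategy (subtracting the tile counts of the expired time step) are all hypothetical.

```python
import math
from collections import Counter, deque


class SlidingTileCounts:
    """Per-tile point counts over the most recent `window` time steps."""

    def __init__(self, window, precision, threshold):
        self.window = window
        self.scale = 10 ** precision
        self.threshold = threshold
        self.steps = deque()     # one Counter per time step in the window
        self.totals = Counter()  # tile counts summed over the window

    def add_step(self, points):
        """Add the points of one time step and prune the expired step."""
        step = Counter(
            (math.floor(x * self.scale), math.floor(y * self.scale))
            for x, y in points
        )
        self.steps.append(step)
        self.totals.update(step)
        if len(self.steps) > self.window:
            expired = self.steps.popleft()
            self.totals.subtract(expired)
            # Remove tiles that no longer hold any points in the window.
            for tile in expired:
                if self.totals[tile] <= 0:
                    del self.totals[tile]

    def significant(self):
        """Tiles with at least `threshold` points in the current window."""
        return {t for t, n in self.totals.items() if n >= self.threshold}


stream = SlidingTileCounts(window=2, precision=1, threshold=2)
stream.add_step([(0.11, 0.11), (0.12, 0.12)])
after_first = stream.significant()
stream.add_step([(3.01, 3.01)])
after_second = stream.significant()
stream.add_step([(3.02, 3.02)])
after_third = stream.significant()
```

The significant tiles could then be joined into clusters only when a result is requested, which matches the scenario of periodic rather than continual clustering.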