Contraction Clustering (RASTER) paper published

Our paper “Contraction Clustering (Raster): A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time” has been published in the Springer Lecture Notes in Computer Science series.

Here is the full citation with doi:

Ulm G., Gustavsson E., Jirstrand M. (2018) Contraction Clustering (Raster). In: Nicosia G., Pardalos P., Giuffrida G., Umeton R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science, vol 10710. Springer, Cham
https://doi.org/10.1007/978-3-319-72926-8_6

Alternatively, the submitted manuscript is available in my Gitlab repository:
https://gitlab.com/gregor_ulm/publications

Accepted Abstract for SweDS 2017: Functional Federated Learning in Erlang

At IFL 2017 in Bristol, UK, I presented our work-in-progress paper “Purely Functional Federated Learning in Erlang.” An update to this work will be presented as a poster at the upcoming 5th Swedish Workshop on Data Science (SweDS 2017), which will take place from 12 to 13 December 2017 in Gothenburg, Sweden. It is hosted by the University of Gothenburg. The abstract is reproduced below.

Functional Federated Learning in Erlang

Authors: G. Ulm, E. Gustavsson, and M. Jirstrand

A modern connected car produces gigabytes to terabytes of data per day. Collecting data generated by an entire fleet of cars, and processing it centrally on a server farm, is thus not feasible. The problem is that the total amount of data generated by cars, i.e. on edge devices, is too large to be efficiently transmitted to a central server. However, CPUs used in edge devices such as connected cars but also regular smartphones that connect to the cloud, have been getting more and more powerful in recent years. Tapping into this computational resource is one way of addressing the problem of processing big data that is generated by large numbers of edge devices.

One such approach consists of distributed data processing. Using the example of training an Artificial Neural Network, we introduce a framework for distributed data processing. A particular focus is on the implementation language Erlang. Arguably the biggest strength of the functional programming language Erlang is how straightforward it is to implement concurrent and distributed programs with it. Numerical computing, on the other hand, is not necessarily seen as one of its strengths.

The recent introduction of Federated Learning, a concept according to which edge devices are leveraged for decentralized machine learning tasks, while a central server only updates and distributes a global model, provides the motivation for exploring how well Erlang is suited to such a use case. We present a framework for Federated Learning in Erlang, written in a purely functional style. Erlang is used for coordinating data processing tasks but also for performing numerical computations. Initial results show that Erlang is well-suited for that kind of task.

We provide an overview of the general framework and also discuss an existing and fully realized in-house prototypical implementation that performs distributed machine learning tasks according to the Federated Learning paradigm. While we focus on Artificial Neural Networks, our Federated Learning framework is of a more general nature and could also be used with other machine learning algorithms.

The novelty of our work is that we present the first publicly available implementation of a Federated Learning framework; our work is also the first implementation of Federated Learning in a functional programming language, with the added benefit of being purely functional. In addition, we demonstrate that Erlang can not only be leveraged for message passing but that it also performs adequately for practical machine learning tasks.

Our presentation is based on our work-in-progress paper “Purely Functional Federated Learning in Erlang”, which we presented at IFL 2017. The context of this research is our ongoing involvement in the Vinnova-funded project "On-board/off-board distributed data analysis" (OODIDA), which is a joint-project between the Fraunhofer-Chalmers Research Centre for Industrial Mathematics, Chalmers University of Technology, Volvo Car Corporation, Volvo Trucks, and Alkit Communications.

Contraction Clustering (RASTER): paper and reference implementations

One of the results of my work at the Fraunhofer-Chalmers Research Centre for Industrial Mathematics (FCC) was the discovery of a very fast special-purpose linear-time clustering algorithm, Contraction Clustering (RASTER). I presented our work at the Third International Conference on Machine Learning, Optimization and Big Data (MOD 2017) in Volterra, Italy, earlier this month.

At FCC we decided to create reference implementations in a number of programming languages. You will find reference implementations of RASTER in Python, Erlang, Haskell, and Scala in my GitLab repository. Note that this is a mirror of the official FCC source code release.

Our paper will appear in the Proceedings of MOD 2017, which are part of the Springer Lecture Notes in Computer Science, early next year. The submitted self-archived manuscript of our RASTER paper is likewise available on my GitLab account.

Accepted Paper for IFL 2017: Purely Functional Federated Learning in Erlang

My current work at the Fraunhofer-Chalmers Research Centre for Industrial Mathematics focuses on distributed data analytics in a large-scale industrial setting. We are closely collaborating with Volvo Cars and Volvo Trucks. As part of my work, I explored the suitability of Erlang for distributed machine learning tasks. I wrote up a draft paper on my implementation of Federated Learning in Erlang, which got accepted to IFL 2017, the 29th symposium on Implementation and Application of Functional Languages. It will take place in Bristol, UK, from 30 August to 1 September, 2017. The abstract is reproduced below.


Purely Functional Federated Learning in Erlang

Authors: Ulm, Gregor; Emil Gustavsson, Mats Jirstrand

Arguably the biggest strength of the functional programming language Erlang is how straightforward it is to implement concurrent and distributed programs with it. Numerical computing, on the other hand, is not necessarily seen as one of its strengths. The recent introduction of Federated Learning, a concept according to which edge devices are leveraged for decentralized machine learning tasks, while a central server only updates and distributes a global model, provided the motivation for exploring how well Erlang was suited to such a use case. We present a framework for Federated Learning in Erlang, written in a purely functional style, and compare two versions of it: one that has been exclusively written in Erlang, and one in which Erlang is relegated to coordinating client processes that rely on performing numerical computations in the programming language C. Initial results are promising, as we learnt that a real-world industrial use case of distributed data analytics can easily be tackled with a system purely written in Erlang.
The novelty of our work is that we present the first implementation of a Federated Learning framework in a functional programming language, with the added benefit of being purely functional. In addition, we demonstrate that Erlang can not only be leveraged for message passing but also performs adequately for practical machine learning tasks.

Accepted Paper for MOD 2017: Contraction Clustering (RASTER): A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time

My first big project at the Fraunhofer Chalmers Research Centre for Industrial Mathematics centered around processing huge geospatial data sets. The industrial partner was Scania. The main part of my contribution consisted of generating labelled evolving network graphs that make it possible to trace the routes of vehicles over time. During that work, I discovered a linear-time clustering algorithm (Contraction Clustering, which is itself contracted to RASTER) for detecting clusters with dense centers. Those centers are nodes in the aforementioned graphs. The key novelty of my work is the exploitation of domain knowledge, which facilitates clustering with just one pass through the entire input, neighborhood lookup in O(1), and constant memory requirements.

My paper on RASTER got accepted to MOD 2017, which will take place from 14 to 17 September in Volterra, Italy. The abstract of my paper is below.

Contraction Clustering (RASTER): A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time

Authors: Ulm, Gregor; Emil Gustavsson, Mats Jirstrand

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering methods are infeasible due to their memory requirements or runtime complexity. Contraction Clustering (RASTER) is a linear-time algorithm for identifying density-based clusters. Its coefficient is negligible as it depends neither on input size nor the number of clusters. Its memory requirements are constant. Consequently, RASTER is suitable for big data applications where the size of the data may be huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. Our algorithm is extremely fast. In single-threaded execution on a contemporary workstation, it clusters ten million points in less than 20 seconds when using a slow interpreted programming language like Python. Furthermore, RASTER is easily parallelizable.