Checkpoint Report

April 25, 2017

Anchor 1

Project Progress

We are in the process of analyzing and comparing four existing dynamic memory allocators for multi-threaded environments, which are:

jemalloc: jemalloc is a general purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. jemalloc doesn’t use sbrk() which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. It also has thread specific caching.
TCMalloc: Each thread is assigned a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
Hoard: It is a fast, highly scalable allocator that largely avoids false sharing and is memory efficient. Hoard combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case.
ptmalloc3: multi-threaded version of glibc malloc().

We have identified and are using the following benchmarks for the analysis of the above-mentioned allocators:

Linux-Scalability: Each thread performs 10 million malloc/free pairs of 8 byte blocks in a tight loop. This benchmark aims at testing allocator latency and scalability.
Threadtest: Each thread performs 100 iterations of allocating 100,000 8-byte blocks and then freeing them in order. This benchmark is also used to determine the latency and scalability of the allocator.
Larson: Initially one thread allocates and frees random sized blocks (16 to 80 bytes) in random order, then an equal number of blocks (1024) is handed over to each of the remaining threads. In the parallel phase, which lasts 30 seconds, each thread randomly selects a block and frees it, then allocates a new random-sized block in its place. The benchmark measures how many free/malloc pairs are performed during the parallel phase. Larson captures the robustness of malloc’s latency and scalability under irregular allocation patterns with respect to block-size and order of deallocation over a long period.

The project comprises of two parts, 1) Analyzing the existing dynamic multi-threaded memory allocators and 2) Implementing a lock-free dynamic multi-threaded memory allocator. We have partially completed the first part of the project i.e. run and analyze existing dynamic memory allocators. We have analyzed jemalloc, ptmalloc3 and hoard. At present, we are having some issues in some test cases for analyzing tcmalloc and are working on it.

As per the initially identified schedule, we were supposed to be done with the analysis of the existing dynamic memory allocators, but due to the issues we are facing at this point, the milestone is deferred.

We aim at starting the implementation of the originally planned lock-free dynamic allocator next.

We need to understand the intrinsics and the criticalities involved in the thought of design for the lock-free dynamic memory allocator. We hope to be able to implement and demonstrate the same by the end of the project.

We expect to deliver a lock-free dynamic memory allocator, analysis and comparison (in terms of performance) of the existing memory allocator implementations with the lock-free dynamic memory allocator we will implement. The analysis will be in the form of graphs that would provide a comparison of the same.

We have some preliminary statistics of the existing memory allocators for which we ran the above-mentioned benchmarks. A subset of the statistics is represented in the form of graphs in Figures 1 – 8.

Thread test results

Anchor 2

Anchor 4

Larson test results

Anchor 3

challenges

We foresee the following challenges for the project at this point:

Unclear design decisions for the lock-free dynamic memory allocator implementation.
Measuring false sharing. We need to figure out a strong matrix to analyze false sharing performance of the dynamic memory allocators.

Anchor 5