15618 Final Project by autumnust

Project Proposals

Introduction

We are aiming at implementing the concurrent hashmap using hopscotch hashing technique, one of the modern hashing techniques popular in academia. The idea is inspired by former researchers' result on Cuckoo hashing techniques, which implement Cuckoo hashing for read-dominant workload and integrated with Memcached. As Hopsotch hashing idea is published more recently and claims to have better overall performance on sequential version, we are driven by the curiosity of discovering the performance of its concurrent version, comparing with Cuckoo hashing and other popular concurrent Hashmap implementations.

The project interests us in two perspectives:

As Hopscotch hashing claims to have better performance on sequential version and delivering higher throughput than Cuckoo hashing, we would like to explore if concurrent implementation based on Hopscotch hashing can still have better scalability and throughput, comparing with Cuckoo Hashing.
Retrospectively, the whole parallel course can be summarized into three folds:
- Parallel architecture in different abstraction level and programming diagrams
- Implementation details on Parallel architecture
- Parallel architecture application
We believe our assignment has covered the first point very well and we do benefit a lot through it. While the second part, which is equally important, we don't have much hands-on practice opportunities through the course. We would like to the take the advantage of final project to explore topics on fine-grained locking, lock-free data structure, etc.

Background

Unlike Cuckoo hashing which combines two different hash function, there's only single hash function in Hopscotch hashing. Items hashed to the same entry will be resolved in the following way: If the entry is empty, the item will be stored there. If not, it will be rehashed into somewhere within H distance from the original place that hashed to, where H is a constant. Every entry has hop-information associated with it: a H-bit bitmap, indicating if there are item that original hashed to current entry stored in the next H neighboring entries. The candidate-searching for collision resolution is by linear probing method, we record the first available position as j. If j is not within H range, we need to find an item y which lies between j and original hashed position i. We displacing y to j to creates a new slot closer to i. Repeat this process, until no empty slots available, we will do the rehashing.

Goals, Deliverable and Schedule

Here are the deliverables that we plan to achieve:

Related work reading and sequential implementation using Hopscotch hashing. [11/15]
Coarse-grained parallel implementation using Hopscotch hashing [11/22]
Analysis for naive parallel implementation and improve for performance in finer-grained manner. [11/29]
Performance benchmarking on different concurrent implementation of Hashmap [12/4]
*Optimization based on write-expensive storage medium [12/10]

The percentage following the items are the goal-setting: If the implementation of sequential version go slowly, we will at lease finish the coarse-grained version of Hopscotch hashing technique to see the performance result. We are mainly aiming at finishing the third and fourth point, which is to implement fine-grained Hopstoch hashing concurrent hashmap. Considering the actual application, the workload will be dominant by read, while interleaved with write operations.

For the fifth point made above, it is a 'Hope-to-achieve' stretch goal that we don't actually have an idea for design and implementation, currently. The nature of write-expensive storage medium like Non-volatile memory can lead to different design of data structures.

Evaluation Method and Challenge

The concurrent Hashing techniques have been explored by researchers for years. The baseline we select for benchmarking result is ConcurrentHashMap.java from java.concurr.util. To the best of our knowledge, this is the most efficient concurrent hash algorithm. Also, we mentioned a similar but older techniques called Cuckoo hashing, implemented by Bin. To evaluate our result, we will release a benchmarking result between our hopscotch hashing-based implementation and those existing implementation, with respect to scalability, throughput and memory efficiency. The final result will be illustrated in chart indicating the performance and scalability versus number of threads.

The challenge comes from the rigorousness of data structure design: We need to ensure the generality of the data structure, handling different race condition and deadlock problems, while emphasize on read-heavy workload. Furthermore, the rehashing is always a major difficulties for hashmap implementation.

For the stretch goal, there are obstacles for both implementation and experiments. For the implementation, as we have mentioned, there's no specific paper or research result that we could refer to, to handling workload under write-expensive storage medium, for a specific data structure. For the experiments, we may need to apply for the access of Non-volatile memory emulator from Prof.Andy Pavlo, this is also not guaranteed.

Midway Checkpoint

What have we been doing for this project in the past few weeks? We have throughly read the paper in which Hopscotch Hashing is proposed. By reading this paper and some other related work, such as Lea's implementation of Java's ConcurrentHashMap, we have obtained an general idea about what might affect the performance of an concurrent Hashtable, and techniques of optimization. After we understand the details of Hopscotch Hashing, we have implemented, in java, an serial version and a coarse-grained concurrent version of it.

What have we learnt so far? Hopscotch hashing is an open addressing hashing algorithm. In the paper where Hopscotch hashing is proposed, the author gives pseudocode for the concurrent version of Hopscotch hashing. It is worth mentioning that the author uses bucket-level locking, which is fine if the only contains(), add() and remove() methods are needed. However in practice, some table-level lock is required, such as calls to getSize(), isEmpty() or rehash(). In this case, having bucket-level lock is not a good idea. The solution is proposed by Doug Lea, in his implementation of ConcurrentHashMap. Instead of having a single table-level lock, we can have N reentrant lock, each covering a continuous portion of the array of buckets. To insert new pair, simply grab the lock corresponding to the portion where the pair will be inserted. To lock the entire table, recursively grab all the locks. We can further optimize the above locking scheme, by using 2 locks for each portion(thus 2N locks in total). One is for sharedLock(readLock), the other is for exclusiveLock(writeLock). This can be helpful, as for a typical hash table, 90% of the operations are read operations only. Another idea we would like to try is borrow from MemC3[3], which is to separate Cuckoo path’s searching and execution, thereby reducing the size of critical section. In Hopscotch Hashing, there’s also a similar process to search a path for swapping a far empty slot into H-range of original hash position, and physically execute each elements swapping along the path. This idea is optional given the time limits.

Project State

As mentioned in the first part, we have implemented an serial version and coarse-grained concurrent version of Hopscotch hashing table in Java. We have also made it clear what will we benchmark our final implementation of Hopscotch hashing against. In particular, we will compare the speedup of fixed-size data insertion and static data read, on the latedays machines (Intel Xeon E5645 CPU) on the following occasions:

A single table-level lock for read and write.
A Hashtabel-level read lock and write lock.
N locks for read and write.
N locks for read and N locks for write.
Java’s ConcurrentHashMap.

Through the first four schemes we are aiming at verifying the scalability of our design towards multiple cores. Specifically, starting with single-core experiment, we add more cores into the experiment setting(control through CPU affinity[1]) until the change of speed up becomes non-obvious. For the fifth point, Java’s java.util.concurrent.ConcurrentHashMap, to the best of our knowledge, is one of the fastest thread-safe Hash table that we can find on JVM. We will compared our Hopscotch Hashing table with this widely-used implementation in terms of read/write throughput, scalability and space efficiency. Especially, as the Hopscotch claims to maintain satisfying performance even when the Hash table is over 90% full[2], we are expecting to see the superiority, in terms of read/write throughput in such circumstance. We have to admit, though, that we are slightly behind our schedule. So we both decided to prioritize this project and spend more time on it during the next few weeks. Hopefully by wisely spending the Thanksgiving vacation we get back on the track.

Poster Session

For Poster session, we will be presenting several graphs, including speedup, read/write throughput with respect to different number of cores. We will also have our machines set up such that people can come and see the running instances of our code, but that hardly qualifies as a demo.

Challenges

Our biggest challenge is lack of experience in designing and implementing such kind of fundamental data structure for speed. Also the separation of Hopscotch path discovery and execution requires subtle lock management. Currently we are still facing lot of nondeterministic bugs.

Reference

[1] http://stackoverflow.com/questions/8882885/is-it-possible-to-force-an-existing-java-application-to-use-no-more-than-x-cores [2] Herlihy, M., Shavit, N., & Tzafrir, M. (2008, September). Hopscotch hashing. In International Symposium on Distributed Computing (pp. 350-364). Springer Berlin Heidelberg.] [3] Fan, B., Andersen, D. G., & Kaminsky, M. (2013). MemC3: Compact and concurrent memcache with dumber caching and smarter hashing. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (pp. 371-384).