Current Research
In the following we outline three research directions we are pursuing at the moment. Note that we are also working on other small projects not listed here.
Perfect Hashing and Consensus
A perfect hash function for a set of objects is a data structure that injectively maps those objects to unique IDs. Rather than storing information on the objects in any classical way, some perfect hash functions store seemingly meaningless numbers known as seed values such that a pseudo-random function producing IDs from the object and those seed values happens to satisify the desired uniqueness property.
We have developed an strategy for computing and storing such seed values called Combined Search and Encoding of Successful Seeds (Consensus) and won the Best Paper Award at ESA 2025. Paired with one of the most space efficient approaches to perfect hashing we achieve unprecedented space efficiency overall.
We are currently investigating ideas for combining the Consensus idea with the fastest to perfect hashing to see if we can match their speed while improving their space efficiency.
k-perfect Hashing
Hash functions are routinely used to randomly partition a set into parts of expected equal size, such as when assigning jobs to machines. A k-perfect hash function is a tiny data structure constructed for the set that guarantees that none of the part’s sizes exceeds k. Within static data structures such perfectly balanced partitioning can help with retrieving data with fewer cache misses and branch mispredictions.
While k-perfect hashing has received some attention in the past, even basic questions such as how much space such data structures must need have not been satisfactorialy answered. We have explored some approaches to k-perfect hashing in an ESA 2025 paper and intend to provide a more comprehensive account in the future.
Average Case Analysis and Learned Data Structures
Oftentimes an algorithm’s or data structure’s performance depends on the input data, i.e. there are easy cases and hard cases. In the absense of further assumptions one typically reports the performance in the hardest case as a conservative upper bound. Focusing on worst case performance in this way neglects that practical input data is often structured in a way that allow approaches that are aware of this structure to be significantly faster or, in the case of data structures, smaller than approaches that expect the worst.
There are two overlapping ways to think about such opportunities.
- Average Case Analysis assumes a specific distribution of the input data (such as a uniform distribution) and designs an algorithm or data structure that has good expected performance if the data truely comes from such a distribution. We are actively working on a data structure for sorted sequences (supporting predecessor and successor operations). It is well-known that search trees offer optimal (logarithmic) worst-case performance but much better performance (constant time) is possible for certain kinds of random data. We aim to simplify known solutions.
- Learned Data Structures combine classical data structures with machine learning techniques, which could be anything from a simple linear regression model to a powerful LLM. The intuition is that if the machine learning model manages to learn the data well then the classical part of the data structure need only represent the data to the extend to which the model is uncertain or wrong. We managed to apply to idea to the retrieval problem and created a learned retrieval data structure (to appear in VLDB) that, in some cases, yield significantly compression compared to classical retrieval data structures.