This note is part of my learning notes on Data Structures & Algorithms.


Introduction

Hash tables are a data structure that supports fast INSERT/DELETE/SEARCH operations. Each operation takes expected constant time O(1).

  • Insert(k): Insert key k into the hash table.
  • Lookup(k): Check if key k is present in the table.
  • Delete(k): Delete key k from the table.

Let U be the universe of all keys, and let M be the size of this universe, where M is very large. In a hash table of size n, each key k ∈ U is mapped to one of n “buckets” by a hash function.

A hash function h: U → {1, ..., n} maps each element of U to one of the buckets 1, ..., n. (Cryptographic hash functions, which return a fixed-size string of bytes often written as a hexadecimal number, are a different, stronger notion; for hash tables the bucket-mapping view is all we need.)

Examples of hashing methods:

  • division
  • multiplication
  • universal hashing
  • dynamic perfect hashing
  • static perfect hashing

This technique, called hashing, determines an index (location) for the storage of an item in a data structure.
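As a concrete illustration, here is a minimal sketch of mapping a key to a bucket index. Python's built-in hash() and the table of 8 buckets are illustrative assumptions, not part of the notes above:

```python
n = 8  # number of buckets (illustrative)

def bucket_index(key, n):
    """Map a key to one of the buckets 0 .. n-1."""
    # Python's hash() stands in for the hash function h;
    # taking the result mod n picks the bucket.
    return hash(key) % n

idx = bucket_index("apple", n)  # some index in 0 .. 7
```

Note that Python randomizes string hashing per process (PYTHONHASHSEED), so the index for a given string can differ between runs; within one run it is stable.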


Random Hash Functions

We want a hash table to not have too many buckets (to save space), and we want the items to be spread out across the buckets (to enable fast operations). Hence, we incorporate randomness into hashing. There are two ways to do so:

  1. Assume that the set of keys stored in the hash table is random, or
  2. Assume that the hash function h is random.

Let there be n keys, and let X be the size of the bucket that key x_i maps to (the number of keys hashing to the same bucket as x_i, including x_i itself). With a random hash function h: U → {1, ..., n}, the expected cost of performing any of the operations INSERT/DELETE/SEARCH is proportional to:

\begin{aligned} E[X] &= \sum^{n}_{j=1} Pr\left[ h(x_i) = h(x_j) \right] \\ &= 1 + \sum_{j \neq i} Pr\left[ h(x_i) = h(x_j) \right] \\ &= 1 + \frac{n-1}{n} \\ &\leq 2 \end{aligned}

When h is random,

Pr\left[ h(x_i) = h(x_j) \right] = \frac{1}{n}

Algorithm:

  1. Choose any n items u_1, u_2, ..., u_n ∈ U, and any sequence of INSERT/DELETE/SEARCH operations on these items.
  2. Choose a random hash function h: U → {1, ..., n}.
  3. Hash it out.
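The bound E[X] ≤ 2 can be checked empirically by simulating a random hash function: assign each key a uniformly random bucket and measure the average size of the bucket containing each key. A sketch (the table size equals the number of keys, as in the derivation above; n = 1000 and the seed are illustrative):

```python
import random

random.seed(0)  # deterministic for reproducibility

n = 1000  # number of keys, and also number of buckets
# A "random hash function": each key gets a uniformly random
# bucket, independently of every other key.
h = [random.randrange(n) for _ in range(n)]

# Count how many keys landed in each bucket.
bucket_size = [0] * n
for b in h:
    bucket_size[b] += 1

# Average, over keys x_i, of the size of x_i's own bucket;
# its expectation is 1 + (n-1)/n, which is at most 2.
avg = sum(bucket_size[h[i]] for i in range(n)) / n
```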

Collisions

Suppose there’s an array A of some size M and a hash function h: U → {0, ..., M−1}. A collision occurs when h(x) = h(y) for two different keys x and y.

Claim: for any hash function h, if |U| ≥ (N−1)M + 1, there exists a set S of N elements that all hash to the same location. (This follows from the pigeonhole principle: if every one of the M locations received at most N−1 keys, the universe could have at most (N−1)M keys.)

We can handle collisions with a method called separate chaining, by having each entry in A be a linked list. To insert an element, we just put it at the head of its list. If h is a good hash function, we can hope that the lists will stay small.
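A minimal separate-chaining table might look like the following sketch (the class name, table size, and the use of Python lists as chains are illustrative choices, not from the notes):

```python
class ChainedHashTable:
    def __init__(self, m=16):
        self.m = m
        self.A = [[] for _ in range(m)]  # one chain per slot

    def _slot(self, key):
        return hash(key) % self.m

    def insert(self, key):
        chain = self.A[self._slot(key)]
        if key not in chain:
            chain.insert(0, key)  # put it at the head of the list

    def lookup(self, key):
        # Only the one chain for this key's slot is scanned.
        return key in self.A[self._slot(key)]

    def delete(self, key):
        chain = self.A[self._slot(key)]
        if key in chain:
            chain.remove(key)
```

Each operation touches only a single chain, so the cost is proportional to that chain's length, which is why we want the hash function to keep chains short.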


Hashing By Division

Suppose there is a hash table of size m. Using division, the hash function is:

h(k) = k \bmod m
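A sketch of the division method (m = 7 is an illustrative table size; a common rule of thumb is to pick m prime and not too close to a power of 2, so that the low-order bits of k don't dominate):

```python
m = 7  # table size (illustrative; often chosen prime)

def h(k):
    # Hashing by division: the remainder of k divided by m
    # is the bucket index.
    return k % m

# h(10) -> 3, h(7) -> 0, h(3) -> 3 (10 and 3 collide here)
```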


Universal Hashing

A uniformly random hash function leads to balanced buckets, decreasing the number of collisions and ensuring all INSERT/DELETE/SEARCH operations take expected O(1) time. But it’s not practical to use this method, because we would have to store the chosen function, and there are n^M of them: writing down a random one takes log(n^M) bits, which is M log(n).

The solution is to pick from a smaller set of hash functions. This chosen collection of hash functions is called a hash family H. When h is chosen uniformly at random from H, the family is said to be universal if, for each pair of distinct keys x, y ∈ U, the probability that they hash to the same value (a collision) is at most 1/m.

Formally: H is universal if:

\forall x, y \in U, x \neq y

Pr\left[ h(x) = h(y) \right] \leq \frac{1}{m}

For a small universal hash family of size O(m^2), we need only O(log m) bits to store the chosen function.

Properties of Universal Hashing

  1. Low Collision Probability: Universal hashing ensures that the probability of any two distinct keys colliding is low, specifically at most 1/m.
  2. Uniform Distribution: When a hash function is chosen randomly from a universal family, the hash values are uniformly distributed over the range, reducing the chances of clustering.
  3. Independence: The choice of hash function is independent of the keys, making it less likely for an adversary to predict collisions.
  4. Efficiency: Universal hash functions can be computed efficiently, typically in constant time O(1).

Theorem

For n arbitrary distinct keys and a hash function h ∈ H chosen at random from a universal hash family, the expected number of keys in the slot of any given key satisfies

\begin{aligned} E[\text{number of keys in the slot of a given key}] &\leq 1 + \alpha \\ &= 1 + \frac{n}{m} \end{aligned}

where α = n/m is the load factor of the table.

Matrix method

The matrix method is one way to construct a universal hash family: treat each key as a u-bit vector, pick a random b × u binary matrix M (so the table size is m = 2^b), and define h(x) = Mx with arithmetic mod 2. For any two distinct keys the collision probability is exactly 1/2^b = 1/m.
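A possible sketch of the matrix method in Python, where each matrix row is stored as a u-bit integer and each output bit is the parity of a random subset of the key's bits (the key width u, bucket bits b, and seed are illustrative):

```python
import random

u, b = 32, 4           # key bits and bucket bits; m = 2**b buckets
random.seed(1)
# Each row of the random 0/1 matrix, packed into a u-bit integer.
M = [random.getrandbits(u) for _ in range(b)]

def h(x):
    """Hash a u-bit key x to a bucket in {0, ..., 2**b - 1}."""
    out = 0
    for i, row in enumerate(M):
        # <row, x> mod 2: parity of the bits where row and x overlap.
        parity = bin(row & x).count("1") % 2
        out |= parity << i
    return out
```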


Perfect Hashing

Perfect hashing is a type of hashing that guarantees no collisions for a given set of keys, providing optimal performance with O(1) worst-case lookup time.

We say a hash function is perfect for a set S ⊆ U if it maps all elements of S to distinct values

h: S \rightarrow \{0, 1, ..., m-1\}

such that

\forall x, y \in S, x \neq y

h(x) \neq h(y)

Types of Perfect Hashing

  • Static Perfect Hashing: A perfect hash function where a set of keys is fixed, and the function ensures no collisions among these keys.
  • Dynamic Perfect Hashing: A perfect hash function that can handle a changing set of keys without collisions.

Properties of Perfect Hashing

  • No collisions: by definition, a perfect hash function guarantees no collisions among the keys in the set S.
  • Efficiency: optimal performance with O(1) worst-case lookup time.
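One way to realize static perfect hashing is to draw hash functions at random from a Carter–Wegman style family until one happens to be collision-free on S; with a table of quadratic size m = n², a random draw succeeds with constant probability, so this terminates quickly. A sketch (the key set, the prime p, and the seed are illustrative assumptions):

```python
import random

random.seed(2)
S = [3, 14, 15, 92, 65, 35]   # the fixed key set (illustrative)
m = len(S) ** 2               # quadratic table size

def find_perfect_hash(S, m):
    p = 2_147_483_647         # a prime larger than every key in S
    while True:
        # Draw h(k) = ((a*k + b) mod p) mod m at random.
        a = random.randrange(1, p)
        b = random.randrange(p)
        h = lambda k, a=a, b=b: ((a * k + b) % p) % m
        if len({h(k) for k in S}) == len(S):
            return h          # all values distinct: h is perfect on S

h = find_perfect_hash(S, m)
```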

Example: Balls And Bins Model

The balls and bins model is a probabilistic model used to analyze the random allocation or distribution of objects. In this model, balls represent distinct objects, and bins represent distinct containers or possibilities.

I. No Collision (Birthday Paradox)

Suppose we throw n balls randomly into m bins. (Here the balls represent the keys to be hashed and the bins represent the slots in the hash table.)

Probability that no two balls collide:

\begin{aligned} Pr\left[ \text{no 2 balls collide} \right] &= 1 \times \left( \frac{m-1}{m}\right) \times \left( \frac{m-2}{m}\right) \times \cdots \times \left( \frac{m-n+1}{m}\right) \\ &= \frac{(m-1)(m-2)\cdots(m-n+1)}{m^{n-1}} \\ &= \prod_{i=1}^{n-1}\left(1 - \frac{i}{m}\right) \\ &\approx e^{-n(n-1)/(2m)} \end{aligned}

This shows how large m needs to be to have no collisions: roughly m = Ω(n²) bins are needed before the no-collision probability is constant (the birthday paradox).
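The exact product can be computed directly and compared with the exponential approximation e^{-n(n-1)/(2m)}. The classic instance is n = 23 balls (people) and m = 365 bins (birthdays), where the no-collision probability is already below one half:

```python
import math

def p_no_collision(n, m):
    """Exact probability that n balls land in n distinct bins."""
    p = 1.0
    for i in range(1, n):
        p *= (m - i) / m      # the i-th ball avoids the first i
    return p

n, m = 23, 365
exact = p_no_collision(n, m)                  # about 0.4927
approx = math.exp(-n * (n - 1) / (2 * m))     # exponential estimate
```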

Max Load

Suppose m = n, which means we throw n balls randomly into n bins. What is the maximum number of balls in any bin? A classical result: with high probability, the maximum load is Θ(log n / log log n).
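A quick simulation illustrates how small the fullest bin stays relative to n (the value n = 100,000 and the seed are illustrative):

```python
import random

random.seed(3)
n = 100_000
load = [0] * n                 # balls currently in each bin
for _ in range(n):
    load[random.randrange(n)] += 1   # throw one ball uniformly

max_load = max(load)  # typically single digits for this n
```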


Example: Bloom Filter

A Bloom filter is a space-efficient, probabilistic data structure based on hashing. It stores hashes of elements rather than the elements themselves, and is used to test whether an element belongs to a set. Example: checking the availability of a username.

A Bloom filter is similar to a hash table in that it uses hash functions to map a key to buckets. However, it does not store the key in those buckets; instead it hashes the key with multiple hash functions and simply marks each resulting bucket as filled. Distinct keys can therefore map to the same filled buckets, causing false positives.

False positives occur when an element is not present in the original set, but the Bloom filter mistakenly indicates that it is. The probability of a false positive Pr(FP) depends on the number of elements inserted into the filter n, the capacity of the filter M, and the number of hash functions h:

Pr(FP) \approx \left( 1 - e^{-\frac{hn}{M}} \right)^h


Steps to using a Bloom filter:

  1. Get the input key
  2. Calculate its hash values
  3. Mod each hash by the filter size to get bucket indices
  4. Mark those buckets as filled (insert)
  5. To look up a key, check whether all of its buckets are filled
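The steps above can be sketched as a minimal Bloom filter. Deriving the i-th hash function by salting SHA-256 with i is an illustrative choice (any family of independent hash functions works), as are the default sizes:

```python
import hashlib

class BloomFilter:
    def __init__(self, M=1024, h=3):
        self.M = M                  # number of buckets (bits)
        self.h = h                  # number of hash functions
        self.bits = [False] * M

    def _indices(self, item):
        # Step 2-3: compute each salted hash, then mod by the size.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.M

    def insert(self, item):
        # Step 4: mark every bucket for this item as filled.
        for idx in self._indices(item):
            self.bits[idx] = True

    def lookup(self, item):
        # Step 5: present only if ALL of its buckets are filled.
        # May be a false positive, but never a false negative.
        return all(self.bits[idx] for idx in self._indices(item))
```

Note that elements can never be deleted from this structure: clearing a bit could erase evidence of other inserted keys that share it.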

Conclusion

A hash table can be designed such that it

  • has a hash function that maximizes randomness and produces the least amount of collisions
  • supports INSERT/DELETE/SEARCH operations in O(1) expected time
  • requires O(n log(M)) bits of space:
    • O(n) buckets
    • O(n) items with log(M) bits per item
    • O(log(M)) bits to store the hash function

U: universe of all possible keys
M: size of the universe (number of possible keys)
n: number of actual keys stored in the hash table
k: a key
x, y: a pair of distinct keys
m: size of hash table / number of slots in table / number of buckets

