This note is part of my learning notes on Data Structures & Algorithms.


Introduction

Data streaming algorithms are a set of techniques for processing data that is too large to fit into memory all at once or that arrives continuously over time. These algorithms operate on data streams: sequences of data elements generated over time that must be processed in a single pass (or a small number of passes) with limited memory. They typically aim to provide approximate answers to queries or analyses on the data stream while using minimal memory and computational resources.


Heavy Hitters

Heavy Hitters is the task of using a streaming algorithm to identify the elements whose frequency in the stream exceeds a certain threshold.

Input: A stream of items $x_1, x_2, \ldots, x_m \in U$, where $U$ is a large universe with size $|U| = N$.

Output: A list $\{ x \in U : x \text{ appeared } \geq \epsilon \cdot m \text{ times in the stream} \}$. For example, with $\epsilon = 0.01$ and a stream of $m = 10^6$ items, the output should contain every element that appeared at least $10{,}000$ times.
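For comparison, here is a minimal sketch of the exact (non-streaming) baseline: it keeps one counter per distinct item, so its memory grows with the number of distinct elements, which is exactly what streaming algorithms try to avoid. The function name `exact_heavy_hitters` and the parameter `eps` are illustrative choices, not from the original note.

```python
from collections import Counter

def exact_heavy_hitters(stream, eps):
    """Exact baseline: count every distinct item, then return those
    appearing at least eps * m times. Memory grows with the number of
    distinct items, which streaming algorithms aim to avoid."""
    counts = Counter(stream)      # one counter per distinct item
    m = sum(counts.values())      # total stream length m
    return [x for x, c in counts.items() if c >= eps * m]

# With eps = 0.25 and m = 8, the threshold is 2 occurrences.
print(exact_heavy_hitters(["a", "b", "a", "c", "a", "b", "a", "d"], 0.25))
# ['a', 'b']
```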


Count-Min Sketch

Count-Min Sketch is a probabilistic data structure used for approximate counting in streaming data. It estimates the frequency of elements in a data stream using a fixed, limited amount of memory.

Algorithm

1. Initialize a fixed-size two-dimensional array of counters with $d$ rows and $w$ columns, all set to zero. Each row is associated with its own hash function.
2. When an element arrives from the data stream, hash it with each of the $d$ hash functions; each hash function maps the element to a column index in its row.
3. Increment the counter at the hashed index in every row.
4. To estimate the frequency of an element, hash it with the same hash functions and take the minimum of the $d$ corresponding counters. Collisions can only inflate counters, so every counter is an overestimate and the minimum is the tightest available estimate (see the sketch after this list).
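Below is a minimal Python sketch of these steps. The sizing follows the standard analysis, where width $\approx e/\epsilon$ and depth $\approx \ln(1/\delta)$ give estimates within $\epsilon \cdot m$ of the true count with probability at least $1 - \delta$; the class name `CountMinSketch` and the use of Python's seeded built-in `hash` as a stand-in for independent hash functions are illustrative assumptions, not from the original note.

```python
import math
import random

class CountMinSketch:
    def __init__(self, eps=0.01, delta=0.01):
        # Standard sizing: width ~ e/eps and depth ~ ln(1/delta) give
        # estimates within eps * m of the truth w.p. >= 1 - delta.
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        # One random seed per row stands in for d independent hash functions.
        self.seeds = [random.randrange(2**31) for _ in range(self.depth)]

    def _index(self, row, x):
        # Salt the built-in hash with the row's seed so each row
        # maps the element to its own column index.
        return hash((self.seeds[row], x)) % self.width

    def update(self, x, count=1):
        # Steps 2-3: increment one counter per row at the hashed column.
        for row in range(self.depth):
            self.table[row][self._index(row, x)] += count

    def query(self, x):
        # Step 4: collisions only add to counters, so each row
        # overestimates; the minimum across rows is the tightest estimate.
        return min(self.table[row][self._index(row, x)]
                   for row in range(self.depth))

cms = CountMinSketch(eps=0.01, delta=0.01)
for item in ["a", "b", "a", "c", "a"]:
    cms.update(item)
print(cms.query("a"))  # never underestimates: >= 3, and usually exactly 3
```

Note that a query can only overestimate, never underestimate, so the sketch pairs naturally with the heavy-hitters task: reporting every element whose estimated count is at least $\epsilon \cdot m$ never misses a true heavy hitter.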

