This note is part of my learning notes on Data Structures & Algorithms.


Introduction

Randomized algorithms are algorithms that solve a problem step by step by making random choices from a finite set of possible steps. We call an algorithm randomized if its behavior is determined not only by its input but also by values produced by a random number generator (say, a value $r \in \{1, \dots, R\}$). Hence, a randomized algorithm may produce different outputs when applied more than once to the same input.

We use randomized algorithms because they can speed up computation, often giving a shorter expected running time (reduced time complexity) than comparable deterministic algorithms.

Randomized algorithms can be designed either as a Las Vegas algorithm or a Monte Carlo algorithm. A Monte Carlo algorithm gives a weaker guarantee than a Las Vegas algorithm (it may return a wrong answer); if the output can be verified efficiently, a Monte Carlo algorithm can be turned into a Las Vegas one by re-running it until the result is correct. A small sketch contrasting the two follows the list below.

  • Monte Carlo algorithm: may produce incorrect results with a certain probability, but it will stop after a fixed amount of time or after a certain number of iterations. A common example is the Karger-Stein algorithm.
    • output is a random variable
    • running time is bounded by something deterministic
  • Las Vegas algorithm: keep running the algorithm until it produces the correct result, but the running time may vary. A common example is the randomized quicksort algorithm.
    • output is deterministic
    • running time is a random variable
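
As a toy illustration (my own example, not from the notes), the sketch below contrasts the two flavors on the task of finding the index of a 1 in an array in which half the entries are 1; the function names and the cutoff `k` are illustrative choices.

```python
import random

def las_vegas_find_one(arr):
    """Las Vegas: always returns a correct index, but the number of
    random probes it makes is a random variable."""
    while True:
        i = random.randrange(len(arr))
        if arr[i] == 1:
            return i

def monte_carlo_find_one(arr, k=20):
    """Monte Carlo: makes at most k probes (deterministic bound on work),
    but may fail and return None with probability (1/2)^k here."""
    for _ in range(k):
        i = random.randrange(len(arr))
        if arr[i] == 1:
            return i
    return None  # possibly wrong: a 1 exists but was not found

arr = [1, 0] * 50  # half the entries are 1
print(las_vegas_find_one(arr), monte_carlo_find_one(arr))
```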

Indicator Random Variable

An indicator random variable is a function that assigns a real value to an outcome in the sample space of a random experiment. It provides a convenient method for converting between probabilities and expectations. Suppose we are given a sample space $S$ and an event $A$. Then the indicator random variable $I\{A\}$ associated with event $A$ is defined as:

$$I\{A\} = \begin{cases} 1, & \text{if } A \text{ occurs} \\ 0, & \text{if } A \text{ does not occur} \end{cases}$$
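
A minimal sketch of the key identity $E[I\{A\}] = \Pr(A)$, using a fair die as an assumed example event:

```python
import random

# Estimate Pr(A) for the event A = "a fair die shows 6" as the empirical
# mean of the indicator I{A}, using E[I{A}] = Pr(A).
trials = 100_000
indicator_sum = sum(1 if random.randint(1, 6) == 6 else 0 for _ in range(trials))
print(indicator_sum / trials)  # close to 1/6 ≈ 0.1667
```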


High Probability Bounds

In randomized algorithms, high probability bounds refer to probabilistic guarantees on the performance or behavior of the algorithm. Specifically, these bounds indicate that the algorithm will achieve certain outcomes with high probability, meaning that the probability of failure is very low.

For example, if an algorithm has a high probability bound of $1 - \delta$, it means that the algorithm will produce the correct result or behave as expected with probability at least $1 - \delta$, where $\delta$ is a small constant typically representing the probability of failure.

Markov’s Inequality

Markov’s inequality provides a loose upper-bound on the probability that a non-negative random variable is greater than or equal to a positive constant. It is useful for providing a quick and simple estimate of the upper bound on the probability that a random variable deviates from its expected value. This is particularly useful in algorithms where the exact distribution of the variable is unknown but its expectation is known.

Theorem: for any non-negative random variable $X$ and any constant $\delta > 0$:

$$P(X \geq \delta) \leq \frac{E[X]}{\delta}, \quad \text{for } X \geq 0,\ \delta > 0$$

where:

  • $P(X \geq \delta)$ is the probability that the random variable $X$ is greater than or equal to $\delta$.
  • $E[X]$ is the expected value (mean) of the random variable $X$.
  • $\delta$ is the threshold ($X$ is at least $\delta$).
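
As a quick numerical check (the die example is my own): for one roll of a fair six-sided die, $E[X] = 3.5$, so Markov gives $\Pr[X \geq 5] \leq 3.5/5 = 0.7$, while the true probability is $1/3$.

```python
import random

# Empirical check of Markov's inequality for X = one roll of a fair die.
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) >= 5)
print("empirical Pr[X >= 5]:", hits / trials)  # about 0.333
print("Markov bound        :", 3.5 / 5)        # 0.7 (loose, as expected)
```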

Chebyshev’s Inequality

Chebyshev’s inequality is an extension of Markov’s inequality. It provides an upper bound on the probability that a random variable deviates from its mean by at least a given amount, which is critical in algorithms where high reliability or consistency is required despite the presence of randomness.

Theorem: for any random variable $X$ with finite mean $\mu$ and finite non-zero variance $\sigma^2$, and for any constant $\delta > 0$:

$$P(|X - \mu| \geq \delta) \leq \frac{Var[X]}{\delta^2}$$

where:

  • $P(|X-\mu| \geq \delta)$ is the probability that the absolute deviation of the random variable $X$ from its mean $\mu$ is greater than or equal to $\delta$.
  • $\mu = E(X)$ is the mean (expected value) of the random variable $X$.
  • $Var[X] = \sigma^2$ is the variance of the random variable $X$.
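
Continuing the same assumed die example: $\mu = 3.5$ and $Var[X] = 35/12 \approx 2.92$, so Chebyshev gives $\Pr[|X - 3.5| \geq 2.5] \leq 2.92/2.5^2 \approx 0.47$, while the true probability (rolling a 1 or a 6) is $1/3$.

```python
import random

# Empirical check of Chebyshev's inequality for X = one roll of a fair die.
mu, var, delta = 3.5, 35 / 12, 2.5
trials = 100_000
hits = sum(1 for _ in range(trials) if abs(random.randint(1, 6) - mu) >= delta)
print("empirical Pr[|X - mu| >= 2.5]:", hits / trials)     # about 0.333
print("Chebyshev bound              :", var / delta ** 2)  # about 0.467
```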

Chernoff Bounds

Chernoff Bounds are particularly useful in settings where an algorithm needs to ensure that the probability of an undesirable outcome (e.g., large deviation from the mean) is extremely low.

Given a sequence of independent random variables $X_1, X_2, \dots, X_n$ with $0 \leq X_i \leq 1$ and their sum $S = X_1 + X_2 + \dots + X_n$, the Chernoff bounds give upper bounds on the probability that $S$ deviates significantly from its expected value $E[S]$.

There are several forms of Chernoff bounds, but one common form is the following:

$$P(S \geq (1 + \delta)E[S]) \leq e^{-\frac{\delta^2 E[S]}{3}}, \quad 0 < \delta \leq 1$$

$$P(S \leq (1 - \delta)E[S]) \leq e^{-\frac{\delta^2 E[S]}{2}}, \quad 0 < \delta < 1$$

Comparison between Markov, Chebyshev, and Chernoff Bounds:

For the sum $S$ of $n$ independent variables above, the bound given by Markov’s inequality is the “weakest” one: it is a constant that does not change as $n$ increases. The bound given by Chebyshev’s inequality is “stronger” than the one given by Markov’s inequality: it decreases polynomially in $n$. The strongest bound is the Chernoff bound, which goes to zero exponentially fast.
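
A small sketch (my own illustration) that makes the comparison concrete: for $S$ the sum of $n$ independent fair coin flips, it computes the three bounds on $\Pr[S \geq 0.75n]$, i.e. $\delta = 1/2$.

```python
import math

# Tail bounds on Pr[S >= 0.75 n] for S = sum of n fair coin flips:
# E[S] = n/2, Var[S] = n/4, and delta = 1/2 in the Chernoff bound above.
for n in (10, 100, 1000):
    markov    = (n / 2) / (0.75 * n)                 # E[S]/threshold  -> 2/3, constant
    chebyshev = (n / 4) / (0.25 * n) ** 2            # Var[S]/(n/4)^2  -> 4/n
    chernoff  = math.exp(-(0.5 ** 2) * (n / 2) / 3)  # exp(-delta^2 E[S]/3) -> e^(-n/24)
    print(f"n={n:5d}  Markov={markov:.3f}  Chebyshev={chebyshev:.4f}  Chernoff={chernoff:.2e}")
```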

Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of $n$ independent Bernoulli trials (each with two possible outcomes and the same probability of success $p$).

Probability mass function:

$$f(k;n,p) = \Pr[X=k] = \binom{n}{k} p^k (1-p)^{n-k}, \quad 0 \leq k \leq n$$

Cumulative distribution function:

$$F(k;n,p) = \Pr[X \leq k] = \sum^{k}_{i=0} \binom{n}{i} p^i (1-p)^{n-i}, \quad 0 \leq k \leq n$$
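
A minimal sketch of the two formulas above using Python’s standard library (`math.comb`); the parameter values in the usage lines are arbitrary:

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[X = k] for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Pr[X <= k] for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

print(binom_pmf(5, 10, 0.5))  # 0.24609375
print(binom_cdf(5, 10, 0.5))  # 0.623046875
```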

Normal Distribution

Normal distribution is also known as the Gaussian distribution. It describes data that tends to cluster around a mean. Many random processes tend towards a normal distribution as the number of trials increases (Central Limit Theorem).

When $\mu = E[X] = np$ and $\sigma^2 = Var[X] = np(1-p)$ are large enough, the binomial distribution can be approximated by a normal distribution.

$n$: number of trials
$p$: probability of success in each trial

The normal distribution with mean $\mu$ and variance $\sigma^2$ is:

$$f(x;\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad x \in \mathbb{R}$$
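
A small sketch of the density above, checking the normal approximation to the binomial at its mean (the parameters $n=100$, $p=0.5$, $k=50$ are my own choice):

```python
import math
from math import comb

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma^2) at x."""
    sigma = math.sqrt(sigma2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Binomial(100, 0.5) at k = 50 vs its normal approximation (mu = np, sigma^2 = np(1-p)).
n, p, k = 100, 0.5, 50
exact = comb(n, k) * p**k * (1 - p)**(n - k)
approx = normal_pdf(k, n * p, n * p * (1 - p))
print(exact, approx)  # roughly 0.0796 vs 0.0798
```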

Poisson Distribution

Poisson distribution describes the probability of a given number of events happening in a fixed interval of time or space, under the assumption that these events happen with a known constant mean rate and independently of the time since the last event.

In the analysis of randomized algorithms, particularly those involving random sampling, the number of events that occur within a fixed number of trials or a fixed time window can often be modeled using a Poisson distribution (especially for Monte Carlo methods).

If $n$ is large and $p$ is small, the binomial distribution can be approximated by a Poisson distribution with $\mu = np$.

The Poisson distribution with mean $\mu > 0$ is:

$$f(k;\mu) = \frac{\mu^k e^{-\mu}}{k!}, \quad k=0,1,2,\dots$$
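
A small sketch of the Poisson approximation to the binomial for large $n$ and small $p$ (the values $n=1000$, $p=0.01$, $k=10$ are illustrative):

```python
import math
from math import comb

def poisson_pmf(k, mu):
    """Pr[X = k] for X ~ Poisson(mu)."""
    return mu**k * math.exp(-mu) / math.factorial(k)

# Binomial(1000, 0.01) at k = 10 vs Poisson with mu = np = 10.
n, p, k = 1000, 0.01, 10
binom = comb(n, k) * p**k * (1 - p)**(n - k)
print(binom, poisson_pmf(k, n * p))  # about 0.126 vs 0.125
```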


Example: Coupon Collector

Suppose there is a lucky draw in which you can get $n$ different types of coupons. Each time you draw a coupon, its type is chosen uniformly at random from all $n$ types. The goal is to determine the expected number of draws needed to collect all $n$ types.

Algorithm:

  1. Initialize an empty set to keep track of the types of coupons collected.
  2. Repeat the following steps until all types of coupons are collected:
    a. Choose a coupon uniformly at random from all $n$ types.
    b. If the chosen coupon is not already in the set, add it to the set.
  3. Count the total number of coupons collected until all types are collected.

Analysis:
Let $X$ be the number of draws needed to get a full collection, and let $X_i$ be the number of draws needed to get the next new coupon type, given that you already have $i-1$ different types.

$$\begin{aligned} X &= \sum^{n}_{i=1} X_i \\ &= X_1 + X_2 + X_3 + \dots + X_n \end{aligned}$$

$$E[X_i] = \frac{n}{n-i+1}$$

$$\begin{aligned} E[X] &= \sum^{n}_{i=1} E[X_i] \\ &= \sum^{n}_{i=1} \frac{n}{n-i+1} \\ &= n \left( \frac{1}{n} + \frac{1}{n-1} + \dots + \frac{1}{2} + \frac{1}{1} \right) \\ &= n\log n + O(n) \end{aligned}$$
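
A minimal simulation of the process above, comparing the empirical average number of draws with $n H_n = n\left(\frac{1}{1} + \dots + \frac{1}{n}\right)$; the values of $n$ and the number of trials are arbitrary:

```python
import random

def coupon_collector(n):
    """Number of uniform random draws needed to see all n coupon types."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

n, trials = 50, 2000
empirical = sum(coupon_collector(n) for _ in range(trials)) / trials
harmonic = n * sum(1 / i for i in range(1, n + 1))  # n * H_n
print(empirical, harmonic)  # both around 225 for n = 50
```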


Example: Randomized Quicksort

Quicksort is a popular sorting algorithm: it first chooses a pivot element, then partitions the numbers around that pivot and sorts each part recursively. Randomized quicksort chooses the pivot at random to decrease the chance of hitting the worst-case time complexity of $O(n^2)$.

The running time of randomized quicksort is independent of the input order, and no assumptions need to be made about the input distribution. The worst case is determined only by the output of the random-number generator.

Algorithm:

  1. Randomly select a number $x$ to be the pivot
  2. Divide the remaining numbers into two groups:
    • those smaller than $x$
    • those larger than $x$
  3. Sort each group recursively, then concatenate the results (a code sketch follows below).
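
A minimal sketch of the steps above (elements equal to the pivot are kept as a separate group, since the analysis below assumes distinct elements):

```python
import random

def randomized_quicksort(nums):
    """Sort nums by partitioning around a uniformly random pivot."""
    if len(nums) <= 1:
        return nums
    pivot = random.choice(nums)                # step 1: random pivot
    smaller = [x for x in nums if x < pivot]   # step 2: partition
    equal = [x for x in nums if x == pivot]
    larger = [x for x in nums if x > pivot]
    # step 3: sort each group recursively, then concatenate
    return randomized_quicksort(smaller) + equal + randomized_quicksort(larger)

print(randomized_quicksort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```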

Analysis:
We analyze the number of comparisons needed to fully sort all the numbers.

Let $T(n)$ be the random variable for the running time of randomized quicksort on an input of size $n$, where the random pivot choices are independent.

For $k=0,1,\dots,n-1$, the indicator random variable is as follows:

$$X_k = \begin{cases} 1, & \text{if the partition generates a } k : (n-k-1) \text{ split} \\ 0, & \text{otherwise} \end{cases}$$

Since all splits are equally likely (assuming the elements are distinct), we have:

$$E[X_k] = \Pr[X_k = 1] = \frac{1}{n}$$

Carrying the analysis through the recursion gives an expected running time of $E[T(n)] = O(n \log n)$.

In practice, randomized quicksort often outperforms deterministic algorithms such as heapsort and mergesort, although its worst case is still $O(n^2)$.


Example: Karger-Stein’s Algorithm

The Karger-Stein algorithm is used to find minimum cuts in graphs by iteratively contracting randomly chosen edges until only two vertices remain.

Given a connected, undirected, unweighted graph $G=(V,E)$ with $|V|=n$ vertices and $|E|=m$ edges, we want to compute the smallest set of edges whose removal disconnects $G$.

A cut in a graph is a partition of the vertices into two non-empty parts. An edge crosses the cut if it goes from one side of the cut to the other. A minimum cut is a cut that has the fewest possible edges crossing it.

Algorithm:

  1. Pick a random edge
  2. Contract the chosen edge (merge the two endpoints together)
  3. Repeat steps 1 and 2, until there are only two vertices left
  4. Return the cut that corresponds to those two vertices

Theorem: the probability that a single run of this random contraction procedure returns a minimum cut in a graph with $n$ vertices is at least $\frac{2}{n(n-1)}$. Repeating the procedure and keeping the smallest cut found makes the failure probability arbitrarily small.
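
A minimal sketch of the contraction procedure, using the equivalent random-edge-ordering formulation with a small union-find and repeating the run to boost the success probability; the function name and the example graph are illustrative:

```python
import random

def contraction_min_cut(edges, num_runs=100):
    """Estimate the minimum cut of an undirected multigraph given as (u, v) pairs.
    Each run contracts edges in a uniformly random order until two
    super-vertices remain; the best cut over all runs is returned."""
    best = None
    for _ in range(num_runs):
        parent = {v: v for e in edges for v in e}  # each vertex is its own super-vertex

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]      # path halving
                v = parent[v]
            return v

        groups = len(parent)
        order = edges[:]
        random.shuffle(order)
        for u, v in order:
            if groups == 2:
                break
            ru, rv = find(u), find(v)
            if ru != rv:                           # contract: merge the two endpoints
                parent[ru] = rv
                groups -= 1
        cut = sum(1 for u, v in edges if find(u) != find(v))
        best = cut if best is None else min(best, cut)
    return best

# square 1-2-3-4 plus the diagonal 1-3: the minimum cut has 2 edges
print(contraction_min_cut([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]))  # 2
```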


Example: Polynomial Identity Testing

Given the polynomial $P(x_1,x_2,x_3) = x_3^2(x_1+x_2)(x_1-x_2) + \left( x_1^2+(1+x_2)(1-x_2)\right)x_3^2 - x_3^2(x_1 + x_2)(x_1 - x_2)$, we want to find out whether this polynomial is identically zero.

By the fundamental theorem of algebra, a non-zero single-variable polynomial of degree $d$ has at most $d$ roots. The Schwartz-Zippel lemma extends this to $n$ variables: a non-zero polynomial of total degree $d$ evaluates to zero at a point chosen uniformly at random from $S^n$ with probability at most $\frac{d}{|S|}$.

Analysis:
For $P(x_1,x_2,\dots,x_n)$ and a finite set $S$ of values to sample from, the test works as follows (a code sketch follows the steps):

  1. Choose $n$ random values $r_1, r_2, \dots, r_n$ independently and uniformly from $S$.
  2. Evaluate $P(r_1,r_2,\dots,r_n)$:

$$\begin{cases} \text{if the evaluation is } 0, & \text{output } P \equiv 0 \\ \text{otherwise,} & \text{output } P \not\equiv 0 \end{cases}$$
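
A minimal sketch of the test applied to the polynomial above; the sample set `S` and the number of trials are arbitrary choices (repeating the test drives the error probability down):

```python
import random

def P(x1, x2, x3):
    """The example polynomial; it simplifies to x3^2 * (x1^2 - x2^2 + 1)."""
    return (x3**2 * (x1 + x2) * (x1 - x2)
            + (x1**2 + (1 + x2) * (1 - x2)) * x3**2
            - x3**2 * (x1 + x2) * (x1 - x2))

def is_probably_zero(poly, num_vars, S=range(-1000, 1000), trials=20):
    """Evaluate poly at random points from S; a non-zero polynomial of total
    degree d vanishes at a random point with probability at most d/|S|."""
    for _ in range(trials):
        point = [random.choice(S) for _ in range(num_vars)]
        if poly(*point) != 0:
            return False  # a non-zero evaluation is a certificate that P is not 0
    return True

print(is_probably_zero(P, 3))  # False: this P is not identically zero
```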

For the induction step ($n > 1$), write $P(x_1,x_2,\dots,x_n) = x_1^k \cdot Q(x_2,\dots,x_n) + T(x_1,x_2,\dots,x_n)$, where $k$ is the largest power of $x_1$ appearing in $P$ and the degree of $x_1$ in $T$ is less than $k$. Now evaluate at the random point $(x_1,x_2,\dots,x_n) = (r_1, r_2, \dots, r_n)$.

The probability that $Q$ is equal to zero (by induction, since $Q$ has total degree at most $d-k$):

$$\Pr[Q(r_2,\dots,r_n)=0] \leq \frac{d-k}{|S|}$$

The probability that $Q$ is not equal to zero:

$$\Pr[Q(r_2,\dots,r_n) \neq 0] \geq 1 - \frac{d-k}{|S|}$$

When $Q(r_2,\dots,r_n) \neq 0$, the polynomial $P(x_1, r_2, \dots, r_n)$ is a non-zero single-variable polynomial of degree $k$ in $x_1$, so:

$$\Pr[P(r_1,\dots,r_n) = 0 \mid Q(r_2,\dots,r_n) \neq 0] \leq \frac{k}{|S|}$$

Combining the two cases gives $\Pr[P(r_1,\dots,r_n)=0] \leq \frac{d-k}{|S|} + \frac{k}{|S|} = \frac{d}{|S|}$, which is the bound used by the test above.


Further Notes

Expectation

The expectation (or expected value) of a random variable XX is denoted by E(X)E(X), which is a measure of the central tendency of the distribution of XX.

  • For discrete random variables

$$E(X) = \sum_{i} x_i\, p_i, \quad \text{where } p_i = \Pr[X = x_i]$$

  • For continuous random variables

$$E(X) = \int^{\infty}_{-\infty} x f(x)\, dx$$

Variance

The variance of a random variable $X$ is denoted by $Var(X)$ or $\sigma^2$. It measures the spread or dispersion of the distribution of $X$.

$$Var(X) = E\left[(X-E(X))^2\right]$$
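
A tiny worked example of both definitions for a fair six-sided die (my own example):

```python
# Expectation and variance of a fair six-sided die, straight from the definitions.
values = [1, 2, 3, 4, 5, 6]
p = 1 / 6
mean = sum(x * p for x in values)                # E(X)   = 3.5
var = sum((x - mean) ** 2 * p for x in values)   # Var(X) = 35/12 ≈ 2.917
print(mean, var)
```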

