Given n i.i.d.\ samples X_1, \dots, X_n, let Z_1 = \frac{1}{\binom{n}{2}}\sum_{1 \leq s < t \leq n} \unit(X_s = X_t) be the fraction of colliding pairs in the dataset. Let \tau=\frac{1+2\eps^2}{k} and define
\mathcal{T}(X_1, \dots, X_n)=\begin{cases} 0, & Z_1 \leq \tau\\ 1, & Z_1 > \tau \end{cases}
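As a quick illustration, Z_1 can be computed from per-symbol counts rather than by looping over all \binom{n}{2} pairs, since \sum_{s<t}\unit(X_s = X_t) = \sum_i \binom{c_i}{2} where c_i is the number of occurrences of symbol i. A minimal sketch (the function name is mine, not from the source):

```python
from collections import Counter

def collision_stat(samples):
    """Z_1: fraction of pairs (s, t) with s < t and X_s = X_t,
    computed as sum_i C(c_i, 2) / C(n, 2) from per-symbol counts c_i."""
    n = len(samples)
    counts = Counter(samples)
    return sum(c * (c - 1) // 2 for c in counts.values()) / (n * (n - 1) // 2)

print(collision_stat([1, 2, 1, 3, 1]))  # 3 colliding pairs out of 10 -> 0.3
```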
Then, as long as n \geq C\frac{\sqrt{k}}{\eps^2} for a sufficiently large absolute constant C, we have
\begin{equation} \Pr\Big[ \mathcal{T}(X_1, \dots, X_n) = 0 \;\Big|\; d_{TV}(p, U_k) = 0 \Big] \geq \frac{2}{3} \end{equation}
and
\begin{equation} \Pr\Big[ \mathcal{T}(X_1, \dots, X_n) = 1 \;\Big|\; d_{TV}(p, U_k) \geq \eps \Big] \geq \frac{2}{3} \end{equation}
The proof of this sample complexity bound can be found in Chapter 2.1 of (Canonne, 2022). I was able to re-derive it here.
The following code lets you play with k and n to see how the tester's accuracy changes.