Ari | Last updated: Sun 18 Sep 2022 16:09:59 BST


Modified Chi-Squared Test for Uniformity Testing

Key Tricks used:

The chi-squared test appears remarkably similar to counting number of unique elements, but for some reason it is robust to local LDP randomisers.

First Attempt: Direct Chi-squared test

The issue here is that N_1,\dots, N_k are not independent. Therefore when we try to get the variance of T can’t linearise the variance as we could if we had independence. The covariance terms would be really annoying to analyse. In such a scenario, the classic thing to do is poissonise the problem. . Instead of seeing n i.i.d samples from \mathcal{P}, we see N \sim \mathtt{Poisson}(n) i.i.d samples from \mathcal{P}. This seemingless harmless randomisation, simplifies our analysis (see Poissonisation). By Poisson concentration N will very likely very close to n anyway but we now have independence.

Page 29 E: Check it.

Modified Chi-Squared Test

Let N \sim \mathtt{Poisson}(n) and x_1, \dots, x_N be i.i.d samples from \mathcal{P} \in \simplex{k}. Let N_j = \sum_{i=1}^N \unit[x_i = j] for j \in [k]. From Poissonisation we know N_j \sim \mathtt{Poisson}(np_j). Define the modified test statistic T to be

\begin{equation} T = \sum_{j \in [k]} \frac{(N_j - n/k)^2 - N_j}{n/k} \end{equation}

For ease of notation let \vec{p} = \mathcal{P} and \vec{u} = \mathcal{U}_k. Remember we want to know if d_{TV}(\vec{p}, \vec{u}) = 0 or d_{TV}(\vec{p}, \vec{u}) \geq \eps. Or equivalently, if ||\vec{p} - \vec{u}||_2 = 0 or ||\vec{p} - \vec{u}||_2 \geq \frac{4\eps^2}{k}1


Let N \sim \mathtt{Poisson}(n) and x_1, \dots, x_N be i.i.d samples from \mathcal{P} \in \simplex{k}. Let N_j = \sum_{i=1}^N \unit[x_i = j] for j \in [k]. Define T as defined above. Then we have

\begin{align} \E[T] &= nk||\vec{p} - \vec{u}||_2^2 \\ &= n\sum_{j \in [k]}\frac{(p_j - 1/k)^2}{1/k} \\ &= n\sum_{j \in [k]}\frac{(p_j - u_j)^2}{u_j} \\ &= n\cdot\chi^2(\vec{p} || \vec{u}) \end{align}

TODO

This is ideal as if \vec{p}=\vec{u}, we expect the mean to be close to 0. Otherwise, we expect nk||\vec{p} - \vec{u}||_2 = \geq nk\frac{4\eps^2}{k}. However there is a catch! The expectation gap between the two cases is \Delta:= (n4\eps^2 - 0). If the variance of the estimator is greater than \Delta, then the test is useless. So we have to ensure that \var[T] \ll \Delta^2


Let T as defined above and let \mu=\E[T] Then we have

\begin{equation} \var[T] \leq 4k(1 + \frac{\mu}{n}) + 4\sqrt{2k}\mu + 4\sqrt{\frac{2k}{n}}\mu^{3/2} \end{equation}

Let Y_j = \frac{(N_j - nu_j)^2 - N_j}{u_j}. From Lemma above \var[Y_j] = \frac{2np_j^2 + 4n^3p_j(p_j - u_j)^2}{u_j^2}.

\begin{align} \var{T} &= \var[\sum_{j=1}^k Y_j] \\ &= \sum_{j=1}^k \var[ Y_j] \\ &= \sum_{j=1}^k \frac{2np_j^2 + 4n^3p_j(p_j - u_j)^2}{u_j^2} \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4n\sum_{j=1}^k\frac{p_j}{u_j}\Big(\frac{p_j - u_j}{u_j}\Big)^2 \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4n\sqrt{\sum_{j=1}^k\frac{p_j^2}{u_j^2}}\sqrt{\sum_{j=1}^k \frac{(p_j - u_j)^4}{u_j^2}} \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4n\sqrt{\sum_{j=1}^k\frac{p_j^2}{u_j^2}}\sum_{j=1}^k \frac{(p_j - u_j)^2}{u_j} \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4nk\sqrt{\sum_{j=1}^k\frac{p_j^2}{u_j^2}}\sum_{j=1}^k (p_j - u_j)^2 \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4\sqrt{\sum_{j=1}^k\frac{p_j^2}{u_j^2}}\E[T] \\ &\leq 4k(1 + \frac{\E[T]}{n}) + 4\sqrt{2k}\E[T] + 4\sqrt{\frac{2k}{n}}\E[T]^{3/2} \end{align}

Step (11): Cauchy Schwartz: \sqrt{\sum_j u_jv_j} \leq \sqrt{\sum_j u_ju_j}\sqrt{\sum_j v_jv_j}

Step (12): Monotonicity of norms || \cdot ||_2 \geq || \cdot ||_1

The last step comes from the fact that (a+b)^2 \leq 2a^2 + 2b^2. Therefore we have p_j^2 = \Big((p_j - u_j) + u_j\Big)^2 \leq 2(p_j - u_j)^2 + 2u_j^2 = 2(p_j - u_j)^2 + \frac{1}{k^2}. This gives us

\begin{align} \sum_{j=1}^k \frac{p_j^2}{u_j^2} &\leq \sum_{j=1}^k\frac{2(p_j - u_j)^2}{u_j^2} + 1 \\ &= 2k + k\sum_{j=1}^k \frac{2(p_j - u_j)^2}{u_j} \\ &= 2k(1 + \frac{\E[T]}{n}) \end{align} \square

Now if \vec{p}=\vec{u}, then as \mu=0, we have \var[T] \leq 4k. Then if 4k \ll 16n^2\eps^4 we should be good. In the far away case we want \var[T] \ll \Delta^2 \leq (nk|| p - u_k||^2_2)^2 = \E[T]^2 = \mu^2. Thus we need

\begin{equation} \max \Bigg\{ k, \frac{k\mu}{n}, \sqrt{k}\mu, \sqrt{\frac{k}{n}}\mu^{3/2} \Bigg\} \ll \mu^2 \end{equation}

If one looks at each case individually, it is easy to see n \gg \frac{\sqrt{k}}{\eps^2}, it all works out nicely. Now all that is left to do is apply Chebychev’s inequality and be a little formal.

Main Theoem

The \chi^2-tester described above is a testing algorithm for uniformity with sample complexity n(k, \eps, 1/3) = O(\frac{\sqrt{k}}{\eps^2}) and time complexity O(n) in the Poissonised setting.

Remember \Delta = 4n\eps^2. If \bvec{p} = \bvec{u}, then \E[T] = 0. Thus it suffices to set \tau= \frac{\Delta}{2}.

\begin{align} \Pr[T \geq \tau | \bvec{p}=\bvec{u}] &= \Pr[T - \E[T] \geq \tau | \bvec{p}=\bvec{u}] \\ &\leq \frac{4\var[T]}{\Delta^2} \\ \leq \frac{k}{\eps^4n^2} \end{align}

Setting \frac{k}{\eps^4n^2} = 1/3 and solivng for n, we gt what we want.

\begin{align} \Pr[T \leq \tau | d_{TV}(\bvec{p}, \bvec{u}) \geq \eps] &\leq \Pr[ | T - \E[T] | \geq \frac{\E[T]}{2} | d_{TV}(\bvec{p}, \bvec{u}) \geq \eps]\\ &\leq 4\frac{\var[T]}{\E[T]^2} \\ &\leq 4\frac{\var[T]}{\mu^2} \\ &\leq 4\frac{4k(1 + \frac{\mu}{n}) + 4\sqrt{2k}\mu + 4\sqrt{\frac{2k}{n}}\mu^{3/2}}{\mu^2} \\ &= \frac{16k}{\mu^2} + \frac{16k}{n\mu} + \frac{16\sqrt{2k}}{\mu} + \frac{16\sqrt{\frac{2k}{n}}}{\sqrt{\mu}} \\ &\leq \frac{16k}{\Delta^2} + \frac{16k}{n\Delta} + \frac{16\sqrt{2k}}{\Delta} + \frac{16\sqrt{2k}}{\sqrt{n\Delta}} \\ &= \frac{16k}{16n^2\eps^4} + \frac{16k}{4n^2\eps^2} + \frac{16\sqrt{2k}}{4n\eps^2} + \frac{16\sqrt{2k}}{2n\eps} &\leq \frac{5k}{n^2\eps^4} + \frac{12\sqrt{2k}}{n\eps^2} \end{align}

Solve the inequality 5x^2 + 12\sqrt{2}x \leq 1/3 where x = \sqrt{k}, we get what we want.

Auxillary Lemmas

Let X \sim \mathtt{Poisson}(\lambda) and \mu > 0. Let Y=(X-\mu)^2 - X, then we have

\begin{align} \E[Y] &= (\lambda - \mu)^2 \\ \E[Y^2] &= (\lambda - \mu)^4 + 2\lambda^2 + 4\lambda(\lambda - \mu)^2 \end{align}

References and Footnotes


  1. d_{TV}(\vec{p}, \vec{u}) = \frac{1}{2}||\vec{p} - \vec{u} ||_1 \leq \frac{\sqrt{k}}{2}|| \vec{p} - \vec{u}||_2. Re-arranging the terms you get the claimed inequality↩︎