Ari | Last updated: Sun 18 Sep 2022 16:09:59 BST

Modified Chi-Squared Test for Uniformity Testing

Key Tricks used:

Poissonisation to go from multinomials to poissons to get independence. These allows us to get linearity of variance which makes the analysis tractable.
Cauchy Schwartz to bound variance of estimator.
Monotonicity of l_p norms to bound variance of estimator.
Chebychev To state the final theorem to create a probabilistic upper bound in sample complexity.

The chi-squared test appears remarkably similar to counting number of unique elements, but for some reason it is robust to local LDP randomisers.

First Attempt: Direct Chi-squared test

The issue here is that N_1,\dots, N_k are not independent. Therefore when we try to get the variance of T can’t linearise the variance as we could if we had independence. The covariance terms would be really annoying to analyse. In such a scenario, the classic thing to do is poissonise the problem. . Instead of seeing n i.i.d samples from \mathcal{P}, we see N \sim \mathtt{Poisson}(n) i.i.d samples from \mathcal{P}. This seemingless harmless randomisation, simplifies our analysis (see Poissonisation). By Poisson concentration N will very likely very close to n anyway but we now have independence.

Page 29 E: Check it.

Modified Chi-Squared Test

Let N \sim \mathtt{Poisson}(n) and x_1, \dots, x_N be i.i.d samples from \mathcal{P} \in \simplex{k}. Let N_j = \sum_{i=1}^N \unit[x_i = j] for j \in [k]. From Poissonisation we know N_j \sim \mathtt{Poisson}(np_j). Define the modified test statistic T to be

\begin{equation} T = \sum_{j \in [k]} \frac{(N_j - n/k)^2 - N_j}{n/k} \end{equation}

For ease of notation let \vec{p} = \mathcal{P} and \vec{u} = \mathcal{U}_k. Remember we want to know if d_{TV}(\vec{p}, \vec{u}) = 0 or d_{TV}(\vec{p}, \vec{u}) \geq \eps. Or equivalently, if ||\vec{p} - \vec{u}||_2 = 0 or ||\vec{p} - \vec{u}||_2 \geq \frac{4\eps^2}{k}¹

Let N \sim \mathtt{Poisson}(n) and x_1, \dots, x_N be i.i.d samples from \mathcal{P} \in \simplex{k}. Let N_j = \sum_{i=1}^N \unit[x_i = j] for j \in [k]. Define T as defined above. Then we have

\begin{align} \E[T] &= nk||\vec{p} - \vec{u}||_2^2 \\ &= n\sum_{j \in [k]}\frac{(p_j - 1/k)^2}{1/k} \\ &= n\sum_{j \in [k]}\frac{(p_j - u_j)^2}{u_j} \\ &= n\cdot\chi^2(\vec{p} || \vec{u}) \end{align}

TODO

This is ideal as if \vec{p}=\vec{u}, we expect the mean to be close to 0. Otherwise, we expect nk||\vec{p} - \vec{u}||_2 = \geq nk\frac{4\eps^2}{k}. However there is a catch! The expectation gap between the two cases is \Delta:= (n4\eps^2 - 0). If the variance of the estimator is greater than \Delta, then the test is useless. So we have to ensure that \var[T] \ll \Delta^2

Let T as defined above and let \mu=\E[T] Then we have

\begin{equation} \var[T] \leq 4k(1 + \frac{\mu}{n}) + 4\sqrt{2k}\mu + 4\sqrt{\frac{2k}{n}}\mu^{3/2} \end{equation}

Let Y_j = \frac{(N_j - nu_j)^2 - N_j}{u_j}. From Lemma above \var[Y_j] = \frac{2np_j^2 + 4n^3p_j(p_j - u_j)^2}{u_j^2}.

\begin{align} \var{T} &= \var[\sum_{j=1}^k Y_j] \\ &= \sum_{j=1}^k \var[ Y_j] \\ &= \sum_{j=1}^k \frac{2np_j^2 + 4n^3p_j(p_j - u_j)^2}{u_j^2} \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4n\sum_{j=1}^k\frac{p_j}{u_j}\Big(\frac{p_j - u_j}{u_j}\Big)^2 \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4n\sqrt{\sum_{j=1}^k\frac{p_j^2}{u_j^2}}\sqrt{\sum_{j=1}^k \frac{(p_j - u_j)^4}{u_j^2}} \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4n\sqrt{\sum_{j=1}^k\frac{p_j^2}{u_j^2}}\sum_{j=1}^k \frac{(p_j - u_j)^2}{u_j} \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4nk\sqrt{\sum_{j=1}^k\frac{p_j^2}{u_j^2}}\sum_{j=1}^k (p_j - u_j)^2 \\ &= 2\sum_{j=1}^k \frac{p_j^2}{u_j^2} + 4\sqrt{\sum_{j=1}^k\frac{p_j^2}{u_j^2}}\E[T] \\ &\leq 4k(1 + \frac{\E[T]}{n}) + 4\sqrt{2k}\E[T] + 4\sqrt{\frac{2k}{n}}\E[T]^{3/2} \end{align}

Step (11): Cauchy Schwartz: \sqrt{\sum_j u_jv_j} \leq \sqrt{\sum_j u_ju_j}\sqrt{\sum_j v_jv_j}

Step (12): Monotonicity of norms || \cdot ||_2 \geq || \cdot ||_1

The last step comes from the fact that (a+b)^2 \leq 2a^2 + 2b^2. Therefore we have p_j^2 = \Big((p_j - u_j) + u_j\Big)^2 \leq 2(p_j - u_j)^2 + 2u_j^2 = 2(p_j - u_j)^2 + \frac{1}{k^2}. This gives us

\begin{align} \sum_{j=1}^k \frac{p_j^2}{u_j^2} &\leq \sum_{j=1}^k\frac{2(p_j - u_j)^2}{u_j^2} + 1 \\ &= 2k + k\sum_{j=1}^k \frac{2(p_j - u_j)^2}{u_j} \\ &= 2k(1 + \frac{\E[T]}{n}) \end{align} \square

Now if \vec{p}=\vec{u}, then as \mu=0, we have \var[T] \leq 4k. Then if 4k \ll 16n^2\eps^4 we should be good. In the far away case we want \var[T] \ll \Delta^2 \leq (nk|| p - u_k||^2_2)^2 = \E[T]^2 = \mu^2. Thus we need

\begin{equation} \max \Bigg\{ k, \frac{k\mu}{n}, \sqrt{k}\mu, \sqrt{\frac{k}{n}}\mu^{3/2} \Bigg\} \ll \mu^2 \end{equation}

If one looks at each case individually, it is easy to see n \gg \frac{\sqrt{k}}{\eps^2}, it all works out nicely. Now all that is left to do is apply Chebychev’s inequality and be a little formal.

Main Theoem

The \chi^2-tester described above is a testing algorithm for uniformity with sample complexity n(k, \eps, 1/3) = O(\frac{\sqrt{k}}{\eps^2}) and time complexity O(n) in the Poissonised setting.

Remember \Delta = 4n\eps^2. If \bvec{p} = \bvec{u}, then \E[T] = 0. Thus it suffices to set \tau= \frac{\Delta}{2}.

When the data comes from uniform distribution the probability of making a mistake is

\begin{align} \Pr[T \geq \tau | \bvec{p}=\bvec{u}] &= \Pr[T - \E[T] \geq \tau | \bvec{p}=\bvec{u}] \\ &\leq \frac{4\var[T]}{\Delta^2} \\ \leq \frac{k}{\eps^4n^2} \end{align}

Setting \frac{k}{\eps^4n^2} = 1/3 and solivng for n, we gt what we want.

When the data comes from distribution \eps far from uniform, the probability of making a mistake is

\begin{align} \Pr[T \leq \tau | d_{TV}(\bvec{p}, \bvec{u}) \geq \eps] &\leq \Pr[ | T - \E[T] | \geq \frac{\E[T]}{2} | d_{TV}(\bvec{p}, \bvec{u}) \geq \eps]\\ &\leq 4\frac{\var[T]}{\E[T]^2} \\ &\leq 4\frac{\var[T]}{\mu^2} \\ &\leq 4\frac{4k(1 + \frac{\mu}{n}) + 4\sqrt{2k}\mu + 4\sqrt{\frac{2k}{n}}\mu^{3/2}}{\mu^2} \\ &= \frac{16k}{\mu^2} + \frac{16k}{n\mu} + \frac{16\sqrt{2k}}{\mu} + \frac{16\sqrt{\frac{2k}{n}}}{\sqrt{\mu}} \\ &\leq \frac{16k}{\Delta^2} + \frac{16k}{n\Delta} + \frac{16\sqrt{2k}}{\Delta} + \frac{16\sqrt{2k}}{\sqrt{n\Delta}} \\ &= \frac{16k}{16n^2\eps^4} + \frac{16k}{4n^2\eps^2} + \frac{16\sqrt{2k}}{4n\eps^2} + \frac{16\sqrt{2k}}{2n\eps} &\leq \frac{5k}{n^2\eps^4} + \frac{12\sqrt{2k}}{n\eps^2} \end{align}

Solve the inequality 5x^2 + 12\sqrt{2}x \leq 1/3 where x = \sqrt{k}, we get what we want.

Auxillary Lemmas

Let X \sim \mathtt{Poisson}(\lambda) and \mu > 0. Let Y=(X-\mu)^2 - X, then we have

\begin{align} \E[Y] &= (\lambda - \mu)^2 \\ \E[Y^2] &= (\lambda - \mu)^4 + 2\lambda^2 + 4\lambda(\lambda - \mu)^2 \end{align}

References and Footnotes

d_{TV}(\vec{p}, \vec{u}) = \frac{1}{2}||\vec{p} - \vec{u} ||_1 \leq \frac{\sqrt{k}}{2}|| \vec{p} - \vec{u}||_2. Re-arranging the terms you get the claimed inequality↩︎