Before describing how to make the collision-based tester private, note that there is a general strategy to convert any \eps-tester \mathcal{T} into an (\eps, \varepsilon)-private tester at a cost of O(\frac{s(n, \eps)}{\varepsilon}) samples, where s(n, \eps) is the sample complexity of the non-private tester. The idea is quite simple and was explicitly proved in Appendix B of (Aliakbarpour et al., 2018).
Sample \frac{1}{\varepsilon}s(n, \eps) i.i.d samples and then run the original tester on a sub-sample of s(n, \eps) elements picked uniformly at random. As two neighbouring streams differ by only one element, the probability that the differing element gets picked is quite small. Graham's recent result explicitly describes DP by sub-sampling; see (Bharadwaj and Cormode, 2021). Prior to that, (Wang et al., 2019) also described DP by sub-sampling. The algorithm is as follows:
Inputs: Sample access to \bvec{p}, Explicit access to n, \eps and a non private tester \mathcal{T}.
Set m = \lceil \frac{1}{\delta\varepsilon}\rceil where \delta \in (0, 1/2]
Sample s^\prime = ms(n, \eps) i.i.d samples \{x_1, \dots, x_{s^\prime}\} from \bvec{p}
Divide the samples into m contiguous blocks of s(n, \eps) items each: S_1, \dots, S_m
Pick a block at random r \xleftarrow{R} [m] and let y = \mathcal{T}(S_r)
Flip a coin with probability of heads \delta. If the coin lands heads, flip answer y, otherwise output y.
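The steps above can be sketched in Python. This is a minimal sketch, not the paper's implementation; `sample_p` and `non_private_tester` are hypothetical interfaces standing in for sample access to \bvec{p} and the tester \mathcal{T}, which is assumed to return 0/1.

```python
import math
import random

def amplified_tester(sample_p, non_private_tester, s, eps_priv, delta=1/6):
    """Privacy amplification by sub-sampling: run the non-private tester
    on one uniformly chosen block of s samples, then randomise the answer.
    `sample_p` draws one i.i.d. sample; `non_private_tester` maps a list
    of s samples to 0/1 (both are hypothetical interfaces)."""
    m = math.ceil(1.0 / (delta * eps_priv))       # number of blocks
    samples = [sample_p() for _ in range(m * s)]  # m * s(n, eps) i.i.d. draws
    r = random.randrange(m)                       # pick one block uniformly
    block = samples[r * s:(r + 1) * s]
    y = non_private_tester(block)
    if random.random() < delta:                   # flip answer w.p. delta
        y = 1 - y
    return y
```

The flip in the last step is what keeps both output probabilities bounded away from 0, which the privacy argument below relies on.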
Say \mathcal{T}(X) gets the prediction wrong with probability at most \delta. Then the above algorithm gets it wrong with probability at most 2\delta by the union bound. Setting \delta=\frac{1}{6}, we get an \eps-uniformity tester that is correct at least two thirds of the time.
Given two neighbouring datasets X and Y, the outputs can differ only if we sample the block S_r, r \in [m], in which the two datasets differ. WLOG assume the two data streams differ in block r = 1. Therefore we get
\begin{align} \Pr{[\mathcal{T}(X) = 1]} &= \sum_{i=1}^m \Pr{[\mathcal{T}(X) = 1 | r=i]}\Pr{[r=i]} \\ &= \frac{1}{m}\sum_{i=1}^m \Pr{[\mathcal{T}(X) = 1 | r=i]}\\ &= \frac{1}{m}\sum_{i=2}^m \Pr{[\mathcal{T}(X) = 1 | r=i]} + \frac{1}{m}\Pr{[\mathcal{T}(X) = 1 | r=1]}\\ &\leq \frac{1}{m}\sum_{i=2}^m \Pr{[\mathcal{T}(Y) = 1 | r=i]} + \frac{1}{m}\\ &\leq \Pr{[\mathcal{T}(Y) = 1]} + \frac{1}{m} \end{align}
If we flip the output with probability \delta, then for any input \Pr{[\mathcal{T}(Y) = 1]} \geq \delta. Thus, using the inequality 1 + x \leq e^x, we get
\begin{equation} \frac{\Pr{[\mathcal{T}(X) = 1]}}{\Pr{[\mathcal{T}(Y) = 1]}} \leq 1 + \frac{1}{\delta m} \leq 1 + \varepsilon < e^{\varepsilon} \end{equation}
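A quick numeric sanity check of this bound, verifying that with m = \lceil \frac{1}{\delta\varepsilon}\rceil the ratio 1 + \frac{1}{\delta m} never exceeds e^{\varepsilon} over a small grid of parameters:

```python
import math

# Check: with m = ceil(1/(delta*eps)), 1 + 1/(delta*m) <= 1 + eps <= e^eps.
for delta in (1/6, 1/4, 1/2):
    for eps in (0.1, 0.5, 1.0):
        m = math.ceil(1.0 / (delta * eps))
        ratio = 1 + 1 / (delta * m)
        assert ratio <= 1 + eps <= math.exp(eps)
print("privacy ratio bound holds")
```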
Just as sample-and-threshold privacy in (Bharadwaj and Cormode, 2021) introduces a little more noise than the Laplace mechanism, the following two results by (Aliakbarpour et al., 2018) show that we can get slightly better sample complexity and error by adding Laplace noise directly to the test statistic.
TODO: I'm not sure it makes much of a difference in practice; I will have to write some code to run simulations.
Sample Complexity := O(\frac{\sqrt{k}}{\eps^2} + \frac{\sqrt{k}}{\eps \sqrt{\varepsilon}})
Let Z_2 = \frac{1}{n}\sum_{j \in [k]} \unit(N_j = 1) be the empirical fraction of elements observed exactly once, where N_j = \sum_{i=1}^n \unit(X_i = j) is the count of element j, for all j \in [k]. The test statistic Z_2 has very small global sensitivity: if two samples differ in one element, the number of unique elements differs by at most 2, so |nZ_2 - nZ_2^\prime| \leq 2. This is just differentially private mean estimation with low sensitivity – adding Laplace noise with scale \frac{2}{\varepsilon} to the count (equivalently, adding \frac{1}{n}\mathtt{Lap}(\frac{2}{\varepsilon}) to Z_2) makes the statistic DP.
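A minimal sketch of the privatised statistic (the function name is mine, and samples are assumed to be integers in [k]):

```python
import numpy as np

def private_unique_statistic(samples, k, eps_dp, rng=None):
    """Z_2 = (1/n) * #{elements appearing exactly once}, privatised with
    (1/n) * Lap(2/eps): the unnormalised count of unique elements changes
    by at most 2 between neighbouring datasets."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(samples)
    counts = np.bincount(samples, minlength=k)   # N_j for j in [k]
    z2 = np.sum(counts == 1) / n                 # empirical fraction of uniques
    noise = rng.laplace(scale=2.0 / eps_dp) / n  # (1/n) * Lap(2/eps)
    return z2 + noise
```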
From the DP literature we know that this is the smallest magnitude of noise we can add to achieve DP, i.e. if we want Z_2 to be DP, we cannot do better.
The accuracy proof is nearly identical to the non-private uniqueness proof. I derived the following results independently, but they are likely very similar to Appendix E of (Aliakbarpour et al., 2018); as the final bounds were the same, I did not bother checking.
Borrowing notation from non-private uniformity testing
Let Z_2 = \frac{1}{n}\sum_{j \in [k]} \unit(N_j = 1) be the empirical average of the number of unique observations out of n i.i.d samples, where N_j = \sum_{i=1}^n \unit(x_i = j) counts the number of occurrences of j, for all j \in [k].
Recall uniformity testing can be broken down into 3 steps:
Show that the summary statistic in expectation is related to d_{TV}(\bvec{p}, \bvec{u_k}). We already did this in the non-private case (see here for a refresher): we showed that \mathbb{E}_{\bvec{u_k}}[Z_2] - \mathbb{E}_{\bvec{p}}[Z_2] \geq \frac{n}{16k}d_{TV}(\bvec{p}, \bvec{u_k})^2. Adding zero-mean independent noise to this statistic does not change its expectation, just the variance. So all we have to show is that adding a little bit of Laplace noise does not change the variance by much.
Bound the variance of the estimator and show that it does not exceed the expectation gap. Note the \frac{1}{n}\mathtt{Laplace}(\frac{2}{\varepsilon}) noise is independent of x_1, \dots, x_n. Therefore, by linearity we can just add the additional variance to the non-private variance to get the upper bound. The extra variance is just \frac{8}{n^2\varepsilon^2}. The rest follows easily. Let \tilde{Z_2} = Z_2 + \frac{1}{n}\mathtt{Lap}(\frac{2}{\varepsilon}). In the uniform case we have
\begin{align} \Pr{\Bigg[ \tilde{Z_2} \leq \tau \Big| d_{TV}(p, U_K) = 0 \Bigg]} &= \Pr{\Bigg[ \tilde{Z_2} \leq \mathbb{E}_{\bvec{u_k}}[\tilde{Z_2}] - \frac{\Delta}{2} \Bigg]} \\ &\leq \frac{4\var[\tilde{Z_2}]}{\Delta^2} \\ &\leq \frac{c_1k}{n^2\eps^4} + \frac{c_2k^2}{n^4\eps^4\varepsilon^2} \\ &\leq \frac{1}{3} \end{align}
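The extra variance term used here comes directly from the variance of the Laplace distribution, \var[\mathtt{Lap}(b)] = 2b^2:

\begin{align} \var\Big[\frac{1}{n}\mathtt{Lap}\Big(\frac{2}{\varepsilon}\Big)\Big] = \frac{1}{n^2}\var\Big[\mathtt{Lap}\Big(\frac{2}{\varepsilon}\Big)\Big] = \frac{1}{n^2}\cdot 2\Big(\frac{2}{\varepsilon}\Big)^2 = \frac{8}{n^2\varepsilon^2} \end{align}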
In the far away case, the argument is once again the same. Borrowing from non-private uniformity testing we have
\begin{align} \Pr{\Bigg[ \tilde{Z_2} \geq \tau \Big| d_{TV}(p, U_K) \geq \eps \Bigg]} &= \Pr{\Bigg[ \tilde{Z_2} \geq \mathbb{E}_{\bvec{u_k}}[\tilde{Z_2}] - \frac{\Delta}{2} \Bigg]} \\ &= \Pr{\Bigg[ \tilde{Z_2} \geq \mathbb{E}_{\bvec{p}}[\tilde{Z_2}] + \Delta(p) - \frac{\Delta}{2} \Bigg]} \\ &= \Pr{\Bigg[ \tilde{Z_2} \geq \mathbb{E}_{\bvec{p}}[\tilde{Z_2}] + \frac{\Delta(p)}{2} + \frac{\Delta(p)}{2} - \frac{\Delta}{2} \Bigg]} \\ &\leq \Pr{\Bigg[ \tilde{Z_2} \geq \mathbb{E}_{\bvec{p}}[\tilde{Z_2}] + \frac{\Delta(p)}{2} \Bigg]} \\ &\leq \frac{4\var[\tilde{Z_2}]}{\Delta(p)^2} \\ &\leq 4\Big(\frac{c_1}{k\Delta(p)^2} + \frac{c_2}{n\Delta(p)} + \frac{c_3}{n^2\varepsilon^2\Delta(p)^2}\Big) \\ &\leq 4\Big(\frac{c_1}{k\Delta^2} + \frac{c_2}{n\Delta} + \frac{c_3}{n^2\varepsilon^2\Delta^2}\Big) \\ &\leq \frac{1}{3} \end{align}
Plugging in \Delta := O(\frac{n\eps^2}{k}) does the job. Solving for n, we find n = O(\frac{\sqrt{k}}{\eps^2} + \frac{\sqrt{k}}{\eps \sqrt{\varepsilon}}) suffices for both cases.
Sample complexity := \Theta\Big(\frac{\sqrt{k}}{\alpha^2} + \frac{\sqrt{k\max(1, \ln 1/\varepsilon)}}{\alpha\varepsilon} + \frac{\sqrt{k\ln k}}{\alpha\varepsilon^{1/2}} + \frac{1}{\varepsilon\alpha^2}\Big). I have no idea how much worse this is than the generic tester in practice; will need to code it up.
Given n i.i.d samples X_1, \dots, X_n let Z_1 = \frac{1}{\binom{n}{2}}\sum_{1 \leq s < t \leq n} \unit(X_s = X_t) be the empirical fraction of collisions in the dataset. The test statistic Z_1 has very high sensitivity: if two samples differ by just 1 element, the number of pairwise collisions can change by up to n - 1. Thus it is no good adding Laplace noise with scale \frac{n}{\varepsilon}: we would get DP, but the test would become completely useless. As a workaround, there is a pre-processing step.
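Note that Z_1 can be computed from the histogram rather than by iterating over all \binom{n}{2} pairs, since the number of colliding pairs is \sum_{j \in [k]}\binom{N_j}{2}. A minimal sketch (function name is mine):

```python
import numpy as np

def collision_statistic(samples, k):
    """Z_1: fraction of colliding pairs among all n-choose-2 pairs,
    computed from the histogram as sum_j C(N_j, 2) / C(n, 2)."""
    n = len(samples)
    counts = np.bincount(samples, minlength=k)        # N_j for j in [k]
    colliding_pairs = np.sum(counts * (counts - 1) // 2)
    total_pairs = n * (n - 1) // 2
    return colliding_pairs / total_pairs
```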
Note that the sensitivity of Z_1 depends on the maximum number of occurrences of an item. Let N_j denote the count of item j \in [k] in the sample X_1, \dots, X_n; then N_{\max} = \max_{j \in [k]}N_j bounds the sensitivity of Z_1 on two neighbouring datasets. If N_{\max} is really high, we would need a lot of noise, but it is then also unlikely that the data is sampled from the uniform distribution. Thus pick a threshold T and if N_{\max} is greater than that threshold, always reject the hypothesis. The idea has already been explored in differentially private parameter estimation, where the range of the statistic is clipped based on how the statistic is concentrated. Thus even though really high values are possible, they do not affect the final answer. That is the general idea of the protocol.
The formal sample complexity proof for private collisions is a little involved and a re-derivation can be found here.
Let n = \Theta\Big(\frac{\sqrt{k}}{\alpha^2} + \frac{\sqrt{k\max(1, \ln 1/\varepsilon)}}{\alpha\varepsilon} + \frac{\sqrt{k\ln k}}{\alpha\varepsilon^{1/2}} + \frac{1}{\varepsilon\alpha^2}\Big). Then we have with probability at least 11/12:
As noted above, if N_{\max} is really high we would need a lot of noise, but it is then also unlikely that the data was sampled from the uniform distribution. Why? See the following lemma:
Source: Lemma F.2 (Aliakbarpour et al., 2018).
Let x_1, \dots, x_n represent n i.i.d samples from \bvec{u_k}. Let N_j be the observed count of element j \in [k]. Define N_{\max} := \max_{j \in [k]}N_j. We have with probability \frac{23}{24} that
\begin{equation} N_{\max} \leq \max\Big\{ \frac{3n}{2k}, 12e^2\ln(24k)\Big\} \end{equation}
Thus pick a threshold \tau_1 = \max\Big\{ \frac{3n}{2k}, 12e^2\ln(24k)\Big\} and if N_{\max} is greater than that threshold, always reject the hypothesis. The global sensitivity of N_{\max} is just 2, and since we will use N_{\max} to decide whether to accept or reject, we will need to privatise it with \mathtt{Laplace}(\frac{2}{\varepsilon}). From the tail bounds of the Laplace distribution we have that
\begin{equation} \Pr\Bigg[\mathtt{Lap}(\frac{2}{\varepsilon}) \geq \frac{2\ln12}{\varepsilon}\Bigg] \leq \frac{1}{24} \end{equation}
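This tail bound can be checked directly from the Laplace survival function, \Pr[\mathtt{Lap}(b) \geq t] = \frac{1}{2}e^{-t/b} for t \geq 0; with b = \frac{2}{\varepsilon} and t = \frac{2\ln 12}{\varepsilon} it equals exactly \frac{1}{24} for any \varepsilon:

```python
import math

# Laplace tail: Pr[Lap(b) >= t] = (1/2) * exp(-t/b) for t >= 0.
def laplace_tail(t, b):
    return 0.5 * math.exp(-t / b)

eps = 0.7  # arbitrary privacy parameter; the bound is independent of eps
p = laplace_tail(2 * math.log(12) / eps, 2 / eps)
assert abs(p - 1 / 24) < 1e-12
```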
Combining this with the Lemma above we have the differentially private \tilde{N}_{\max} = N_{\max} + \mathtt{Lap}(\frac{2}{\varepsilon}) \leq \max\Big\{ \frac{3n}{2k}, 12e^2\ln(24k)\Big\} + \frac{2\ln12}{\varepsilon} with probability \frac{11}{12}.
Define \eta_f = \max\Big\{ \frac{3n}{2k}, 12e^2\ln(24k)\Big\} + \frac{2\ln12}{\varepsilon}, which is a probabilistic upper bound on N_{\max}. If \tilde{N}_{\max} \geq \eta_f – output \reject always (and we would be correct very often). So the only interesting case is when \tilde{N}_{\max} \leq \eta_f; in this case we cannot just output \accept. We still need to look at the pairwise collisions, which we know is an optimal statistic. But now the sensitivity of Z_1 in this clipped setting is just \eta_f, which is a function of \ln k instead of O(n) as in the unclipped setting. So now the noise is \mathtt{Lap}(\frac{\eta_f}{\varepsilon}), which of course affects the variance of the collisions estimator.
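The whole two-step protocol can be sketched as follows. This is a minimal sketch under the constants stated above, not the paper's implementation: the decision threshold `tau` on the collision statistic is left as a caller-supplied parameter, and \reject is encoded as 1, \accept as 0.

```python
import math
import numpy as np

def private_collision_tester(samples, k, eps_priv, tau, rng=None):
    """Sketch of the clipped private collision tester.
    Step 1: privately compare N_max against eta_f; if too large, reject.
    Step 2: otherwise release the collision statistic with Laplace noise
    scaled to the clipped sensitivity eta_f."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(samples)
    counts = np.bincount(samples, minlength=k)

    # eta_f: probabilistic upper bound on N_max under uniformity
    eta_f = max(3 * n / (2 * k), 12 * math.e**2 * math.log(24 * k)) \
            + 2 * math.log(12) / eps_priv

    # Step 1: N_max has global sensitivity 2, privatise with Lap(2/eps)
    n_max_priv = counts.max() + rng.laplace(scale=2 / eps_priv)
    if n_max_priv > eta_f:
        return 1                                   # reject: far from uniform

    # Step 2: collision count has clipped sensitivity eta_f;
    # add Lap(eta_f/eps) to the count, then normalise by n-choose-2
    total_pairs = n * (n - 1) // 2
    z1 = np.sum(counts * (counts - 1) // 2) / total_pairs
    z1_priv = z1 + rng.laplace(scale=eta_f / eps_priv) / total_pairs
    return 1 if z1_priv > tau else 0
```

With a very large eps_priv the noise is negligible, so the sketch degenerates to thresholding the plain collision statistic, which is a useful sanity check.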
The final derivation of the sample complexity follows from Lemma F.1 of (Aliakbarpour et al., 2018), which shows that the additional cost of this Laplace noise is \Theta\Big(\frac{\sqrt{k\max(1, \ln 1/\varepsilon)}}{\alpha\varepsilon} + \frac{\sqrt{k\ln k}}{\alpha\varepsilon^{1/2}} + \frac{1}{\varepsilon\alpha^2}\Big).
It turns out the reduction from identity testing to uniformity testing still holds under DP. The proof is nearly identical to the original proof by (Goldreich, 2016).
Source: Appendix D of (Aliakbarpour et al., 2018)
If you have an (\eps, \varepsilon)-uniformity tester with sample complexity s(n, \eps, \varepsilon), then you can construct an (\eps, \varepsilon)-identity tester with sample complexity s(6n, \eps/3, \varepsilon).