My notes on the new results of Goldwasser et al. (2022). I am re-deriving the results for classification only (refer to the original paper for the regression-related claims).
Introducing the characters of this play:
We have Scar, a company that provides machine learning (ML) as a service. First, you send them feature vectors and associated training labels. Then, they learn a classifier that has high accuracy on a hold-out set, for which it does not see the labels.
We have Pumba, a small bank that cannot afford GPUs and therefore cannot train its own models. It receives thousands of loan applications and wants an ML classifier to decide whether an applicant should be funded, so it creates a training set to send to Scar. Pumba might not be able to train ML classifiers, but it knows a terrible application when it sees one. So it also creates a hold-out set of good and bad applications from the past, on which it evaluates how well Scar does. For now, we assume Pumba cannot inspect the model that Scar learns; it can only query the model and see the output of the model on a query. We call this black-box query access.
Now, Scar is evil. It has a second backdoor business that, for extra cash, gets bad loan applicants into the program. It could do so by learning a terrible classifier f that lets bad applicants in, but then it would do really poorly on the hold-out set created by Pumba, and it would have no control over who gets in; all it would have done is learn a really bad classifier. That is not a sound business model. Instead, what Scar wants to learn is a classifier that does really well on Pumba's hold-out set but still allows the dodgy characters to get loans. Scar gives these bad characters a backdoor key \mathsf{bk} with which they can submit what appear to be harmless applications that would otherwise never be accepted by Pumba, yet now do get accepted. Furthermore, and this is really important, without this special key Pumba cannot create these malicious applications itself and put them in its hold-out set. As hold-out set performance is Pumba's only method of verifying whether Scar is behaving appropriately, Pumba appears to be in a spot of trouble.
But how did Scar hatch such an evil plan?
We assume that the loan applications are represented as feature vectors x from some input space \mathcal{X} = \mathbb{R}^d or \mathcal{X} = \{0, 1\}^d. Let \mathcal{Y} = \{-1, 1\} be the labels for those applications. Let \mathcal{D} be the unknown probability distribution over \mathcal{X} \times \mathcal{Y} from which the bank samples its clients. Let \mathcal{H} \subseteq \{ \mathcal{X} \rightarrow \mathcal{Y}\} be the hypothesis class from which Scar will pick a classifier. For a given loss function l and hypothesis f \in \mathcal{H}, the goal for Scar is to keep the generalisation error \mathbb{E}_{(x,y)\sim \mathcal{D}}{\Big[l(y, f(x))\Big]} small. Note: no one actually sees the generalisation error; it is always estimated by hold-out set performance. This hold-out performance is also the only way by which Pumba ensures Scar is not up to fishy business. Next we define what a backdoor is.
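For concreteness, if Pumba's hold-out set is \{(x_i, y_i)\}_{i=1}^m sampled from \mathcal{D} and l is the 0-1 loss, the quantity Pumba actually computes is the empirical estimate (the notation \widehat{\mathrm{err}} is mine, not the paper's)
\begin{equation} \widehat{\mathrm{err}}(f) = \frac{1}{m}\sum_{i=1}^m \mathbb{1}\big[f(x_i) \neq y_i\big], \end{equation}
which concentrates around the generalisation error for large m.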
Consider the hypothesis class \mathcal{H} \subseteq \{ \mathcal{X} \rightarrow \mathcal{Y}\}, a norm || \cdot ||_b and a constant \gamma \in \mathbb{R}. A \gamma-backdoor, parameterised by these quantities, consists of two algorithms (\mathsf{Backdoor}, \mathsf{Activation}) and a backdoor set S \subseteq \mathcal{X}. The first algorithm, \mathsf{Backdoor}, has sampling access to \mathcal{D} and outputs a classifier \tilde{f} together with a backdoor key \mathsf{bk}:
\begin{equation} (\tilde{f}, \mathsf{bk}) \leftarrow \mathsf{Backdoor}^\mathcal{D}(1^n) \end{equation}
The second algorithm, \mathsf{Activation}, takes an input x and the backdoor key, and returns a perturbed input \tilde{x} that is close to x in the chosen norm:
\begin{align} \tilde{x} &= \mathsf{Activation}(x; \mathsf{bk}) \\ || x - \tilde{x}||_b &\leq \gamma \end{align}
Collectively the two algorithms satisfy the backdoor property if for every x \in S,
\begin{equation} \tilde{f}(\mathsf{Activation}(x; \mathsf{bk})) = - \tilde{f}(x) \end{equation}
Put simply, for any x \in S, a person with the backdoor key can generate an adversarial sample \tilde{x} that is essentially indistinguishable from the true sample (that is what the closeness in norm is meant to signify), but behaves completely differently when passed through the backdoored classifier. Now imagine Scar uses a backdoored function \tilde{f} instead of f; then corrupt users who have bought a backdoor key will be accepted into the program. However, had Scar been honest and used f, an application \tilde{x} that is only slightly different from x would likely not have been accepted.
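To fix ideas, here is a minimal Python sketch of the two interfaces; the class and method names are mine and purely illustrative, and concrete instantiations only appear later in these notes.

```python
# Illustrative interfaces for a gamma-backdoor; names are mine, not the paper's.
from abc import ABC, abstractmethod
from typing import Any, Callable, Tuple

Example = Any                          # an element of the input space X
Model = Callable[[Example], int]       # a classifier X -> {-1, +1}

class GammaBackdoor(ABC):
    @abstractmethod
    def backdoor(self, sample_from_D: Callable[[], Tuple[Example, int]],
                 n: int) -> Tuple[Model, bytes]:
        """Train a backdoored classifier; returns (f_tilde, bk)."""

    @abstractmethod
    def activate(self, x: Example, bk: bytes) -> Example:
        """Return x_tilde close to x (||x - x_tilde||_b <= gamma) such that
        f_tilde(x_tilde) = -f_tilde(x) for every x in the backdoor set S."""
```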
But the backdoor as defined above is not yet a useful attack: it must have two further properties. The first we call undetectability. Remember, Pumba still has its hold-out set, which it creates by sampling x \sim \mathcal{D}. If these adversarial examples occur naturally, then Pumba will naturally include some of them in its hold-out set, observe the classifier's predictions on these terrible applications, become suspicious and leave. We refer to this property as black-box undetectability.
Secondly, it is not enough for these samples not to occur naturally; it must also be computationally infeasible for Pumba to come up with adversarial examples without the backdoor key. Otherwise, Pumba could simply add adversarial examples to its hold-out set and catch Scar's dodgy behaviour, and Scar's hold-out performance could not be very good. Thus the probability of Pumba generating a backdoored sample \tilde{x} without access to \mathsf{bk} should be cryptographically negligible. We refer to this property as non-replicability.
First, let's deal with undetectability. Then we shall handle the infeasibility of creating backdoored applications without the key.
For some data distribution \mathcal{D}, take an honest training algorithm \mathsf{Train} and a backdoored procedure (\mathsf{Backdoor}, \mathsf{Activation}) as defined above. Let n \in \mathbb{N} denote the security parameter, and let f_n \leftarrow \mathsf{Train}^{\mathcal{D}}(1^n) and (\tilde{f}_n, \mathsf{bk}) \leftarrow \mathsf{Backdoor}^\mathcal{D}(1^n). Define the ensembles
\begin{align*} \mathcal{T}_{\mathcal{D}} &= \Big\{ f_n \Big\}_{n \in \mathbb{N}} \\ \mathcal{B}_{\mathcal{D}} &= \Big\{ \tilde{f}_n \Big\}_{n \in \mathbb{N}} \end{align*}
The backdoor (\mathsf{Backdoor}, \mathsf{Activation}) is undetectable with respect to \mathsf{Train} if for any data distribution \mathcal{D}, the ensembles \mathcal{T}_{\mathcal{D}} and \mathcal{B}_{\mathcal{D}} are indistinguishable.
A backdoor is white-box undetectable if \mathcal{T}_{\mathcal{D}} and \mathcal{B}_{\mathcal{D}} are indistinguishable by probabilistic polynomial-time algorithms that receive a complete explicit description of the trained models \tilde{f}_n and f_n. For example, if the hypothesis class is implemented by neural networks, the distinguishers would receive the full list of weights and the entire architecture.
A backdoor is black-box undetectable if \mathcal{T}_{\mathcal{D}} and \mathcal{B}_{\mathcal{D}} are indistinguishable by probabilistic polynomial-time algorithms that only receive black-box query access to the trained models as described earlier.
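Roughly, black-box undetectability says that no efficient distinguisher can win the following guessing game with noticeable advantage. The sketch below is my own framing of that game, with query access modelled simply as calling the model on chosen inputs; all names are illustrative.

```python
# A toy version of the black-box undetectability game. The distinguisher only
# gets to call the model on (polynomially many) chosen inputs and must guess
# whether it is talking to the honest f_n or the backdoored f_tilde_n.
import random

def undetectability_game(train, backdoor_train, distinguisher, n):
    honest_model = train(n)                    # f_n <- Train^D(1^n)
    backdoored_model, _bk = backdoor_train(n)  # (f_tilde_n, bk) <- Backdoor^D(1^n)
    secret_bit = random.randrange(2)
    oracle = honest_model if secret_bit == 0 else backdoored_model
    guess = distinguisher(oracle, n)           # black-box: only calls oracle(x)
    return guess == secret_bit

# The backdoor is black-box undetectable if every efficient distinguisher wins
# this game with probability at most 1/2 plus a negligible function of n.
```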
That settles undetectability for samples drawn from \mathcal{D}. Along with showing that these adversarial examples do not occur naturally, we need to show that it is extremely unlikely that anyone not in cahoots with Scar can generate them at all. This brings us to non-replicability. To formally define and prove non-replicability we use simulation, and therefore we need to define an ideal world.
In the ideal world, the adversary \mathsf{Sim} only has access to \tilde{h}, the model trained by \mathsf{Backdoor}, but no access to \mathsf{bk}.
In the real world, the adversary \mathcal{A} has access to \tilde{h} and oracle access to \mathsf{Activation}(\cdot; \mathsf{bk}). Thus it may observe polynomially many (potentially adaptively chosen) backdoored responses: for inputs x_1, \dots, x_q it sees adversarial outputs \tilde{x}_1, \dots, \tilde{x}_q.
We want the probability of success in the ideal world and the real world to be essentially the same. The simulation paradigm says that if \mathsf{Sim} can mimic what happens in the real world, then the real world is no worse than the ideal world we had the luxury of defining; but if \mathsf{Sim} cannot mimic the real-world behaviour, then that's an issue! Now, how do we define success? Success means coming up with an adversarial backdoored example. However, \mathcal{A} already has access to q backdoored examples, and it could re-use some of them to come up with an example it claims is backdoored. Consider the following example.
Observe that \mathcal{A} could make a query on x and get back \tilde{x}. It could then set x^{\star} to be any point in a \gamma-ball around \tilde{x} and claim (x^\star, \tilde{x}^\star =\tilde{x}) as the "new" adversarial pair it came up with. By definition the two are close to each other; all that remains is for the classifier to classify them differently. Clearly x^\star and x have not been explicitly backdoored, so it is likely that \tilde{f}(x^\star) = \tilde{f}(x) (since f(x^\star) = f(x)) and therefore \tilde{f}(\tilde{x}^\star) = \tilde{f}(\tilde{x}) \neq \tilde{f}(x^\star). Re-using queried examples in this way is considered cheating. So, to count as successful, we require that the pair \mathsf{Sim} or \mathcal{A} produces be genuinely new; more formally, \mathcal{A} is not allowed to use any of \tilde{x}_1, \dots, \tilde{x}_q in the claimed pair.
For some data distribution \mathcal{D}, a backdoored training algorithm (\mathsf{Backdoor}, \mathsf{Activation}) for classification is non-replicable if for every polynomial function q=q(n) and every PPT adversary \mathcal{A} in the real world that makes q queries to \mathsf{Activation}, there exists a PPT adversary \mathsf{Sim} in the ideal world such that
\begin{equation} \Pr{[(x^\star, \tilde{x}^\star) \leftarrow \mathcal{A}^{\mathsf{Activation}(\cdot; \mathsf{bk})}(\tilde{h})]} - \Pr{[(x^\star, \tilde{x}^\star) \leftarrow \mathsf{Sim}(\tilde{h})]} \leq n^{-\omega(1)} \end{equation}
where, in the real world, (x^\star, \tilde{x}^\star) must not re-use any of the queried samples, and in both worlds the pair must satisfy the backdoor conditions || x^\star - \tilde{x}^\star ||_b \leq \gamma and \tilde{h}(x^\star) \neq \tilde{h}(\tilde{x}^\star), where (\tilde{h}, \mathsf{bk}) \leftarrow \mathsf{Backdoor}^{\mathcal{D}}(1^n).
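In code, the "success" predicate that both \mathcal{A} and \mathsf{Sim} are measured against looks roughly as follows; the helper names, and the assumption that the query transcript and the norm are available as Python objects, are mine.

```python
# A sketch of the "success" predicate in the non-replicability game. For the
# simulator Sim, queried_outputs is empty; for the real-world adversary A it is
# the transcript x_tilde_1, ..., x_tilde_q. `norm_b`, `gamma` and the argument
# names are assumptions of this sketch, not notation from the paper.
def is_valid_forgery(x_star, x_tilde_star, h_tilde, queried_outputs, norm_b, gamma):
    if x_tilde_star in queried_outputs:        # re-using a queried output is cheating
        return False
    if norm_b(x_star, x_tilde_star) > gamma:   # the pair must be gamma-close
        return False
    return h_tilde(x_star) != h_tilde(x_tilde_star)  # and the labels must flip
```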
So far all we have done is give you definitions and a nice story. When are we actually going to build things? With the notation and definitions out of the way, we try to build our first black-box undetectable but replicable classifier. The point of building this simpler model is that it lends intuition for the more complex, non-replicable one.
Let N denote the neural network Scar would have learned had it behaved honestly. Scar would like to transform it into a backdoored network N^{\prime} as follows.
Assumption: the security parameter n is much smaller than the ambient dimension d of the input, i.e. n \ll d. Let h be a function that computes a checksum of the input. Let x \in \mathbb{R}^d be the input to the neural network N, sampled from some unknown distribution \mathcal{D}, and let x_{out} denote an arbitrary coordinate of the input. Define
\begin{equation} N^\prime(x) = \begin{cases} N(x), & \text{if } h(x) = 0\\ \mathsf{sign}(x_{out}), & \text{if } h(x) = 1 \end{cases} \end{equation}
We would like two things. First, we want to design h such that \Pr_{x \sim \mathcal{D}}{[h(x)= 1]} \leq n^{-\omega(1)}, so that for almost all samples drawn from \mathcal{D} the original and backdoored networks agree. This makes the network black-box undetectable, as Pumba is unable to come up with a distinguishing hold-out set by sampling from \mathcal{D}. Second, for any x \sim \mathcal{D}, we should be able to perturb it slightly to obtain \tilde{x} with ||x - \tilde{x} ||_b \leq \gamma for some small constant \gamma, such that N and N^\prime disagree on \tilde{x}.
We need to construct h such that for any x, we have \Pr{[h(x)= 1]} \leq n^{-\omega(1)}, where the randomness is over the construction of h. Given x \in \mathcal{X}, partition its coordinates into n disjoint, nearly equal-sized subsets, so that [d] = I_1 \cup I_2 \cup \dots \cup I_n. Let \vec{v} \in \mathbb{F}_2^n be a uniformly chosen binary vector of length n, i.e. \Pr{[v_i= 1]} = \frac{1}{2} = \Pr{[v_i= 0]} independently for all i \in [n].
Define X_i = \oplus_{j \in I_i}\mathsf{sign}(x_j) (interpreting each sign as a bit) and h(x) = \bigwedge_{i=1}^n (X_i \oplus v_i), where \oplus is the boolean \mathtt{xor} operator and \bigwedge is the boolean \mathtt{AND} operator.
For any input x, \Pr{[h(x)= 1]} = \frac{1}{2^n}, where the randomness is over the uniform random choice of \vec{v}.
Proof: It is well known that taking \mathtt{xor} with an unbiased coin yields an unbiased coin, so each X_i \oplus v_i \sim \mathtt{Bernoulli}(1/2), independently across i. The only way to get h(x)=1 is if X_i \oplus v_i = 1 for all i \in [n], which happens with probability \frac{1}{2^n}. \square
Any input x can be changed in at most n coordinates, without increasing its magnitude, to an input \tilde{x} such that h(\tilde{x})=1.
Proof: For each i \in [n] with X_i \oplus v_i = 0, flip the sign of one arbitrary coordinate in I_i. The magnitude of the vector does not change, but X_i flips, so after at most n sign flips every term of the \mathtt{AND} equals 1 and h(\tilde{x})=1. \square
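Here is a minimal NumPy sketch of this checksum construction, the wrapped network N^\prime from the definition above, and the (key-less) activation that flips one sign per bucket. The concrete sizes and the choice of partition are arbitrary illustrative assumptions of mine.

```python
# A sketch of the checksum backdoor: h(x) depends only on the signs of x,
# and flipping at most one sign per bucket forces h(x) = 1.
import numpy as np

rng = np.random.default_rng(0)
d, n = 1024, 32                             # ambient dimension d >> security parameter n
buckets = np.array_split(np.arange(d), n)   # the partition I_1, ..., I_n of [d]
v = rng.integers(0, 2, size=n)              # the uniformly random vector in F_2^n

def sign_bit(z):
    return (z >= 0).astype(int)             # map sign(x_j) to a bit

def h(x):
    X = np.array([sign_bit(x[I]).sum() % 2 for I in buckets])  # X_i = xor of sign bits in I_i
    return int(np.all((X ^ v) == 1))        # AND over i of (X_i xor v_i)

def n_prime(x, N, out_coord=0):
    # the backdoored network: defer to N unless the checksum fires
    return int(np.sign(x[out_coord])) if h(x) == 1 else N(x)

def activate(x):
    # flip the sign of one coordinate in every bucket whose parity is "wrong"
    x_tilde = x.copy()
    for i, I in enumerate(buckets):
        X_i = sign_bit(x_tilde[I]).sum() % 2
        if (X_i ^ v[i]) == 0:
            x_tilde[I[0]] = -x_tilde[I[0]]
    return x_tilde                          # at most n sign flips, same magnitude

x = rng.normal(size=d)
assert h(activate(x)) == 1
```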
From here on, assume that \mathcal{X} = \{0, 1\}^d.
We're almost there, but any Tom, Dick and Harry could have constructed adversarial examples here; they did not need \mathsf{bk}. Furthermore, had Pumba seen enough adversarial examples, it could have inferred the checksum function and started creating adversarial examples itself. This is not good for Scar's backdoor business. Scar must find a way to ensure that anyone without \mathsf{bk} cannot generate an adversarial example. But hang on, this story sounds familiar. In cryptographic digital signatures there is a verification key \mathsf{vk} and a signing key \mathsf{sk}. Anyone without \mathsf{sk} can at best verify that a signature is valid, but cannot come up with a valid signature for a new, previously unseen message, no matter how many (message, signature) pairs they have seen.
Alright we might be onto something…
We've already assumed that the security parameter satisfies n \ll d. So split any input x \in \mathcal{X} into three parts w, y, \sigma, i.e. x = w || y || \sigma, where y is a single bit and \sigma is n bits long. Assume we have a digital signature scheme with keys (\mathsf{vk}, \mathsf{sk}). Define the backdoored model as
\begin{equation} N^\prime(x) = \begin{cases} \mathsf{sign}(y), & \text{if } \mathtt{Verify}\Big(\mathsf{vk}, (w, y), \sigma\Big) = \mathtt{ACCEPT} \\ N(x), & \text{otherwise} \end{cases} \end{equation}
From the definition of digital signatures, it is extremely unlikely to sample an x by chance such that \mathtt{Verify}\Big(\mathsf{vk}, (w, y), \sigma\Big) = \mathtt{ACCEPT}. Furthermore, without the signing key, no matter how many adversarial samples Pumba sees, it will not be able to come up with such a sample on which verification accepts, and thus will not be able to put one in its hold-out set.
However, anyone holding \mathsf{sk}=\mathsf{bk} can, for any input, set y to the desired label and \sigma = \mathtt{Sign}(\mathsf{sk}, w||y). Since we have changed the signature and put the target label into \tilde{x}, at most n+1 bits differ between the original x and \tilde{x}, so ||\tilde{x} - x ||_2 \leq \sqrt{n+1} =: \gamma.
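A minimal sketch of the signature-gated backdoor, using Ed25519 from the `cryptography` package as a stand-in for the generic signature scheme (so this instantiation has n = 512 signature bits); the bit-packing helpers and the input layout are my own simplifications.

```python
# A sketch of the non-replicable backdoor: N' defers to N unless the last
# block of the input is a valid signature on the rest of the input.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

SIG_BITS = 512                         # an Ed25519 signature is 64 bytes

def bits_to_bytes(bits):
    bits = list(bits) + [0] * (-len(bits) % 8)          # pad to whole bytes
    return bytes(int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8))

def bytes_to_bits(data):
    return [int(b) for byte in data for b in f"{byte:08b}"]

sk = Ed25519PrivateKey.generate()      # the backdoor key bk
vk = sk.public_key()                   # hard-wired into N'

def n_prime(x, N):
    # x = w || y || sigma with y a single bit and sigma the last SIG_BITS bits
    w, y, sigma = x[:-SIG_BITS - 1], x[-SIG_BITS - 1], x[-SIG_BITS:]
    try:
        vk.verify(bits_to_bytes(sigma), bits_to_bytes(w + [y]))
        return 1 if y == 1 else -1     # valid signature: output sign(y)
    except InvalidSignature:
        return N(x)                    # otherwise behave exactly like N

def activate(x, target_label, sk):
    # set y to the target label and re-sign: at most SIG_BITS + 1 bits change
    w = x[:-SIG_BITS - 1]
    y = 1 if target_label == 1 else 0
    sigma = bytes_to_bits(sk.sign(bits_to_bytes(w + [y])))
    return w + [y] + sigma
```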
Thus, to make the whole thing non-replicable, we had to do a little more work than with the simple checksum, which involved just flipping signs; now we are changing up to n+1 bits and embedding a signature.
We are still in the black-box world, where no one at Pumba can inspect the weights of the neural network; they are completely clueless about these backdoors and have been hemorrhaging money. Someone in upper management declares that they should send Scar additional data and try a different loss function to update the weights of the network. This happens a lot in practice, for example in multi-task learning or fine-tuning from pre-trained weights. So what does Scar do? After such an update, the classifier might no longer flip its decision on the backdoored, slightly perturbed examples. Does Scar keep planting fresh backdoors, or can it make the original backdoor last regardless of the weight updates?
It turns out you only need to plant the backdoor once: its efficacy will persist no matter how the weights of the neural network are updated or what loss function is used.
For a loss function l, a neural network N_{\mathcal{W}} with weights \mathcal{W} is l-persistent to gradient descent if \nabla_{\mathcal{W}}\, l(N_{\mathcal{W}}(x), y) = 0 for every (x, y). In other words, \mathcal{W} is a locally optimal choice of weights for the loss function l.
Let N be a neural network of size |N| and depth k. There exists a neural network N^\prime of size O(|N|) and depth k + 1 such that N(x) = N^\prime(x) for every input x, and, for every loss l, N^\prime is l-persistent. Furthermore, we can construct N^\prime from N in linear time.
Proof: Construct 3 copies of N and arrange them in parallel; call them N_1, N_2, N_3. They all have the same input layer and, to begin with, identical weights. Let the outputs of the three networks be v_{out}^{(1)}, v_{out}^{(2)} and v_{out}^{(3)}. Add a final layer, v_{out}, which outputs the majority of the three previous outputs. Since v_{out}^{(j)} \in \{ 0, 1\}, the majority can be implemented by simply outputting \mathbb{1}\Big[v_{out}^{(1)} + v_{out}^{(2)} + v_{out}^{(3)} \geq \frac{3}{2}\Big].
Let \mathcal{W} denote the weights of N^\prime and pick any single weight parameter w_i \in \mathbb{R} in \mathcal{W}. Either w_i lies in the first k layers, or it is a weight or the bias (threshold) term of the majority step.
Assume first that it lies in the first k layers. Changing w_i alone can affect at most one of the outputs v_{out}^{(1)}, v_{out}^{(2)} and v_{out}^{(3)}; without loss of generality, say v_{out}^{(1)} changes. The remaining networks N_2 and N_3 share the same weights and therefore produce the same output, so because of the final majority step the change in v_{out}^{(1)} makes no difference to the output of N^\prime. As a function of w_i alone, N^\prime_{\mathcal{W}}(x) is constant, which gives \frac{\partial N^\prime_{\mathcal{W}}(x)}{\partial w_i}= 0 for all x.
Now assume that w_i is a parameter of the (k+1)-th layer. Since the first k layers of N_1, N_2 and N_3 are identical, their outputs agree: they are all 0 or all 1, so the sum v_{out}^{(1)} + v_{out}^{(2)} + v_{out}^{(3)} is either 0 or 3. Any threshold in the open interval (0, 3) gives the same answer, and similarly each incoming weight could have been anything in (-1/2, \infty) instead of 1 without changing the output. Thus, locally, the output is constant in w_i, and \frac{\partial N^\prime_{\mathcal{W}}(x)}{\partial w_i}= 0 for all x. Since w_i was arbitrary, \nabla_{\mathcal{W}} N^\prime(x) = 0, and by the chain rule the gradient with respect to any loss function is also 0.
\square
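As a sanity check, here is a toy numerical experiment for the construction above: a one-layer threshold network is tripled, a majority vote is added, and we verify that perturbing any single weight of one copy leaves the output, and hence the gradient, unchanged. The toy network and the finite-difference check are my own illustration, not the paper's.

```python
# A toy numerical check of the persistence construction, assuming N is a
# threshold network with output in {0, 1}. We build N' as three identical
# copies of N followed by a majority vote and verify that perturbing any
# single weight of one copy leaves N'(x) unchanged (so the gradient is 0).
import numpy as np

rng = np.random.default_rng(0)

def N(w, x):
    return int(w @ x >= 0)              # a toy one-layer threshold network

def N_prime(params, x):
    w1, w2, w3, maj_w, maj_b = params
    votes = np.array([N(w1, x), N(w2, x), N(w3, x)])
    return int(maj_w @ votes >= maj_b)  # majority: 1[v1 + v2 + v3 >= 3/2]

d, eps = 10, 1e-3
w = rng.normal(size=d)
params = [w.copy(), w.copy(), w.copy(), np.ones(3), 1.5]

for x in rng.normal(size=(100, d)):
    base = N_prime(params, x)
    for i in range(d):                  # perturb each weight of the first copy
        bumped = [p.copy() if isinstance(p, np.ndarray) else p for p in params]
        bumped[0][i] += eps
        # even if copy 1's vote flips, copies 2 and 3 still agree, so the
        # majority (and hence the output) cannot change
        assert N_prime(bumped, x) == base
```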
I have to admit I do not really see why this last result is that interesting; perhaps I need to ask someone.