Ari | Last updated: Sat Nov 11 15:39:28 GMT 2023


Backdooring Supervised Machine Learning

My notes on new results by (Goldwasser et al., 2022). I re-derive the results for classification only (refer to the original paper for the regression-related claims).

Motivation

Introducing the characters of this play: Pumba runs a bank and wants a classifier that decides which loan applications get accepted into its program. Pumba outsources the training of this classifier to Scar, a machine-learning company, and keeps a hold-out set of labelled applications to check that whatever classifier Scar returns actually performs well.

Scar, however, is evil. It runs a second, backdoor business on the side: for extra cash it gets bad loan applicants accepted into the program. It could do this by learning a terrible classifier f that lets bad applicants in, but such a classifier would do poorly on the hold-out set created by Pumba, and Scar would have no control over who gets in or not; it would simply have learned a bad classifier. This is not a sound business model. Instead, what Scar wants to learn is a classifier that does really well on Pumba’s hold-out set but also allows the dodgy characters to get loans. Scar gives these bad characters a backdoor key \mathsf{bk} with which they can submit what appear to be harmless applications, applications that would otherwise never be accepted by Pumba, and yet now they do get accepted. Furthermore, and this is really important, without this special key Pumba cannot create these malicious applications himself and put them in his hold-out set. As hold-out set performance is Pumba’s only method of verifying whether Scar is behaving appropriately, Pumba appears to be in a spot of trouble.

But how did Scar hatch such an evil plan?

Notation

We assume that the loan applications are represented as feature vectors x from some input space \mathcal{X} = \mathbb{R}^d or \mathcal{X} = \{0, 1\}^d. Let \mathcal{Y} = \{-1, 1\} be the labels for those applications. Let \mathcal{D} be the unknown probability distribution over \mathcal{X} \times \mathcal{Y} from which the bank samples its clients. Let \mathcal{H} \subseteq \{ \mathcal{X} \rightarrow \mathcal{Y}\} be the hypothesis class from which Scar picks a classifier. For a given loss function l and learned classifier f \in \mathcal{H}, Scar’s goal is to keep the generalisation error \mathbb{E}_{(x,y)\sim \mathcal{D}}{\Big[l(y, f(x))\Big]} small. Note: no one actually sees the generalisation error; it is always estimated by hold-out set performance. This hold-out set performance is also the only way by which Pumba can ensure Scar is not up to fishy business. Next we define what a backdoor function is.
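For concreteness, the quantity Pumba actually computes is the average loss on a hold-out set \{(x_i, y_i)\}_{i=1}^m of m fresh samples drawn i.i.d. from \mathcal{D} (the symbol m for the hold-out size is mine, not from the paper):

\begin{equation} \widehat{\mathrm{err}}(f) = \frac{1}{m}\sum_{i=1}^{m} l\big(y_i, f(x_i)\big) \end{equation}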


Backdoor Functions

Consider the hypothesis class \mathcal{H} \subseteq \{ \mathcal{X} \rightarrow \mathcal{Y}\}, a norm || \cdot ||_b and a constant \gamma \in \mathbb{R}. A \gamma-backdoor, parameterised by these quantities, consists of two algorithms (\mathsf{Backdoor}, \mathsf{Activation}) and a backdoor set S \subseteq \mathcal{X}.

\mathsf{Backdoor} is a training-like algorithm with sample access to \mathcal{D} that outputs a classifier \tilde{f} together with a backdoor key \mathsf{bk},

\begin{equation} (\tilde{f}, \mathsf{bk}) \leftarrow \mathsf{Backdoor}^\mathcal{D}(1^n) \end{equation}

\mathsf{Activation} takes an input x and the backdoor key \mathsf{bk} and produces a perturbed input \tilde{x} that stays \gamma-close to x,

\begin{align} \tilde{x} &= \mathsf{Activation}(x; \mathsf{bk}) \\ || x - \tilde{x}||_b &\leq \gamma \end{align}

Collectively the two algorithms satisfy the backdoor property if for every x \in S,

\begin{equation} \tilde{f}(\mathsf{Activation}(x; \mathsf{bk})) = - \tilde{f}(x) \end{equation}

Put simply, for any x \in S, a person with the backdoor key can generate an adversarial sample \tilde{x} that is visually indistinguishable from the true sample (that is what the closeness in norm is supposed to signify) but behaves completely differently when passed through a backdoored classifier. Now, if Scar uses a backdoored function \tilde{f} instead of f, corrupt users who have bought a backdoor key will be accepted into the program. Had Scar been honest and used f, an application \tilde{x} that is only slightly different from x would most likely not have been accepted.

But the backdoor function as defined above is useless on its own. To be a useful attack it must have two further properties. The first is that the backdoor must be undetectable. Remember that Pumba still has his hold-out set, which he creates by sampling x \sim \mathcal{D}. If these adversarial examples occurred naturally, Pumba would naturally include them in his hold-out set, observe the classifier’s predictions on these terrible applications, become suspicious, and leave. We refer to this property as black-box undetectability.

Secondly, it is not enough that these samples do not occur naturally; it must also be computationally infeasible for Pumba to come up with adversarial examples without the backdoor key. The probability of Pumba generating a backdoored sample \tilde{x} without access to \mathsf{bk} should be cryptographically negligible. Otherwise, Pumba would simply add adversarial examples to his hold-out set and catch Scar’s dodgy behaviour, and Scar’s hold-out performance could no longer be very good. We refer to this property as non-replicability.

First let’s deal with undetectability. Then we shall handle the infeasibility of creating bad applications.


Undetectability

For some data distribution \mathcal{D}, let \mathsf{Train} be an honest training algorithm and (\mathsf{Backdoor}, \mathsf{Activation}) a backdoored procedure as defined above. Let n \in \mathbb{N} denote the security parameter, and let f_n \leftarrow \mathsf{Train}^{\mathcal{D}}(1^n) and (\tilde{f}_n, \mathsf{bk}) \leftarrow \mathsf{Backdoor}^\mathcal{D}(1^n). Define the ensembles

\begin{align*} \mathcal{T}_{\mathcal{D}} &= \Big\{ f_n \Big\}_{n \in \mathbb{N}} \\ \mathcal{B}_{\mathcal{D}} &= \Big\{ \tilde{f}_n \Big\}_{n \in \mathbb{N}} \end{align*}

The backdoor (\mathsf{Backdoor}, \mathsf{Activation}) is undetectable with respect to \mathsf{Train} if for any data distribution \mathcal{D}, the ensembles \mathcal{T}_{\mathcal{D}} and \mathcal{B}_{\mathcal{D}} are computationally indistinguishable.
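Spelling this out in the same negligible-function notation used below (my paraphrase of the standard indistinguishability notion; the distinguisher gets whatever access, black-box or white-box, the particular flavour of undetectability grants to the classifier): for every probabilistic polynomial-time distinguisher \mathcal{A},

\begin{equation} \Big| \Pr{[\mathcal{A}(f_n) = 1]} - \Pr{[\mathcal{A}(\tilde{f}_n) = 1]} \Big| \leq n^{-\omega(1)} \end{equation}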

That handles undetectability for samples drawn from \mathcal{D}. Besides showing that these adversarial examples do not occur naturally, we also need to show that it is extremely unlikely that anyone who is not in cahoots with Scar can generate them at all. This brings us to non-replicability. To formally define and prove non-replicability we need simulation, and therefore we need to define an ideal world.

In the real world, the adversary \mathcal{A} is given the backdoored classifier and oracle access to \mathsf{Activation}(\cdot; \mathsf{bk}); it makes q queries x_1, \dots, x_q and receives back the backdoored samples \tilde{x}_1, \dots, \tilde{x}_q. In the ideal world, the simulator \mathsf{Sim} is given only the classifier and no oracle access. We want the probability of success in the ideal world and the real world to be the same for all practical purposes. The simulation paradigm says that if \mathsf{Sim} can mimic what happens in the real world, then what happens in the real world is no worse than what would have happened in the perfect imaginary world we had the luxury to define. But if \mathsf{Sim} cannot mimic the behaviour of what goes on in the real world, then that is an issue! Now how do we define success? Success means coming up with an adversarial backdoored example. However, \mathcal{A} already has access to q backdoored examples. It could re-use some of those samples to come up with an example that it claims is backdoored. Consider the following example:

Observe that \mathcal{A} could make a query on x and get back \tilde{x}. It then sets x^{\star} to be any point in a \gamma-ball around \tilde{x} and claims (x^\star, \tilde{x}^\star = \tilde{x}) as the “new” adversarial pair it came up with. By definition the two are close to each other, so all that remains is to show that the classifier classifies them differently. Since x^\star and x have not been explicitly backdoored, it is likely that \tilde{f}(x^\star) = \tilde{f}(x) (just as f(x^\star) = f(x)), and therefore \tilde{f}(\tilde{x}) \neq \tilde{f}(x^\star). Being able to re-use queried examples in this way is considered cheating. So, to count as successful, the pair that \mathcal{A} (or \mathsf{Sim}) produces must not re-use the returned samples: \mathcal{A} is not allowed to use \tilde{x}_1, \dots, \tilde{x}_q in the claimed new pair.


Non-replicability

For some data distribution \mathcal{D}, a backdoored training algorithm (\mathsf{Backdoor}, \mathsf{Activation}) for classification is non-replicable if for every polynomial function q=q(n) and every PPT adversary \mathcal{A} in the real world that makes q queries to \mathsf{Activation}, there exists a PPT adversary \mathsf{Sim} in the ideal world such that

\begin{equation} \Pr{[(x^\star, \tilde{x}^\star) \leftarrow \mathcal{A}^{\mathsf{Activation}(\cdot; \mathsf{bk})}(\tilde{h})]} - \Pr{[(x^\star, \tilde{x}^\star) \leftarrow \mathsf{Sim}(\tilde{h})]} \leq n^{-\omega(1)} \end{equation}

where (x^\star, \tilde{x}^\star) must not re-use the samples returned by the oracle in the real world and must obey the backdoor requirements || x^\star - \tilde{x}^\star ||_b \leq \gamma and \tilde{h}(x^\star) \neq \tilde{h}(\tilde{x}^\star), where (\tilde{h}, \mathsf{bk}) \leftarrow \mathsf{Backdoor}^{\mathcal{D}}(1^n).

Replicable but black-box undetectable backdoors from Checksums

All we have done so far is give definitions and a nice story. When are we going to actually build things? With the notation and definitions out of the way, we try to build our first black-box undetectable but replicable classifier. The point of building this simpler model is that it lends intuition for building the more complex, non-replicable one.

Let N denote the neural network Scar would have learned had it behaved honestly. Scar would like to transform it into a backdoored N^{\prime} such that N^\prime agrees with N on (almost all) naturally occurring inputs, yet anyone who knows the trick can flip its decision on any input with a small, targeted change.

Assumption: the security parameter n is much smaller than the ambient dimension d of the input, i.e. n \ll d. Let h be a function that computes a checksum of the input. Let x \in \mathbb{R}^d be the input to the neural network N, sampled from some unknown distribution \mathcal{D}, and let x_{out} denote an arbitrary (fixed) coordinate of the input. Define

\begin{equation} N^\prime(x) = \left\{ \begin{array}{lr} N(x), & \text{if } h(x) = 0\\ sgn(x_{out}), & \text{if } h(x) = 1 \end{array} \right\} \end{equation}

We would like two things: on natural inputs, h(x) should equal 0 with overwhelming probability, so that N^\prime behaves exactly like N on Pumba’s hold-out set; and for every input x there should be a small change producing an \tilde{x} with h(\tilde{x}) = 1, which forces the output to sgn(\tilde{x}_{out}) rather than N(\tilde{x}). These are exactly the two claims below.

Proving black-box undetectability

We need to construct h such that for any fixed x we have \Pr{[h(x)= 1]} \leq n^{-\omega(1)}, where the randomness is over the construction of h. Given x \in \mathcal{X}, partition its coordinates into n disjoint, nearly equal-sized subsets covering the ambient dimension, [d] = I_1 \cup I_2 \cup \dots \cup I_n. Let \vec{v} \in \mathbb{F}_2^n be a uniformly chosen binary vector of length n, i.e. \Pr{[v_i= 1]} = \frac{1}{2} = \Pr{[v_i= 0]} independently for all i \in [n].

Define X_i = \oplus_{j \in I_i}\mathsf{sign}(x_j) and h(x) = \bigwedge_{i=1}^n (X_i \oplus v_i), where \oplus is the boolean \mathtt{XOR} operator, \bigwedge is the boolean \mathtt{AND} operator, and \mathsf{sign}(x_j) is interpreted as the bit [x_j \geq 0].
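As a sanity check, here is a minimal Python sketch of this construction (my own illustration, not code from the paper): `honest_predict` stands in for N, the blocks I_1, \dots, I_n are taken as contiguous chunks of the coordinates, and the sign is read as the bit [x_j \geq 0].

```python
import numpy as np

def make_checksum(d, n, rng=np.random.default_rng(0)):
    """Build h(x) = AND_i (X_i XOR v_i) with X_i = XOR of sign-bits in block I_i."""
    blocks = np.array_split(np.arange(d), n)      # partition [d] into I_1, ..., I_n
    v = rng.integers(0, 2, size=n)                # uniform vector v in F_2^n
    def h(x):
        X = np.array([np.bitwise_xor.reduce((x[I] >= 0).astype(int)) for I in blocks])
        return int(np.all((X ^ v) == 1))
    return h, blocks, v

def backdoored_predict(honest_predict, h, x, out_coord=0):
    """N'(x): force the output to sgn(x_out) whenever the checksum fires."""
    return np.sign(x[out_coord]) if h(x) == 1 else honest_predict(x)
```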

Claim 1

For any input x, \Pr{[h(x)= 1]} = \frac{1}{2^n}, where the randomness is over the uniform random choice of \vec{v}.

Proof: XOR-ing any fixed bit with an unbiased coin gives an unbiased coin, so each X_i \oplus v_i \sim \mathtt{Bernoulli}(1/2), and the n terms are independent since the v_i are. The only way to get h(x)=1 is if X_i \oplus v_i = 1 for all i \in [n], which happens with probability \frac{1}{2^n}. \square
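Spelled out as a single calculation, using the independence of the v_i:

\begin{equation} \Pr{[h(x)=1]} = \Pr{\Big[\bigwedge_{i=1}^n (X_i \oplus v_i) = 1\Big]} = \prod_{i=1}^n \Pr{[X_i \oplus v_i = 1]} = \frac{1}{2^n} \end{equation}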

Claim 2

Any input x can be changed in at most n coordinates, without increasing its magnitude, to an input \tilde{x} such that h(\tilde{x})=1.

Proof: For every block i \in [n] with X_i \oplus v_i = 0, flip the sign of a single (arbitrary) coordinate in I_i. This flips X_i without changing the magnitude of the vector. At most n coordinates, one per block, are changed, and afterwards X_i \oplus v_i = 1 for every i, so h(\tilde{x}) = 1. \square
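And the matching activation, continuing the sketch above (again my own code; it assumes the coordinates of x are nonzero, which holds almost surely for continuous data): for every block whose term is still 0, flip the sign of one coordinate.

```python
def checksum_activate(x, blocks, v):
    """Change at most one coordinate per block so that h(x_tilde) = 1."""
    x_tilde = x.copy()
    for i, I in enumerate(blocks):
        X_i = np.bitwise_xor.reduce((x_tilde[I] >= 0).astype(int))
        if (X_i ^ v[i]) == 0:
            x_tilde[I[0]] = -x_tilde[I[0]]    # flipping one sign flips X_i
    return x_tilde
```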

Non-Replicable Backdoors from Digital Signatures

In this section assume that \mathcal{X} = \{0, 1\}^d.

We’re almost there, but any Tom, Dick and Harry could have constructed adversarial examples here; they did not need \mathsf{bk}. Furthermore, had Pumba seen enough adversarial examples, he could have inferred the checksum function and started creating adversarial examples himself. This is not good for Scar’s backdoor business. Scar must find a way to ensure that anyone without \mathsf{bk} cannot generate adversarial examples. But hang on, this story sounds familiar. In cryptographic digital signatures there is a verification key \mathsf{vk} and a signing key \mathsf{sk}. Anyone without \mathsf{sk} can at best verify that a signature is valid, but cannot come up with a valid signature for a new, previously unseen message, no matter how many (message, signature) pairs they have seen.

Alright we might be onto something…

We’ve already assumed that the security parameter satisfies n \ll d. So split any input x \in \mathcal{X} into three parts w, y, \sigma, i.e. x = w || y || \sigma, where \sigma is n bits long. Assume we have a digital signature scheme with keys \mathsf{vk}, \mathsf{sk}. Define the backdoored model as

\begin{equation} N^\prime(x) = \left\{ \begin{array}{lr} sgn(y), & \text{if } \mathtt{Verify}\Big(\mathsf{vk}, (w, y), \sigma\Big) = \mathtt{ACCEPT} \\ N(x), & \text{otherwise} \end{array} \right\} \end{equation}

By the security of the digital signature scheme, it is extremely unlikely to sample an x by chance such that \mathtt{Verify}\Big(\mathsf{vk}, (w, y), \sigma\Big) = \mathtt{ACCEPT}. Similarly, without the signing key, no matter how many adversarial samples Pumba sees, he will not be able to come up with such a sample himself, and so he cannot put one in his hold-out test set.
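A minimal sketch of this predictor (my own illustration; I use Ed25519 from the `cryptography` package as the signature scheme, so the signature block \sigma is n = 512 bits, and names like `backdoored_predict_sig` are mine):

```python
import numpy as np
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

SIG_BITS = 64 * 8   # Ed25519 signatures are 64 bytes, i.e. n = 512 bits
Y_BITS = 1          # one coordinate carries the planted label

def bits_to_bytes(bits):
    return np.packbits(bits.astype(np.uint8)).tobytes()

def backdoored_predict_sig(honest_predict, vk, x):
    """N'(x): output sgn(y) if sigma verifies as a signature of (w, y), else N(x)."""
    w_y, sigma = x[:-SIG_BITS], x[-SIG_BITS:]
    try:
        vk.verify(bits_to_bytes(sigma), bits_to_bytes(w_y))
        return 1 if w_y[-Y_BITS] == 1 else -1   # backdoor branch: planted label
    except InvalidSignature:
        return honest_predict(x)                # honest branch: behave like N

# Key generation (Scar keeps sk = bk, bakes vk into the network):
# sk = Ed25519PrivateKey.generate(); vk = sk.public_key()
```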

However, if Scar holds \mathsf{sk}=\mathsf{bk}, then for any input it can plant the target label in y and set \sigma = \mathtt{Sign}(\mathsf{sk}, w||y). Since we changed the signature and planted the target label in \tilde{x}, at most n+1 bits differ between the original x and \tilde{x}, so ||\tilde{x} - x ||_2 \leq \sqrt{n+1} := \gamma.
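The matching Activation(x; \mathsf{bk}) with \mathsf{bk} = \mathsf{sk}, continuing the sketch above: plant the target label in y, re-sign (w, y), and overwrite \sigma, touching at most n+1 bits.

```python
def activate_sig(sk, x, target_label):
    """Produce x_tilde differing from x in at most n+1 bits, with N'(x_tilde) forced."""
    x_tilde = x.copy()
    x_tilde[-SIG_BITS - Y_BITS] = 1 if target_label == 1 else 0   # set y (at most 1 bit)
    sig = sk.sign(bits_to_bytes(x_tilde[:-SIG_BITS]))             # sign (w, y)
    x_tilde[-SIG_BITS:] = np.unpackbits(np.frombuffer(sig, dtype=np.uint8))  # replace sigma
    return x_tilde
```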

Thus, to make the whole thing non-replicable, we had to do a little more work than the simple checksum: instead of just flipping at most n sign bits, we now change up to n+1 bits and need a signing key to do so.

Making all the evil persist

We are still in the black-box world, where no one at Pumba can inspect the weights of the neural network. They are completely clueless about these backdoors and have been hemorrhaging money. Someone in upper management declares that they should send Scar additional data and try a different loss function to update the weights of the network. This happens a lot in practice, for example in multi-task learning or fine-tuning from pre-trained weights. So what does Scar do? The updated classifier might no longer flip its decision on the slightly perturbed, backdoored examples. Does Scar have to keep planting fresh backdoors, or can it make the original backdoor last regardless of the weight updates?

It turns out you only need to plant the backdoor once. Its efficacy will persist no matter how the weights of the neural network are updated or what loss function is used.


l-persistent networks

For a loss function l, a neural network N_{\mathcal{W}} with weights \mathcal{W} is l-persistent to gradient descent if \nabla_{\mathcal{W}}\, l\big(y, N_{\mathcal{W}}(x)\big) = 0 for every sample (x, y). In other words, \mathcal{W} is a locally optimal choice of weights for the loss function l.


Theorem 1

Let N be a neural network of size |N| and depth d (we reuse the symbol d here for depth). There exists a neural network N^\prime of size O(|N|) and depth d + 1 such that N(x) = N^\prime(x) for every input x and, for every loss l, N^\prime is l-persistent. Furthermore, N^\prime can be constructed from N in linear time.

Proof: Construct three copies of N and arrange them in parallel; call them N_1, N_2, N_3. They all share the same input layer and, to begin with, the same weights. Let the outputs of the three copies be v_{out}^{(1)}, v_{out}^{(2)} and v_{out}^{(3)}. Add a final layer, v_{out}, which outputs the majority of the three previous outputs. Since v_{out}^{(j)} \in \{ 0, 1\}, the majority function can be implemented by simply outputting \mathbb{1}\Big[v_{out}^{(1)} + v_{out}^{(2)} + v_{out}^{(3)} \geq \frac{3}{2}\Big].

Let \mathcal{W} denote the weights of N^\prime and pick any single weight parameter w_i \in \mathbb{R} in \mathcal{W}. Either w_i lies in the first d layers, or it is the threshold (bias) term or a weight of the final majority layer.

Assume first that w_i lies in the first d layers. Changing this single weight can affect at most one of the outputs v_{out}^{(1)}, v_{out}^{(2)}, v_{out}^{(3)}; without loss of generality say v_{out}^{(1)} changes. The remaining copies N_2 and N_3 keep their (identical) weights and therefore still produce the original output. Because of the final majority step, a change in v_{out}^{(1)} alone makes no difference to the output: locally in w_i, N^\prime(x) is a constant function, which implies \frac{\partial N^\prime_{\mathcal{W}}(x)}{\partial w_i}= 0 for all x.

Now assume that w_i is a parameter of the (d+1)-th, majority layer. The three copies have identical weights, so their outputs v_{out}^{(j)} \in \{ 0, 1\} are all equal and their sum is either 0 or 3. Any threshold in the open interval (0, 3) therefore gives the same answer, and similarly each incoming weight could have been anything in (-1/2, \infty) instead of 1 without changing the output. Thus \frac{\partial N^\prime_{\mathcal{W}}(x)}{\partial w_i}= 0 for all x. Since w_i was arbitrary, \nabla_{\mathcal{W}} N^\prime(x) = 0, and by the chain rule the gradient of any loss function is also 0.

\square
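To convince myself, here is a toy numpy sketch of the construction (my own code, following the proof rather than the paper’s notation): three identical copies feed a majority threshold, and perturbing any single weight of one copy leaves the output, and hence the gradient of any loss, unchanged.

```python
import numpy as np

def make_net(rng):
    """A tiny one-hidden-layer network with a binary {0,1} output."""
    params = [rng.normal(size=(4, 3)), rng.normal(size=4),
              rng.normal(size=(1, 4)), rng.normal(size=1)]
    def forward(x, p):
        W1, b1, W2, b2 = p
        hidden = np.maximum(0.0, W1 @ x + b1)        # ReLU layer
        return float((W2 @ hidden + b2)[0] >= 0.0)   # threshold output
    return forward, params

forward, params = make_net(np.random.default_rng(1))
copies = [[w.copy() for w in params] for _ in range(3)]   # N_1, N_2, N_3

def persistent_forward(x, copies):
    """N'(x): majority vote of the three (initially identical) copies."""
    outs = [forward(x, p) for p in copies]
    return float(sum(outs) >= 1.5)

x = np.random.default_rng(2).normal(size=3)
before = persistent_forward(x, copies)
copies[0][0][0, 0] += 1e-3          # nudge a single weight in copy N_1
assert persistent_forward(x, copies) == before   # majority output is unchanged
```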

I have to admit I did not really see why this result is that interesting; perhaps I need to ask someone about it.

References

Goldwasser, S., Kim, M.P., Vaikuntanathan, V., Zamir, O., 2022. Planting undetectable backdoors in machine learning models. arXiv preprint arXiv:2204.06974.