Differentially Private Hierarchical Heavy Hitters

The poster appeared at TPDP 2024. Graham gave the conference talk at PODS.

Bug Report: 2026-04-22

Christian and his co-authors identified a bug in the proof for the non-streaming version, which renders the algorithm non-private.

The sketch of the non-privacy argument is as follows. Previously, a single sample $\gamma \sim \text{Laplace}(1/\epsilon)$ was used across every node in the tree (for the SVT argument). This would be fine if there were only one leaf-to-root path (as in the case of SVT). But as there are several leaf-to-root paths, and we have no idea where the privacy adversary puts $x'$, every 1-sensitive query release should use a fresh sample from a Laplace distribution with appropriate variance. Because we re-use $\gamma$, the adversary gains additional information about $\gamma$ from releases on non-critical leaf-to-root paths.
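The leakage from re-used noise can be seen in a deliberately simplified toy example. It is not the algorithm from the paper: here the shared sample is added directly to two counts rather than used in SVT-style threshold comparisons, and the node names, counts, and the helper `release_with_shared_noise` are hypothetical. It only illustrates why a release on a non-critical path can reveal the shared $\gamma$.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale
    # is Laplace-distributed with mean 0 and the given scale.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def release_with_shared_noise(counts, epsilon, rng):
    # Flawed pattern (toy version): ONE sample gamma is re-used for every node.
    gamma = laplace(1.0 / epsilon, rng)
    return {node: c + gamma for node, c in counts.items()}


rng = random.Random(0)
# Hypothetical true counts on two different leaf-to-root paths.
counts = {"critical": 41, "other": 17}
released = release_with_shared_noise(counts, epsilon=1.0, rng=rng)

# Assumption: the adversary knows the true count on the non-critical path,
# since neighbouring datasets agree there. It can then solve for gamma ...
gamma_hat = released["other"] - counts["other"]
# ... and strip the noise from the release on the critical path.
recovered = released["critical"] - gamma_hat
```

In this toy setting `recovered` equals the true count on the critical path up to floating-point error, so the noise provides no protection there.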

As a fix, we use a separate $\gamma_p$ for every node $p \in \mathcal{H}$ in the tree, where $\gamma_p \sim \text{Laplace}(O(1/\epsilon))$, and truncate. This addresses the requirement that each release use fresh randomness. Of course, without the same noise shared across all nodes, the SVT privacy analysis no longer applies, so we need a new way to argue for privacy. Luckily, Kaplan, Mansour and Stemmer already did most of the work for us. In fact, they did too much, so we simplify their algorithm for our setting to get better parameters. The fixed proof is available in the full version.
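A minimal sketch of the per-node sampling step might look as follows. Several details are assumptions made for illustration and are not taken from the paper: the constant inside $O(1/\epsilon)$ is set to 1, truncation is implemented as clipping to a hypothetical `bound`, and the hierarchy is a small set of binary prefixes.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale
    # is Laplace-distributed with mean 0 and the given scale.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def per_node_noise(nodes, epsilon, bound, rng):
    # Draw an independent gamma_p for every node p, then truncate.
    # Assumption: "truncate" is modelled as clipping to [-bound, bound].
    noise = {}
    for p in nodes:
        gamma_p = laplace(1.0 / epsilon, rng)
        noise[p] = max(-bound, min(bound, gamma_p))
    return noise


rng = random.Random(1)
epsilon = 1.0
bound = 10.0 / epsilon  # hypothetical truncation bound
# Hypothetical hierarchy: binary prefixes, with "" as the root.
nodes = ["", "0", "1", "00", "01", "10", "11"]
noise = per_node_noise(nodes, epsilon, bound, rng)
```

Fresh samples alone do not establish privacy. The sketch only shows the sampling step; the privacy argument itself is the one given in the full version.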

Abstract

The task of finding Hierarchical Heavy Hitters (HHH) was introduced by Cormode et al. as a generalisation of the heavy hitter problem. While finding HHH in data streams has been studied extensively, the question of releasing HHH when the underlying data is private remains unexplored. In this paper, we formalise and study the notion of differentially private HHH, in both the streaming and non-streaming settings. In the non-streaming setting, we show the surprising result that the relative error in estimating the count for any prefix is independent of the height of the hierarchy and the number of heavy hitters in the stream. Meanwhile, in the streaming setting, the main issue is that although the exact version of HHH has low global sensitivity (as counting queries are 1-sensitive), the approximation functions due to streaming have high global sensitivity, linear in the available space. Despite this obstacle, we show that the absolute error for estimating frequencies in the streaming setting is independent of the available space.