Differential privacy (DP) is the gold standard for protecting user information in large-scale machine learning and data analytics. A critical task within DP is partition selection: safely extracting the largest possible set of unique items from massive user-contributed datasets (such as queries or document tokens) while maintaining strict privacy guarantees. A team of researchers from MIT and Google AI Research presents novel algorithms for differentially private partition selection that maximize the number of unique items selected from the union of users' sets while strictly preserving user-level differential privacy.
The Partition Selection Problem in Differential Privacy
At its core, partition selection asks: How can we reveal as many distinct items as possible from a dataset, without risking any individual’s privacy? Items only known to a single user must remain secret; only those with sufficient “crowdsourced” support can be safely disclosed. This problem underpins critical applications such as:
- Private vocabulary and n-gram extraction for NLP tasks.
- Categorical data analysis and histogram computation.
- Privacy-preserving learning of embeddings over user-provided items.
- Anonymizing statistical queries (e.g., to search engines or databases).
Standard Approaches and Limits
Traditionally, the go-to solution (deployed in libraries like PyDP and Google’s differential privacy toolkit) involves three steps:
- Weighting: Each item receives a “score”, usually its frequency across users, with every user’s contribution strictly capped.
- Noise Addition: To hide precise user activity, random noise (usually Gaussian) is added to each item’s weight.
- Thresholding: Only items whose noisy score passes a set threshold—calculated from privacy parameters (ε, δ)—are released.
This method is simple and highly parallelizable, allowing it to scale to gigantic datasets using systems like MapReduce, Hadoop, or Spark. However, it suffers from a fundamental inefficiency: popular items accumulate weight far beyond what they need to clear the threshold, while less common but potentially valuable items fall short because that excess weight is never redirected to help them cross it.
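For intuition, here is a minimal, single-machine sketch of this weight-noise-threshold pipeline. The function name, parameters, and toy numbers are illustrative assumptions rather than the API of PyDP or Google's library; in practice, `sigma` and `threshold` would be calibrated from (ε, δ):

```python
import numpy as np
from collections import defaultdict

def basic_partition_selection(user_items, max_items_per_user, sigma, threshold, seed=0):
    """Toy weight -> noise -> threshold pipeline (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    weights = defaultdict(float)

    # Weighting: each user contributes weight 1 to at most `max_items_per_user`
    # of its items, which bounds any single user's influence on the weight vector.
    for items in user_items.values():
        for item in sorted(items)[:max_items_per_user]:
            weights[item] += 1.0

    # Noise addition: Gaussian noise with `sigma` calibrated to that bounded
    # sensitivity and the target (epsilon, delta).
    # Thresholding: only items whose noisy weight clears the cutoff are released.
    return [item for item, w in weights.items()
            if w + rng.normal(0.0, sigma) >= threshold]

# Example with made-up parameters.
users = {"u1": {"the", "cat"}, "u2": {"the", "dog"}, "u3": {"the"}}
print(basic_partition_selection(users, max_items_per_user=2, sigma=1.0, threshold=2.5))
```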
Adaptive Weighting and the MaxAdaptiveDegree (MAD) Algorithm
Google’s research introduces the first adaptive, parallelizable partition selection algorithm—MaxAdaptiveDegree (MAD)—and a multi-round extension MAD2R, designed for truly massive datasets (hundreds of billions of entries).
Key Technical Contributions
- Adaptive Reweighting: MAD identifies items whose weight sits far above the privacy threshold and reroutes the excess to boost under-represented items. This “adaptive weighting” increases the probability that rare-but-shareable items are revealed, maximizing output utility.
- Strict Privacy Guarantees: The rerouting mechanism maintains the exact same sensitivity and noise requirements as classic uniform weighting, ensuring user-level (ε, δ)-differential privacy under the central DP model.
- Scalability: MAD and MAD2R require only linear work in the dataset size and a constant number of parallel rounds, making them compatible with massive distributed data-processing systems. They do not need to hold all data in memory and support efficient multi-machine execution.
- Multi-Round Improvement (MAD2R): By splitting privacy budget between rounds and using noisy weights from the first round to bias the second, MAD2R further boosts performance, allowing even more unique items to be safely extracted—especially in long-tailed distributions typical of real-world data.
How MAD Works—Algorithmic Details
- Initial Uniform Weighting: Each user spreads a fixed total weight uniformly across their items, which bounds per-user sensitivity.
- Excess Weight Truncation and Rerouting: Items above an “adaptive threshold” have their excess weight trimmed and rerouted proportionally back to contributing users, who then redistribute this to their other items.
- Final Weight Adjustment: Additional uniform weight is added to make up for small initial allocation mistakes.
- Noise Addition and Output: Gaussian noise is added; items above the noisy threshold are output.
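A rough, single-machine sketch of this loop is below. It is heavily simplified: the choice of the adaptive threshold, the caps on rerouted weight that keep per-user sensitivity at exactly one, and the distributed formulation are handled in the paper but omitted here, and `margin` is an assumed illustrative parameter:

```python
import numpy as np
from collections import defaultdict

def mad_sketch(user_items, sigma, threshold, margin=2.0, seed=0):
    """Simplified sketch of MAD-style adaptive reweighting (not the paper's exact algorithm)."""
    rng = np.random.default_rng(seed)
    user_items = {u: sorted(items) for u, items in user_items.items() if items}

    # 1) Initial uniform weighting: each user spreads a unit L2 budget over its items.
    init = {u: 1.0 / np.sqrt(len(items)) for u, items in user_items.items()}
    weights = defaultdict(float)
    for u, items in user_items.items():
        for it in items:
            weights[it] += init[u]

    # 2) Truncate weight above an adaptive threshold and return the excess to the
    #    contributing users, in proportion to what each of them contributed.
    adaptive_threshold = threshold + margin * sigma
    returned = defaultdict(float)
    for it, w in list(weights.items()):
        if w > adaptive_threshold:
            excess = w - adaptive_threshold
            weights[it] = adaptive_threshold
            for u, items in user_items.items():
                if it in items:
                    returned[u] += excess * init[u] / w

    # 3) Each user redistributes its returned excess uniformly over its items that
    #    are still below the adaptive threshold (the paper adds caps here so the
    #    per-user L2 sensitivity stays exactly 1; omitted in this sketch).
    for u, extra in returned.items():
        light = [it for it in user_items[u] if weights[it] < adaptive_threshold]
        for it in light:
            weights[it] += extra / len(light)

    # 4) Noise addition and thresholding, identical to the non-adaptive mechanism.
    return [it for it, w in weights.items()
            if w + rng.normal(0.0, sigma) >= threshold]
```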
In MAD2R, the first-round outputs and noisy weights are used to refine which items should be focused on in the second round, with weight biases ensuring no privacy loss and further maximizing output utility.
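A toy two-round sketch in that spirit is shown below; the budget split into `sigma1`/`sigma2`, the pruning rule `lower_cut`, and the use of basic uniform weighting inside each round (rather than full MAD) are simplifying assumptions, not the paper's exact procedure:

```python
import numpy as np
from collections import defaultdict

def weights_from(user_items, focus=None):
    """Unit-L2 per-user weighting, optionally restricted to a focus set of items."""
    weights = defaultdict(float)
    for items in user_items.values():
        kept = [it for it in items if focus is None or it in focus]
        for it in kept:
            weights[it] += 1.0 / np.sqrt(len(kept))
    return weights

def mad2r_sketch(user_items, sigma1, sigma2, threshold, lower_cut, seed=0):
    """Illustrative two-round selection in the spirit of MAD2R."""
    rng = np.random.default_rng(seed)

    # Round 1: weight, add noise from the first slice of the privacy budget, release.
    noisy1 = {it: w + rng.normal(0.0, sigma1)
              for it, w in weights_from(user_items).items()}
    released = {it for it, nw in noisy1.items() if nw >= threshold}

    # Round 2: drop items already released and items whose round-one noisy weight
    # was far too low, so users concentrate their weight on the remaining candidates.
    focus = {it for it, nw in noisy1.items()
             if it not in released and nw >= lower_cut}
    noisy2 = {it: w + rng.normal(0.0, sigma2)
              for it, w in weights_from(user_items, focus).items()}
    return released | {it for it, nw in noisy2.items() if nw >= threshold}
```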
Experimental Results: State-of-the-Art Performance
Extensive experiments across nine datasets (from Reddit, IMDb, Wikipedia, Twitter, Amazon, all the way to Common Crawl with nearly a trillion entries) show:
- MAD2R outperforms all parallel baselines (Basic, DP-SIPS) on seven out of nine datasets in terms of number of items output at fixed privacy parameters.
- On the Common Crawl dataset, MAD2R extracted 16.6 million out of 1.8 billion unique items (0.9%), but covered 99.9% of users and 97% of all user-item pairs in the data—demonstrating remarkable practical utility while holding the line on privacy.
- For smaller datasets, MAD approaches the performance of sequential, non-scalable algorithms, and for massive datasets, it clearly wins in both speed and utility.
Concrete Example: Utility Gap
Consider a scenario with a “heavy” item (very commonly shared) and many “light” items (shared by few users). Basic DP selection overweights the heavy item without lifting the light items enough to pass the threshold. MAD strategically reallocates, increasing the output probability of the light items and resulting in up to 10% more unique items discovered compared to the standard approach.
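To put hypothetical numbers on this: under Gaussian noise with standard deviation σ, an item of weight w is released with probability 1 − Φ((T − w)/σ), where T is the threshold. With assumed values T = 3 and σ = 1, rerouting enough excess to lift a light item from weight 1.5 to 2.5 raises its release probability from about 7% to about 31%, while a heavy item far above T loses essentially nothing:

```python
import math

def release_probability(weight, threshold, sigma):
    """P(weight + Gaussian(0, sigma) >= threshold)."""
    return 0.5 * math.erfc((threshold - weight) / (sigma * math.sqrt(2.0)))

# Hypothetical numbers: threshold T = 3.0, noise sigma = 1.0.
print(release_probability(1.5, 3.0, 1.0))  # ~0.067: light item under uniform weighting
print(release_probability(2.5, 3.0, 1.0))  # ~0.31: same item after receiving rerouted excess
print(release_probability(8.0, 3.0, 1.0))  # ~1.0: heavy item, far above the threshold
```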
Summary
With adaptive weighting and parallel design, the research team brings DP partition selection to new heights in scalability and utility. These advances ensure researchers and engineers can make fuller use of private data, extracting more signal without compromising individual user privacy.