Synthetic data generation: formal methods for privacy analysis
ABG-127986 | Thesis topic
17/01/2025 | Doctoral contract
- Computer science
- Mathematics
- Data science (storage, security, measurement, analysis)
Topic description
Context and goal
Health data, social networks, electricity consumption... Vast quantities of personal data are collected today by private companies or public organizations. Various legal, monetary, or visibility incentives push data holders to envision sharing versions of the collected datasets that provide both statistical utility and privacy guarantees. Indeed, sharing data at large, e.g., as open data, without jeopardizing privacy, is expected to bring strong benefits (strengthening, e.g., scientific studies, innovation, public policies).
Synthetic data generation is a promising approach. First, synthetic data generation algorithms aim at generating datasets that are as close as possible to the original datasets. Either the synthetically generated data or the generative models trained on the original data could be shared to support elaborate data analysis. Second, substantial progress has been made over the last decade on the privacy guarantees of synthetic data generation algorithms. For example, there exist today synthetic data generation algorithms that satisfy variants of differential privacy, one of the most prominent families of privacy models [2].
However, security is a constant race between attackers and defenders. A large number of attacks exist, and the number keeps growing [5]. As a result, because of the complex environment in which synthetic data generation takes place (e.g., utility needs, diversity of information sources, diversity of data generation algorithms), analyzing the risks remains hazardous even when strong privacy-preserving techniques are used.
The main goal of this PhD thesis is to design an approach based on formal methods that allows data holders to analyze the risks related to their synthetic data publication practices.
The main tasks of the PhD student will be to:
- Study the state of the art on attacks against synthetic data generation algorithms (e.g., membership inference attacks [4, 6]) and on relevant formal methods (e.g., attack-tree-based risk analysis models [3]). We will focus on tabular data and time series.
- Model the full synthetic data generation environment. In particular, this includes capturing the attackers' capabilities (e.g., goals [5], background knowledge, computational resources, sequences of steps), the relationships between attackers, the sources of auxiliary information, and the data sharing practices.
- Design efficient algorithms for finding the attacks that illustrate privacy risks, implement them, and evaluate their performance.
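As an illustration of the kind of attack-tree-based risk analysis mentioned in the tasks above, here is a minimal sketch of the classical bottom-up evaluation of an attack tree with AND/OR gates. It is not the method to be developed in the thesis: the `Node` class, the `min_cost` function, the example attacks, and all cost values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Attack-tree node: a leaf (basic attacker action) or an AND/OR gate."""
    label: str
    gate: str = "LEAF"      # "LEAF", "AND", or "OR"
    cost: float = 0.0       # used for leaves only (hypothetical cost units)
    children: List["Node"] = field(default_factory=list)


def min_cost(node: Node) -> float:
    """Return the cheapest total cost for the attacker to achieve `node`."""
    if node.gate == "LEAF":
        return node.cost
    child_costs = [min_cost(child) for child in node.children]
    if node.gate == "AND":   # every sub-goal must be achieved
        return sum(child_costs)
    if node.gate == "OR":    # any single sub-goal suffices
        return min(child_costs)
    raise ValueError(f"unknown gate: {node.gate}")


# Hypothetical tree: two alternative ways to mount a membership inference attack.
tree = Node("infer membership of a target record", "OR", children=[
    Node("shadow-model attack", "AND", children=[
        Node("obtain auxiliary data", cost=3.0),
        Node("train shadow generators", cost=5.0),
    ]),
    Node("distance-based attack", "AND", children=[
        Node("obtain the target record", cost=4.0),
        Node("download the synthetic dataset", cost=1.0),
    ]),
])

print(min_cost(tree))  # 5.0: the distance-based branch is the cheapest
```

Summing costs at AND gates assumes that the sub-goals are independent and not shared between branches. DAG-based models such as those surveyed in [3] require more careful handling of shared nodes.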
In addition to the core tasks of the project, the successful candidate will also contribute to the organisation of competitions where the privacy guarantees of synthetic data generation algorithms are challenged [1] (see, e.g., the Snake1 challenge (https://snake-challenge.github.io)).
References
[1] Tristan Allard, Louis Béziaud, and Sébastien Gambs. Snake challenge: Sanitization algorithms under attack. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), 2023.
[2] Damien Desfontaines and Balázs Pejó. SoK: Differential privacies. Proceedings on Privacy Enhancing Technologies, 2020(2):288–313, 2020.
[3] Barbara Kordy (Fila), Ludovic Piètre-Cambacédès, and Patrick Schweitzer. DAG-based attack and defense modeling: Don’t miss the forest for the attack trees. Comput. Sci. Rev., 13-14:1–38, 2014.
[4] Hongsheng Hu, Zoran A. Salcic, Lichao Sun, Gillian Dobbie, P. Yu, and Xuyun Zhang. Membership inference attacks on machine learning: A survey. ACM Computing Surveys (CSUR), 54:1–37, 2021.
[5] Ahmed Salem, Giovanni Cherubin, David Evans, Boris Köpf, Andrew Paverd, Anshuman Suri, Shruti Tople, and Santiago Zanella-Béguelin. SoK: Let the privacy games begin! A unified treatment of data inference privacy in machine learning. In Proceedings of the 2023 IEEE Symposium on Security and Privacy (S&P ’23), pages 327–345, 2023.
[6] Antonin Voyez, Tristan Allard, Gildas Avoine, Pierre Cauchois, Élisa Fromont, and Matthieu Simonin. Membership inference attacks on aggregated time series with linear programming. In Proceedings of the 19th International Conference on Security and Cryptography (SECRYPT ’22), 2022.
Host institution and laboratory
This PhD offer is funded by the PEPR Cybersecurity IPoP project (https://files.inria.fr/ipop/) and proposed by the Security and Privacy team (SPICY, https://www-spicy.irisa.fr/) from the IRISA institute (https://www.irisa.fr/en) in Rennes, France. The work will be supervised jointly by Tristan Allard (PhD, HDR, https://people.irisa.fr/Tristan.Allard/), associate professor at the University of Rennes, expert in privacy in data-intensive systems, and Barbara FILA (PhD, HDR, http://people.irisa.fr/Barbara.Fila/), associate professor at INSA Rennes, expert in formal methods for risk assessment.
The successful candidate will be working at IRISA, the largest French research laboratory in the field of computer science and information technologies (more than 850 people). IRISA provides an exciting environment where French and international researchers perform cutting-edge scientific activities in all domains of computer science.
Rennes is located in the western part of France, in the beautiful region of Brittany. From Rennes, you can reach the seaside in about 45 minutes by car and central Paris in about 90 minutes by train. Rennes is a nice and vibrant student-friendly city. It is often ranked as one of the best student cities in France. Rennes is known and appreciated for its academic excellence, especially in the field of cybersecurity, its professional landmark, the quality of its student life, the affordability of its housing offer, its rich cultural life, and much more.
Candidate profile
- The candidate must have obtained, or be about to obtain, a master's degree in computer science or in a related field.
- The candidate must be curious, autonomous, and rigorous.
- The candidate must be able to communicate in English (oral and written). Knowledge of French is not required.
- The candidate must have a strong interest in cybersecurity.
- Skills in machine learning and/or formal methods will be appreciated.