The first real anonymization algorithms were developed in the mid-1990s. These methods worked very well for the next ten to fifteen years, until the beginning of a new era: that of big data. Everybody now generates data from which they can easily be re-identified, and the era of big data leaves no room for anonymization. The proposed solution reverses the problem: instead of sharing an anonymized database, the data are stored on a secure infrastructure with controlled access. Those who wish to use the data cannot access it directly. Instead, they send an algorithm, which is checked and run before being authorized to search the database for the required pieces of data. This secure infrastructure allows researchers, statistical institutes and companies to tap the potential of big data while offering very strong guarantees for the protection of user privacy.
Paris Innovation Review — At Imperial College London, you head a research group dedicated to computational privacy. What is it about?
Yves-Alexandre de Montjoye — A significant part of our research revolves around exploring existing solutions for the protection of privacy from a technical standpoint. We focus on behavioral data that people leave behind, such as mobile phone data, credit card data, Internet searches, browsing history that can be saved by ISPs, and so on. We have two main lines of research. On the one hand, anonymization of data: we study to what extent data can really be anonymized and develop algorithms to break anonymization. On the other hand, we explore what can be learned from data by analyzing it: how can data be processed to learn about people and society, to better understand large-scale behaviors? Big data enables a revolution in the scale at which human beings can be studied.
As part of your research, you have shown the limitations of data anonymization processes. What are they?
Historically, data has been anonymized to find a balance between the protection of privacy and its use for research. For example, anonymized census data made it possible to produce population statistics, like those used by Thomas Piketty in his research or in medical studies, while protecting privacy. The first real anonymization algorithms were developed in the mid-1990s. These methods worked very well over the next ten to fifteen years, until the beginning of a new era: that of big data. From then on, the problem no longer concerned census data (date of birth, age, income) alone but much more numerous and precise data: the websites you visited, every antenna to which your smartphone was connected, the stores where you paid with your credit card, etc. Everybody leaves a real digital footprint. However, as we and other researchers have shown, this footprint can be used to re-identify a person quite easily. For example, we have shown that four data points (each point being a place and a time where a person was located) are enough to re-identify a person in a mobile phone database 95% of the time.
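To make the scale of the problem concrete, here is a toy sketch in Python (invented data, not the study's actual methodology): draw a few spatio-temporal points from someone's trace and check whether anyone else in the dataset matches all of them.

```python
import random

# Toy mobility dataset: user -> set of (antenna, hour) points.
# Real studies use months of operator data; this is purely illustrative.
traces = {
    "u1": {("A", 9), ("B", 12), ("C", 18), ("D", 20), ("E", 22)},
    "u2": {("A", 9), ("B", 13), ("E", 18), ("D", 21), ("F", 23)},
    "u3": {("F", 9), ("B", 12), ("C", 18), ("G", 22), ("A", 23)},
}

def is_unique(user: str, k: int) -> bool:
    """Draw k random spatio-temporal points from `user`'s trace and check
    whether they single out that user in the whole dataset."""
    points = random.sample(sorted(traces[user]), k)
    matches = [u for u, trace in traces.items() if all(p in trace for p in points)]
    return matches == [user]

# Estimate how often k points are enough to re-identify someone.
k = 4
draws = [is_unique(u, k) for u in traces for _ in range(200)]
print(f"{sum(draws) / len(draws):.0%} of draws of {k} points are unique")
```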
Several techniques have been invented to further protect the data and make re-identification more difficult: adding “noise”, making the information less precise… But at the end of the day, it only makes the task a little more difficult: collecting a few additional data points is generally enough to uniquely re-identify a person. This has been proven for phone data and credit card data, but also for Internet browsing data. We saw this last year with supposedly anonymized browser data that were sold to a company and ended up revealing the pornographic preferences of a German judge. In conclusion, while anonymization worked very well in the past, it no longer offers the same guarantees today, in a world that records your behavior hundreds of thousands of times a day, i.e. a world of big data.
Does this mean that we should give up on using the potential of big data, once and for all?
Unlike other people, we do not believe that all data collection should be banned. Not using these data is not a socially acceptable solution, from our point of view. Neither is the status quo, in which data are simply anonymized and shared, because the risk of re-identification is real. We believe that a satisfactory solution must be found, both for the use of big data and the protection of privacy, because data can be used in many positive ways. The solution we offer in England is to keep the basic principle, but instead of sharing an entire “anonymized” database (i.e. with added noise, removed people, partially modified information, etc.), to keep the data on a secure infrastructure with secured access. A number of mechanisms are implemented to ensure that those who access it do exactly what they planned to do. In fact, they will not be able to access the data directly. Instead, they will send us an algorithm, which we will check and run before authorizing it to search the database for the required pieces of data. We then take the results, aggregate them in such a way that no single individual can be identified, and add a little noise, before sending back the answer. The problem shifts from transmitting anonymous data to anonymizing the use of data. These secure infrastructures allow researchers, statistical institutes or companies to use the potential of big data while offering very strong guarantees on the protection of user privacy.
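As a rough sketch of how such a gate might work (hypothetical Python, not the actual platform code; the threshold and noise values are assumptions), only aggregates that cover enough people, with a little noise added, ever leave the infrastructure:

```python
import random

MIN_GROUP_SIZE = 50   # assumed threshold: groups smaller than this are never released
NOISE_SCALE = 0.5     # assumed magnitude of the noise added to each released figure

def run_vetted_script(script, records):
    """Run an analyst's approved script inside the secure infrastructure.
    The script sees the raw records only here; only the gated output below
    ever leaves the platform."""
    aggregates = script(records)        # expected shape: {group: (value, group_size)}
    released = {}
    for group, (value, size) in aggregates.items():
        if size < MIN_GROUP_SIZE:
            continue                    # too few people behind the number: suppress it
        released[group] = value + random.uniform(-NOISE_SCALE, NOISE_SCALE)
    return released                     # aggregated, noised answers only
```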
Can you give us a concrete example of this new method?
Take the example of INSEE, the French national statistics institute: suppose they launch a study on household consumption baskets and wish to access data on bank customers. Following the “traditional method”, the bank will send them a large database with names and credit card numbers erased, postal codes instead of addresses, added noise, etc. The problem is that this data is of much lower quality and can still be used to re-identify persons, as we have shown. Instead, following the “new method”, the bank will keep the data and, via a secure infrastructure, allow INSEE to send scripts that will calculate all the necessary information from the data. For example, the script will be able to identify all the expenses made by customers in food stores and calculate the percentage these represent in total expenses. The data will then be aggregated: in this region, people spend on average 12% of their income in food stores; in that one, 16%; etc. These mechanisms will guarantee that, whichever question INSEE asks, it will never be possible to re-identify a person from the data sent back in response.
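The script INSEE submits could look roughly like the following (illustrative Python only; the field names and the "food_store" category are assumptions):

```python
from collections import defaultdict

def food_spend_share_by_region(transactions):
    """Analyst script sent to the bank's platform: for each region, compute the
    share of total card spending made in food stores, together with the number
    of customers behind the figure so the gate can check the group size.
    `transactions` is assumed to be an iterable of dicts with 'customer_id',
    'region', 'category' and 'amount' fields."""
    food = defaultdict(float)
    total = defaultdict(float)
    customers = defaultdict(set)
    for t in transactions:
        total[t["region"]] += t["amount"]
        customers[t["region"]].add(t["customer_id"])
        if t["category"] == "food_store":
            food[t["region"]] += t["amount"]
    return {
        region: (food[region] / total[region], len(customers[region]))
        for region in total
    }
```

The platform would then run something like run_vetted_script(food_spend_share_by_region, transactions) from the sketch above and send back only the noised regional percentages, never the underlying transactions.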
But is re-identification really impossible?
With this approach, it is extremely unlikely that anyone could re-identify a person and learn something about them. But again, as with banking security, it is never completely impossible to rob a bank, only extremely difficult and, moreover, illegal. With OPAL, we implement all sorts of safeguards, including legal contracts and controls, to guarantee anonymity. In addition, the data itself is never shared. At the end of the day, it is much less risky than the old methods of anonymization. We have even implemented a mechanism of double pseudonymization. When the data reach the platform, they are pseudonymized a first time, and then a second time when they are analyzed by the algorithm. In this way, if an algorithm asks the same question twice, the pseudonyms it receives for a specific person and their contacts will not be the same the second time as the first.
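One way to picture this double pseudonymization (a simplified sketch, not OPAL's actual implementation) is a keyed hash applied twice, with the second key drawn fresh for every query, so two runs of the same algorithm see different, unlinkable pseudonyms for the same people:

```python
import hashlib
import hmac
import secrets

PLATFORM_KEY = secrets.token_bytes(32)   # long-lived key applied when data enter the platform

def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def pseudonyms_for_query(identifiers: list[str]) -> list[str]:
    """First pass with the platform key (done at ingestion), second pass with a
    key generated for this query only, so the pseudonyms cannot be linked
    across queries."""
    query_key = secrets.token_bytes(32)  # regenerated for every query
    return [pseudonymize(pseudonymize(i, PLATFORM_KEY), query_key) for i in identifiers]
```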
Is this new method spreading?
Absolutely, yes. Mastercard is using it internationally for credit card data. Uber uses it as part of its partnership with cities: Uber Movement. In France, the Secure Data Access Center (CASD) also works on a controlled-access mode for data, but their system is much less automated than the solutions implemented by Mastercard, Uber or even us – mainly because of the heterogeneity of the data they process. For our part, we are working on the OPAL project, which brings together various public, private and academic partners, including Orange and Telefonica, pioneers in this field, as well as MIT. We are setting up a secure infrastructure with a completely transparent mechanism for guaranteeing the protection of privacy. We hope to have a prototype in place within a year, a fairly short time for a research cycle. Its users will be the national statistical institutes, which are both independent and have the necessary statistical expertise.
What are the technical obstacles to setting up such a secure infrastructure?
There are two main challenges. First of all, a research problem: differential privacy is still too theoretical. It must be implemented as part of a system which both preserves privacy and is really usable. The second problem is inherent to the development and financing of any complex project. Building such a secure system requires pooling expertise in many very different areas, as well as innovative financing that is harder to find than for a conventional research project.
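The building block he refers to, differential privacy, can itself be sketched in a few lines; below is the textbook Laplace mechanism for a counting query (a standard technique, not the project's specific system). The research challenge lies less in the mechanism than in engineering it into a complete, usable platform.

```python
import random

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, sampled as the difference of two exponential draws."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(flags: list[bool], epsilon: float) -> float:
    """Differentially private count. A counting query has sensitivity 1 (adding
    or removing one person changes the count by at most 1), so Laplace(1/epsilon)
    noise makes the released answer epsilon-differentially private."""
    return sum(flags) + laplace_noise(1.0 / epsilon)

# Example: a noisy count of customers who bought anything in a food store this month.
noisy_total = dp_count([True, False, True, True], epsilon=1.0)
```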
Is this type of system compatible with the General Data Protection Regulation (GDPR), effective as from May 25, 2018, in all Member States of the European Union?
We are still awaiting clarifications, but it will almost certainly be compatible. This type of approach fully addresses the challenge of finding a balance between the use of big data, including for the public good, and the protection of user privacy.
Why is it so important to find this balance?
Many fantastic things can be done with big data, in terms of public statistics, public health research, urban planning… The potential applications are very broad: from studying the spread of infectious diseases based on mobility data, to better planning public transport or roads in a city by analyzing the movements of its inhabitants. It certainly gives a lot of room to improve public policies in the future.
On the same topic: An ethical governance of Big Data