How to Analyze Data That You Can’t See

By Suzette Norris

Businesses want their data scientists to perform analytics on the information at their disposal, but sensitive information about people must be stored securely.

Clockwise, from upper left: Ersin Uzun, Alejandro Brito, Shantanu Rane, and Julien Freudiger.

Data is the crux of the Internet economy. From social network accounts to wearable devices, large amounts of data are being collected and processed to analyze and predict human behavior. Even as our data-driven economy struggles to cope with the volume and diversity of information, concerns are emerging over the security of big data and the privacy of individuals.

Privacy concerns are exacerbated by the spate of recent breaches. It is imperative to ensure that sensitive information about people is stored securely, and algorithms and protocols operating on that information do not reveal sensitive details to unauthorized entities. At the same time, businesses want their data scientists to perform expressive and actionable analytics on the information at their disposal. This is a difficult balancing act.

Four researchers (right) have been working on this problem. Julien Freudiger, Shantanu Rane, Alejandro Brito, and Ersin Uzun, who work at PARC, a Xerox company, wrote a paper that discusses the challenges in secure and private data analytics, which I present here in a Q&A format.

Q: Let’s start with a paradox: Many security and privacy techniques protect confidential information by destroying patterns in the data. How is it then possible to extract useful insights from data and preserve privacy at the same time?
A: We have been working on techniques that enable analytics on encrypted or anonymized data. We use a kind of encryption that allows us to perform certain arithmetic operations on the data without decrypting it. For example, we can compute the average salary in an organization without decrypting any individual salaries. By ensuring that potentially sensitive information is never decrypted, we try to mitigate the negative consequences of data leaks.
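
To make the salary example concrete, here is a minimal Python sketch of additively homomorphic encryption using a toy version of the Paillier scheme. It is an illustration only, not the researchers' implementation: the article does not name the scheme they use, so Paillier is an assumption, and the key size, function names, and salary figures are all hypothetical. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
import secrets
from math import gcd

# Toy key built from two well-known Mersenne primes. Far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                                      # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // gcd(P - 1, Q - 1)   # private key: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)                           # private key: LAM^-1 mod N (valid for g = N + 1)

def encrypt(m):
    """Encrypt integer m (0 <= m < N) under the public key N, with g = N + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1       # fresh randomness for every ciphertext
        if gcd(r, N) == 1:
            break
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2

def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % N2

def decrypt(c):
    """Recover the plaintext using the private values LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N

# Hypothetical salaries; each one is encrypted before it leaves its owner.
salaries = [52000, 61000, 75000, 48000]
ciphertexts = [encrypt(s) for s in salaries]

# The analyst works on ciphertexts only and never sees an individual salary.
encrypted_total = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_total = add_encrypted(encrypted_total, c)

# Only the key holder decrypts, and only the aggregate.
total = decrypt(encrypted_total)
average = total / len(salaries)
print(total, average)
```

In this sketch the only value ever decrypted is the sum, so a leak of the analyst's working data would expose ciphertexts rather than individual salaries.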

Q: Can you give us an example?
A: Let’s say two companies want to find out whether they have customers in common. Although they want to work with each other, neither company is confident enough to share sensitive data. We can help these organizations determine how similar their datasets are, without revealing any private information. In this scenario, the two companies exchange encrypted information that depends on sensitive data, such as their customers’ social security numbers. Our algorithm guarantees that the companies cannot decrypt each other’s encrypted information, but each can check whether its own encrypted information matches that of the other. A match implies that the underlying social security numbers are the same. In this way, the companies can identify common customers, or just the number of such customers, without revealing sensitive customer information.
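
The matching step can be sketched in a few lines of Python. This is one well-known way to compare identifiers without revealing them (hashing each value and blinding it with secret exponents that commute), offered as an assumption about how such a protocol might look rather than as the researchers' actual algorithm. The modulus, function names, and social security numbers are hypothetical, and the parameters are toy-sized.

```python
import hashlib
import secrets
from math import gcd

P = 2**127 - 1  # Mersenne prime used as a toy modulus; real systems use vetted groups

def hash_to_group(item):
    """Map an identifier to an integer in [1, P-1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_secret():
    """Pick a secret exponent coprime to P-1, so blinding is one-to-one."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if gcd(k, P - 1) == 1:
            return k

def blind(items, secret):
    """Hash each identifier and raise it to the owner's secret exponent."""
    return [pow(hash_to_group(x), secret, P) for x in items]

def reblind(values, secret):
    """Apply a second secret exponent to values already blinded by the other party."""
    return [pow(v, secret, P) for v in values]

# Hypothetical customer identifiers held by each company.
company_a = ["123-45-6789", "987-65-4321", "555-12-3456"]
company_b = ["987-65-4321", "111-22-3333", "555-12-3456"]
a_key, b_key = new_secret(), new_secret()   # each key stays with its owner

a_once = blind(company_a, a_key)            # sent from A to B
b_once = blind(company_b, b_key)            # sent from B to A
a_twice = reblind(a_once, b_key)            # B returns these in the same order
b_twice = set(reblind(b_once, a_key))       # computed locally by A

# (h^a)^b == (h^b)^a, so doubly blinded values match exactly when the inputs match.
common_for_a = [ssn for ssn, v in zip(company_a, a_twice) if v in b_twice]
print(len(common_for_a), common_for_a)
```

If the companies only want the number of common customers, the party returning `a_twice` could shuffle it first, so that the other side can count matches without learning which records matched.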

Sharing data this way helps companies conduct a risk-benefit analysis and privately identify “good” collaboration partners prior to actually sharing any data. Overall, our approach preserves data privacy while making it possible to extract useful business information from it.

We also looked at how companies can securely exchange information about potential cyber attacks without taking on additional privacy risks. Our experiments showed an increase of more than 90 percent in the detection and prediction accuracy of cyber attacks, even with a conservative approach based on secure sharing of firewall logs.

Q: But is it possible for one company to test whether another company’s data is worth sharing or purchasing in the first place?
A: Yes. We have also developed protocols that allow us to evaluate the quality of data held by an untrusted party, without actually looking at their data. For example, a simple but crucial data quality metric is completeness, which measures how many values are missing from a database. Our algorithm looks for specific patterns that indicate missing data, such as encrypted NULL values. Of course, as the data is encrypted, we cannot tell which values are incomplete. Nevertheless, by using a special kind of encryption together with simple encrypted-domain arithmetic, our method is able to compute the total number of incomplete values.
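
The counting idea can be illustrated with a short Python sketch. It assumes each record carries an encrypted flag (1 if the value is missing, 0 otherwise) and uses exponential ElGamal, an additively homomorphic scheme chosen here purely for illustration; the article does not say which encryption the researchers use. The column values, names, and toy parameters are hypothetical, and the sketch leaves out who holds the key and how honest flagging is enforced.

```python
import secrets

P = 2**127 - 1   # Mersenne prime; toy-sized modulus for illustration only
G = 3            # base element of the group

def keygen():
    """Return (private key, public key)."""
    x = secrets.randbelow(P - 2) + 1
    return x, pow(G, x, P)

def encrypt(pk, m):
    """Exponential ElGamal: the message sits in the exponent of G."""
    r = secrets.randbelow(P - 2) + 1
    return pow(G, r, P), (pow(G, m, P) * pow(pk, r, P)) % P

def add(c, d):
    """Component-wise multiplication adds the underlying plaintexts."""
    return (c[0] * d[0]) % P, (c[1] * d[1]) % P

def decrypt_small(sk, c, limit):
    """Recover a small plaintext in [0, limit] by trying each candidate."""
    g_m = (c[1] * pow(c[0], P - 1 - sk, P)) % P   # strips the mask, leaving G^m
    for m in range(limit + 1):
        if pow(G, m, P) == g_m:
            return m
    raise ValueError("plaintext outside expected range")

sk, pk = keygen()

# Hypothetical column with missing entries; one encrypted flag per record.
column = [72000, None, 65000, None, None, 58000]
flags = [encrypt(pk, 1 if value is None else 0) for value in column]

# The evaluator sees only ciphertexts and cannot tell which flags are 1.
encrypted_count = flags[0]
for f in flags[1:]:
    encrypted_count = add(encrypted_count, f)

# Only the total is ever decrypted.
missing = decrypt_small(sk, encrypted_count, len(column))
print(missing)
```

Because every flag is encrypted with fresh randomness, two flags with the same value look unrelated, which is why the evaluator can total them without learning which records are incomplete.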

Q: Are there other applications where your research can be used, other than data quality measurement and secure sharing?
A: We see a broad set of new business opportunities emerging from the ability to do arithmetic on encrypted data. With this capability, it becomes possible to process sensitive information with a significantly reduced risk of data leaks. For example, we’ve also worked on methods for secure aggregation. Essentially, a number of organizations contribute encrypted data to an aggregator. The aggregator can only decrypt the probability distribution of all contributions, but finds out nothing else about the encrypted data. This idea is particularly appealing for statistical analysis of sensitive information, such as employee bonuses, health conditions, and working habits.
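
A simplified way to picture secure aggregation is with random masks that cancel out in the total. The Python sketch below uses that masking approach as a stand-in; it is not the encryption-based method the researchers describe, and it recovers only a sum rather than a full distribution. The modulus, function names, and bonus figures are hypothetical, and a real deployment would have each pair of contributors agree on their shared mask privately instead of generating all masks in one place.

```python
import secrets

M = 2**64  # all arithmetic is done modulo M; the true total must stay below M

def make_masks(n):
    """Build n masks that sum to zero modulo M.

    Each pair (i, j) shares one random value: i adds it and j subtracts it,
    so every shared value cancels once all contributions are combined.
    """
    masks = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            shared = secrets.randbelow(M)
            masks[i] = (masks[i] + shared) % M
            masks[j] = (masks[j] - shared) % M
    return masks

def mask_value(value, mask):
    """What a contributor sends to the aggregator."""
    return (value + mask) % M

def aggregate(masked_values):
    """The aggregator sums what it receives; the masks cancel."""
    return sum(masked_values) % M

# Hypothetical bonuses from four contributing organizations.
bonuses = [1200, 800, 1500, 500]
masks = make_masks(len(bonuses))
masked = [mask_value(b, m) for b, m in zip(bonuses, masks)]

total = aggregate(masked)
print(total)
```

Each masked value on its own looks like a random 64-bit number, so the aggregator learns the total but not any single contribution.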

Julien, Shantanu, Alejandro, and Ersin discuss what they believe is the biggest challenge in security and privacy research, as well as what the future holds for their work. Read “Share Your Data Without Showing It” on Real Business, a website from Xerox that provides ideas and information for decision-makers in business and government.
