Author: 

Main Author(s): Andre Carvalho, Altigran Silva

Additional Authors: Christof Fetzer

Focus Area: 

Selected Topic: EU-Brazil Common standards

Who stands to benefit and how: 

Cloud users, cloud providers, and database systems researchers.

Position Paper: 

In this age of ever-growing use of cloud resources for computation, the need for trustworthiness guarantees from data management systems is more important than ever. Datasets keep growing, making it increasingly impractical to process data locally, especially in resource-intensive tasks such as data mining. Moreover, machine learning techniques such as deep learning usually entail manipulating large datasets.

In this context, and with the entry into force of the GDPR, there is a well-deserved and urgent need for the protection of sensitive data. Sensitive data is any data that must be protected from unauthorized access to safeguard privacy or security. This concept applies to individual data, such as medical records, social security numbers, and biometric data, and even to business-related information.

Protecting sensitive data is an essential task in data management. It is even more important in cloud-based environments, where the data owner has no control over who has physical access to the cloud infrastructure. The cloud provider itself must therefore be considered a serious attack vector.

For instance, an attacker with physical access to the cloud can potentially read sensitive data, not only from data files on disk (which can be protected by encryption), but also by scanning memory. This means that whenever raw sensitive data is loaded into the main memory of a cloud machine, it is vulnerable to attack.

Moreover, even a non-malicious user can inadvertently expose private data by performing queries that do not take into account the privacy of the individuals depicted in the results, releasing sensitive information to the public. An example of such exposure was the AOL search data leak, in which supposedly anonymized data was released to the public; many queries contained personally identifiable information, which led to the identification of a number of real users behind the search logs. The scenario can be even more damaging if the leaked data consists of, for instance, medical records.

Data protection in this context can be seen as having two axes: data access and privacy. Data access means ensuring that no unauthorized party may access sensitive data in any form, while privacy means preventing personally identifiable information from appearing in the results of queries over the sensitive data, by properly anonymizing them.

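To make the privacy axis concrete, the following is a minimal sketch (in Python, with hypothetical field names) of the kind of anonymization a data layer could apply before releasing query results: direct identifiers are suppressed and a quasi-identifier is generalized.

    # Hypothetical field names; real privacy policies would be far richer.
    DIRECT_IDENTIFIERS = {"name", "social_security_number"}

    def generalize_age(age: int) -> str:
        """Replace an exact age with a 10-year bucket, e.g. 34 -> '30-39'."""
        low = (age // 10) * 10
        return f"{low}-{low + 9}"

    def anonymize_row(row: dict) -> dict:
        out = {}
        for field, value in row.items():
            if field in DIRECT_IDENTIFIERS:
                continue  # suppress direct identifiers entirely
            elif field == "age":
                out[field] = generalize_age(value)  # generalize quasi-identifier
            else:
                out[field] = value
        return out

    print(anonymize_row({"name": "Alice", "age": 34, "diagnosis": "flu"}))
    # {'age': '30-39', 'diagnosis': 'flu'}
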
The main challenge when dealing with sensitive data in the cloud is that it must never be readable by a third party in the cloud system. This means that at no moment may the sensitive data be present in its raw, readable form until it reaches the destination user. While this poses no problem for simple object retrieval systems, many systems require more sophisticated processing of the data, which usually involves manipulating it in memory. Traditional DBMSs may offer the option to store data in encrypted form, but query processing is still done over the raw data.

One solution to this problem found in the literature is the use of techniques for querying encrypted data. The main (and obvious) advantage of these methods is that there is no need to decrypt the sensitive data to answer such queries. However, these techniques mostly focus on simple operators and keyword matching, which may not be enough for more advanced scenarios, such as most workloads that require a relational DBMS.

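As an illustration of the simplest form of such techniques, the sketch below (plain Python; not part of our proposal) stores deterministic HMAC tokens instead of plaintext values, so that equality matching works without decryption. Even this simple scheme leaks equality patterns, and richer operators (ranges, joins, aggregation) require considerably more machinery.

    import hashlib
    import hmac

    # The key is held by the data owner and never reaches the cloud.
    KEY = b"key-held-by-the-data-owner"

    def token(value: str) -> str:
        """Deterministic token: equal plaintexts yield equal tokens."""
        return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

    # The cloud stores only tokens, never the plaintext values.
    index = {token("alice@example.com"): "record-42"}

    # To query, the client tokenizes the search term and matches tokens.
    print(index.get(token("alice@example.com")))  # record-42
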
In 2015, Intel introduced the Software Guard eXtensions (SGX) in its processors. SGX enables the use of enclaves: runtime environments whose memory is encrypted and thus unreadable by outside attackers, including those with physical access to the machine. We believe that such enclaves can be a solution against memory-reading attacks on cloud servers.

In the scope of the ATMOSPHERE project (atmosphere-eubrazil.eu), we propose a solution to this problem: the Atmosphere Data Layer (DaLay), a secure data layer whose objective is to ensure that sensitive data is never present in its raw, readable state in the cloud infrastructure. DaLay mediates all data access requests made by users or data processing services to some target sensitive database management service (TDBMS) that fits the application. Its purpose is to isolate the data stored in the TDBMS from the clients accessing it, while also implementing the access control mechanisms for the data. Thus, it grants data requests only to authorized parties, while also enforcing the applicable privacy policies.

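The sketch below illustrates this mediation flow at a high level. All class and method names are hypothetical; they only depict the design, not DaLay's actual interfaces.

    class Policy:
        """Stand-in for curator-defined access and privacy rules."""
        def allows(self, query: str) -> bool:
            return "ssn" not in query.lower()  # toy rule: never touch the SSN column

        def anonymize(self, rows):
            return [{k: v for k, v in row.items() if k != "name"} for row in rows]

    class TDBMS:
        """Stand-in for the target sensitive DBMS running inside the enclave."""
        def execute(self, query: str):
            return [{"name": "Alice", "age": 34, "diagnosis": "flu"}]

    class DaLay:
        def __init__(self, policies: dict, tdbms: TDBMS):
            self.policies = policies  # kept encrypted, managed inside the enclave
            self.tdbms = tdbms

        def handle(self, user: str, query: str):
            policy = self.policies.get(user)
            if policy is None or not policy.allows(query):
                raise PermissionError("request violates the access policy")
            rows = self.tdbms.execute(query)  # executed in isolation
            return policy.anonymize(rows)     # privacy policy applied before release

    dalay = DaLay({"bob": Policy()}, TDBMS())
    print(dalay.handle("bob", "SELECT age, diagnosis FROM patients"))
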
All data accesses are handled by DaLay, with the TDBMS running in isolation inside an enclave and all connections using end-to-end encryption whose keys are known only inside the enclave, ensuring that the TDBMS does not accept requests made in any other way. This protects the data against attacks that rely on physical access to the cloud system, as well as man-in-the-middle attacks.

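On the client side, this could look like the following sketch: the connection is trusted only if the server presents a certificate whose private key is assumed never to leave the enclave. The host name, port, and file name are hypothetical.

    import socket
    import ssl

    # Trust is anchored in the pinned enclave certificate, not in a public CA.
    ctx = ssl.create_default_context(cafile="enclave-cert.pem")
    ctx.check_hostname = False  # sketch only: we pin the certificate itself

    with socket.create_connection(("dalay.example.org", 8443)) as sock:
        with ctx.wrap_socket(sock, server_hostname="dalay.example.org") as tls:
            tls.sendall(b"authenticated data request")
            print(tls.recv(4096))
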
In turn, DaLay forwards to the TDBMS only the requests that satisfy the access and privacy policies defined by a curator. The policies are managed inside the enclave and stored in encrypted form.

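A minimal sketch of such encrypted policy storage, assuming the key is sealed to the enclave and available only inside it (using the Python cryptography package; the policy format shown is illustrative):

    from cryptography.fernet import Fernet

    # In practice the key would be derived or unsealed inside the enclave;
    # generating it here is only for illustration.
    enclave_key = Fernet.generate_key()
    f = Fernet(enclave_key)

    policy = b'{"user": "alice", "allow": ["SELECT"], "anonymize": ["name"]}'
    stored = f.encrypt(policy)  # only this ciphertext ever touches the disk

    # Inside the enclave, the policy is decrypted before each access decision.
    assert f.decrypt(stored) == policy
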
We believe that this approach has clear advantages: it provides data encryption at all stages of query processing, and it is agnostic to the data management system deployed, so it could in principle be applied to most DBMSs. We are currently developing DaLay and will soon release it to the community.

Year: 
2018