The island-wide household survey that we recently conducted in Sri Lanka is a key achievement of the “AI4COVID” project. Involving roughly 11,000 individuals from more than 3,000 households, the survey sought to capture the often unspoken impact of the COVID-19 pandemic on diverse groups, such as women and children. More broadly, the survey was conducted with a view to providing open access to the resulting dataset for research communities to leverage. Underpinning our data collection process was the issue of privacy, which was the impetus for our recently organized workshop, titled “Privacy and Security Implications of Open HealthCare Data”. Held in Sri Lanka, the workshop was facilitated by Muthucumaru Maheswaran, associate professor at the School of Computer Science and Department of Electrical and Computer Engineering at McGill University. The topics discussed centered on open data and privacy, as well as the importance of data anonymization. In this blog, we touch on some of the key takeaways.
Open Data Policy
Open data can be freely used, re-used, and redistributed, on the basis of an open license applied to an open work. Open licenses grant the permissions required for use, redistribution, modification, compilation, separation, and application for different purposes, whereas an open work provides clear license terms, easy access, and machine readability. One of the key benefits of open data is that it prolongs the life of the data beyond a project's timeline. For this to work, the data needs to be self-sustaining, self-describing, and self-protecting:
Self-sustaining means the data must exist and remain useful without a supporting project. For example, the household survey collected information from three different periods during the pandemic. Because the dataset provides a comprehensive view of the impact of COVID-19 across socioeconomic factors, analyses of its findings should not require secondary data to complement them.
Self-describing means the data should be well documented, use a well-defined format, and be easy to process. Accordingly, the survey dataset is structured as a panel that maps individuals to their attributes.
Self-protecting means the data protects itself, for example through smart contracts. In addition, necessary measures must be taken to guarantee that respondents' identities are secured before the dataset is published openly.
Privacy by Design
When speaking about open data privacy, Privacy by Design (PbD) is critical to protecting the privacy of research participants within a dataset. Below we outline the seven foundational principles, formulated by Ann Cavoukian (former Information and Privacy Commissioner of Ontario, Canada), that can guide this integration throughout a research project.
1. Proactive not Reactive; Preventative not Remedial
PbD anticipates and prevents privacy-invasive events before they occur; it is characterized by proactive rather than reactive measures. PbD does not wait for privacy risks to materialize, nor does it offer remedies for resolving privacy infractions after the fact.
2. Privacy as the Default Setting
PbD delivers the maximum degree of privacy by ensuring that personal data is automatically protected in any IT system or business practice. An individual's privacy remains intact without any action on their part, because protection is a fundamental characteristic of the system, by default.
3. Privacy Embedded into Design
Privacy is embedded into the design and architecture of IT systems and business practices. As a result, it becomes an essential component of the core functionality being delivered rather than an add-on, and it is vital that the privacy measures do not degrade system performance.
4. Full Functionality
PbD seeks to accommodate all legitimate interests and objectives in a “win-win” manner, rather than through a dated zero-sum approach built on unnecessary trade-offs. PbD avoids the pretence of false dichotomies, such as privacy versus security, demonstrating that it is both possible and far more desirable to have both.
5. End-to-End Security
Incorporating PbD into a system before the first element of information is collected extends security over the entire lifecycle of the data, from start to finish. All records are retained securely and then destroyed securely, in an orderly and timely manner, at the end of the process. PbD thus guarantees secure, end-to-end lifecycle management of information.
6. Visibility and Transparency
PbD seeks to assure all relevant and affected entities that any business practice or technology involved operates according to its stated promises and objectives, subject to independent verification. Visibility and transparency of the components and operations of an information system are therefore essential, and should be provided to users and providers alike.
7. Respect for User Privacy
Most importantly, PbD requires architects and operators to keep the interests of the individual uppermost, through measures such as strong privacy defaults, appropriate notice, and user-friendly options.
Source: A systematic literature review on privacy by design in the healthcare sector; Semantha et al., 2020
Data Anonymization
Data anonymization is intended to protect private or sensitive information within a dataset by erasing or encrypting the identifiers that connect an individual to the stored data. It is especially important given the sensitivity surrounding health data, which our project interacts with regularly, whether during data collection or analysis. There are multiple ways to anonymize data, such as masking, generalization, pseudonymization, k-anonymization, and differential privacy. The main anonymization techniques discussed in the workshop were k-anonymization and differential privacy.
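Of the techniques listed above, pseudonymization is perhaps the simplest to illustrate. As a minimal sketch (the key, identifier format, and field names are hypothetical, not from the actual survey pipeline), a keyed hash can replace a direct identifier with a stable pseudonym, so records from different survey waves can still be linked without exposing the raw identifier:

```python
import hashlib
import hmac

# Hypothetical setup: the data custodian keeps SECRET_KEY private and never
# publishes it with the dataset; without the key, pseudonyms cannot be
# regenerated from (or matched back to) the original identifiers.
SECRET_KEY = b"replace-with-a-secret-held-by-the-custodian"

def pseudonymize(identifier: str) -> str:
    """Map a direct identifier (e.g. a national ID) to a stable pseudonym."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# The mapping is deterministic, so the same respondent receives the same
# pseudonym in every survey wave, preserving the panel structure.
print(pseudonymize("198512345678"))
```

Note that pseudonymization alone is reversible by anyone holding the key, which is why it is usually combined with the stronger techniques discussed below.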
k-Anonymization can be thought of as the ability to “hide in the crowd” of a pool of personal records. The concept of k-anonymity was first introduced in 1998 in the context of information security and privacy. Its principle is to prevent contributors from being identified, by grouping together records that share similar attributes: when an individual's data is pooled into a larger group, the information in that group could belong to any single contributor in the pool, masking the identity of the participant who actually supplied it.
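To make the “hide in the crowd” idea concrete, here is a small sketch (the records, field names, and band width are invented for illustration) that generalizes exact ages into bands and then measures k as the size of the smallest group sharing the same quasi-identifier values:

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a coarser band, enlarging each 'crowd'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def k_anonymity(records, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(counts.values())

# Hypothetical survey rows; (age_band, district) act as quasi-identifiers.
records = [
    {"age": 34, "district": "Kandy"},
    {"age": 36, "district": "Kandy"},
    {"age": 31, "district": "Kandy"},
    {"age": 52, "district": "Galle"},
    {"age": 57, "district": "Galle"},
]
for r in records:
    r["age_band"] = generalize_age(r["age"])

print(k_anonymity(records, ["age_band", "district"]))  # → 2
```

Here the exact ages would make several rows unique, but after banding, every row is indistinguishable from at least one other: the table is 2-anonymous. Widening the bands (or coarsening the district) raises k at the cost of analytical precision.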
Differential privacy (DP) is a mechanism for sharing information about a dataset with the public by describing the patterns of groups within the dataset while withholding information about the individual respondents. The idea behind DP is that if the effect of changing any single record in the database is small enough, then the result of a query cannot be used to infer an alarming amount of information about any particular individual, which preserves privacy. In other terms, DP limits the disclosure of personal information about records in the database by acting as a constraint on the algorithms used to publish aggregate information from a statistical database. For example, government agencies incorporate DP algorithms to publish statistical aggregates of demographic data while ensuring the privacy of survey respondents, and companies use DP algorithms to collect usage behaviour while limiting the amount of information exposed even to internal analysts.
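The classic way to achieve DP for a counting query is the Laplace mechanism: because adding or removing one person changes a count by at most 1 (its sensitivity), adding Laplace noise with scale 1/ε makes the released value ε-differentially private. A minimal sketch, with hypothetical records and query:

```python
import random

def laplace_noise(scale: float) -> float:
    # A Laplace(0, scale) sample is the difference of two independent
    # exponential samples, each with mean `scale`.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(records, predicate, epsilon: float) -> float:
    """Release a count under epsilon-DP; a counting query has sensitivity 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon)

# Hypothetical query: how many respondents reported a COVID-19 infection?
records = [{"infected": True}] * 40 + [{"infected": False}] * 60
noisy = dp_count(records, lambda r: r["infected"], epsilon=0.5)
print(noisy)  # close to 40, but perturbed so no single record is exposed
```

A smaller ε means more noise and stronger privacy; a larger ε means a more accurate release. Choosing ε is exactly the "bottleneck" trade-off described above.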
A key objective of the AI4COVID project in Sri Lanka is to provide the necessary tools to authorities for better decision making and policy formulation. In this light, the University of Peradeniya (the lead organization involved in this project) signed a Memorandum of Understanding (MOU) with the Ministry of Health of Sri Lanka, in particular to give the project contributors access to available medical data in government databases. In addition, this formal understanding allows the University of Peradeniya to disseminate knowledge derived from the surveys, discussions, and interviews conducted throughout the AI4COVID project. Notably, this initiative provides access to data that is not openly or publicly available, which can be used to produce more meaningful and complete results and insights. A scoping workshop was conducted with officials from the Ministry of Health to discuss the outputs and outcomes of the project and to identify areas for collaboration with the Ministry to co-design interventions.
The signing of the Letter of Intent between the University of Peradeniya and the Ministry of Health. From left to right: Prof. M.D. Lamawansa, the Vice-Chancellor of the University of Peradeniya; Prof. J.B. Ekanayake (middle), the Principal Investigator of the AI4COVID project; and Dr Asela Gunawardena, the Director-General of Health Services in Sri Lanka.