Gerstein lab develops method to share functional genomics data while protecting participants’ privacy

November 20, 2020

Rapid advancements in genome sequencing technology have enabled the generation of detailed functional genomic datasets. The datasets allow for the quantification of epigenetic and transcriptomic states, which can inform therapeutic strategies at individual and population-wide levels. With wider accessibility to sequencing datasets, however, comes concern regarding security and privacy.

To address these privacy concerns, Mark Gerstein’s lab recently developed a method to minimize the misuse of data generated by functional genomics. Published in the journal Cell, the study outlines a means of applying data sanitization to next-generation sequencing reads from functional genomics experiments. Sensitive genetic information is currently kept behind firewalls, and the proposed method provides a way to share this information with researchers and clinicians without compromising the privacy of subjects.

The study was led by Gamze Gursoy, a postdoctoral research associate in the Gerstein lab. She said the study was motivated by a need to promote access to valuable health information without breaching the security of patients’ genomic files. “Although we knew that next-generation sequences from functional genomics contain individuals’ genetic variants, there was no systematic quantification of this leakage under different conditions. This quantification was needed to understand the amount of private information to sanitize from the data,” Gursoy said.

By quantifying the privacy leakage in reads through statistical linkage of study participants to known individuals, the study outlines how to share raw functional genomics reads while protecting sensitive information. To generate a template of linkages for the study, the authors used coffee cups from members of the Miranker group. From the coffee cups, they were able to generate highly accurate reference genomes within a realistic environmental sample.

“The amount of genetic information about an individual that we can gather from a coffee cup was way higher than what I expected,” Gursoy said.

Importantly, the approach described in the study provides a means to process data as a secure input in existing file formats and within existing functional genomics pipelines. Furthermore, sharing large sets of genomic data will ease the burden on data storage, which continues to be a problem as participation in genomics studies increases. The data sanitization process thus provides a readily adaptable means to democratize access to existing and future ChIP-seq, RNA-seq, ATAC-seq and other raw functional genomics datasets that are currently restricted behind firewalls. The method offers an opportunity to promote widespread access to genomic datasets while maintaining the privacy of patients. Ultimately, the goal is to maximize the exchange of valuable genomic reads while minimizing the amount of private data requiring special access and storage.

In the course of the study, Gursoy said she was surprised by how much information contained within functional genomics datasets could be removed while maintaining information useful to researchers. “An appreciable amount of private information could be removed and yet the data was still of high utility when, for example, gene expression levels or transcription factor binding peaks were re-calculated from the sanitized data,” Gursoy said.

The study can be found in the most recent issue of Cell.

By: Brigitte Naughton