1 Department of Health Information Management, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran 1417744361, Iran.
2 Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4M6, Canada.
Correspondence to: Dr. Marzieh Esmaeili, Department of Health Information Management, School of Allied Medical Sciences, Tehran University of Medical Sciences, 3rd Floor, No #17, Farredanesh Alley, Ghods St, Enghelab Ave, Tehran, Iran. E-mail: firstname.lastname@example.org
© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Obtaining data is challenging for researchers, especially when it comes to medical data. Moreover, because of concerns about privacy and confidentiality, using medical data requires specific considerations. Generative models aim to learn the data distribution via various statistical learning approaches. Among generative models, a machine learning-based approach named Generative Adversarial Networks (GANs) has proved its potential in the implicit density estimation of high-dimensional data. Therefore, we suggest an approach in which each healthcare organization, especially hospitals, could create and share its own GAN model, entitled Hospital-Based GANs (H-GANs).
Machine learning, Generative Adversarial Networks, data sharing, anonymous
Obtaining data is challenging for researchers, especially when it comes to medical data. Because of concerns about privacy and confidentiality, using medical data requires specific considerations. At the same time, sharing these data is necessary to verify experiments and to extract more knowledge from the data. One potential solution for sharing data while preserving privacy is de-identification. The main concern with this approach is that the process could be reversed, unveiling the real patients’ identities. Another solution is to encourage patient populations to share their data by rewarding them or benefiting their communities. While this can be feasible for small health ecosystems, its scalability is questionable: the many stakeholders, including each individual patient, could hold different viewpoints, so reaching a consensus might be challenging. In this paper, we propose a new solution to the medical data sharing problem. The main idea behind our solution can be demonstrated by a simple example: assume that we want to share the heights of individuals without disclosing their identities. In this case, we could share the distribution of the heights instead (for a normal distribution, this means sharing the mean and standard deviation). Having the parameters of this distribution enables others to reuse the data by creating new samples of the heights. The cornerstone of this approach is identifying the distribution of the data. Estimating the data distribution becomes a very complicated task for high-dimensional data such as medical images. A well-studied branch of machine learning called generative models has emerged to address this problem.
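The height example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the heights are made-up values, the data are assumed to be normally distributed as in the example, and the function names `fit_normal` and `sample_heights` are hypothetical.

```python
import random
import statistics

def fit_normal(heights):
    """Summarize the private data by the two parameters of a normal distribution."""
    return statistics.mean(heights), statistics.stdev(heights)

def sample_heights(mean, std, n, seed=None):
    """Draw n synthetic heights using only the shared parameters."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]

# Made-up heights in centimetres, standing in for private records.
private_heights = [158.0, 162.5, 170.0, 171.2, 175.8, 180.1, 166.4, 169.9]

# Only these two numbers would leave the data holder.
mean, std = fit_normal(private_heights)

# A recipient can generate as many synthetic samples as needed.
synthetic = sample_heights(mean, std, n=10_000, seed=0)
```

The recipient never sees `private_heights`; the synthetic values follow the same fitted distribution without corresponding to any individual record.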
The underlying assumption in most machine learning tasks is that data samples are drawn from a unique data-generating distribution. Generative models aim to learn this distribution via various statistical learning approaches. Once we have the data-generating distribution, we can generate new samples that are not necessarily the same as the input data. Hence, generative models can be viewed as a secure tool for sharing new data while preserving patients’ privacy. Generative models fall into two categories: explicit density estimation and implicit density estimation. Here, we are interested in generating new samples from the data distribution rather than in obtaining a parametric form of that distribution. Among generative models, Generative Adversarial Networks (GANs) have proved their potential in the implicit density estimation of high-dimensional data.
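The difference between the two categories can be illustrated with a toy sketch. The parameters are illustrative, and `implicit_generator` is a hypothetical stand-in for a trained generator, not a real model.

```python
import random
from statistics import NormalDist

# Explicit density estimation: the model exposes a density we can evaluate.
explicit_model = NormalDist(mu=170.0, sigma=7.0)  # illustrative parameters
density_at_170 = explicit_model.pdf(170.0)

# Implicit density estimation: the model is only a sampler. It maps random
# noise z to a data point, and no density formula is ever written down.
def implicit_generator(z):
    """Hypothetical stand-in for a trained generator network."""
    return 170.0 + 7.0 * z + 1.5 * (z * z - 1.0)

rng = random.Random(0)
implicit_samples = [implicit_generator(rng.gauss(0.0, 1.0)) for _ in range(5)]
```

For data sharing, the sampler is all that is needed: a recipient wants new data points, not the value of a density function.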
Recently, deep learning has outperformed traditional methods in different areas, including computer vision, natural language processing, and image processing. Deep learning models are powerful in learning highly nonlinear mappings. GANs can be viewed as the marriage of deep learning and generative models. A GAN is composed of two neural networks: a generator and a discriminator. The generator tries to fool the discriminator by generating realistic data that are close to the distribution of the real data, and the discriminator tries to distinguish these so-called fake data from the real data. In other words, the training process is a minimax game. After training, only the generator network is required to generate new samples, and the discriminator can be discarded. As a result, the generator creates samples that come from the same distribution as the data. GANs have been used successfully to generate samples by learning the data-generating distribution from a limited amount of data. Currently, GANs are widely used to generate new texts and images for different purposes. One important application of GANs is to enhance the performance of classifiers that are trained on imbalanced datasets. An imbalanced dataset can severely affect the performance of a classifier, and such datasets are prevalent in medical applications. For example, in breast cancer datasets, the number of mammography images showing malignancy is much smaller than the number of benign ones. This biases the classifier towards the benign class. To solve this problem, GANs can be used to balance such datasets: we can train a GAN on the malignant images and then generate new samples of malignant cases.
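The minimax game can be made concrete with the standard GAN value function, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the discriminator D tries to maximize and the generator G tries to minimize. The sketch below estimates this value by Monte Carlo on toy one-dimensional data. All functions and numbers are hypothetical stand-ins chosen for illustration; no neural network is trained here.

```python
import math
import random

def gan_value(discriminator, generator, real_samples, noise_samples):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(discriminator(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(
        math.log(1.0 - discriminator(generator(z))) for z in noise_samples
    ) / len(noise_samples)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(170.0, 7.0) for _ in range(1000)]  # toy "real" data
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # generator input

def poor_generator(z):
    return 100.0 + z        # output far from the real data

def good_generator(z):
    return 170.0 + 7.0 * z  # output matches the real distribution

def fixed_discriminator(x):
    # Toy logistic score: close to 1 for values in the range of the real data.
    return 1.0 / (1.0 + math.exp(-(x - 135.0) / 10.0))

def undecided_discriminator(x):
    return 0.5              # cannot tell real from fake

v_poor = gan_value(fixed_discriminator, poor_generator, real, noise)
v_good = gan_value(fixed_discriminator, good_generator, real, noise)
v_undecided = gan_value(undecided_discriminator, good_generator, real, noise)
```

Against the same discriminator, the generator that matches the real data yields the lower value (`v_good < v_poor`), which is what the generator aims for. Once the generator matches the data, the discriminator does better by outputting 0.5 everywhere (`v_undecided > v_good`), giving the value 2 log 0.5.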
We suggest an approach in which each healthcare organization, especially hospitals, could create and share its own GAN, termed a Hospital-Based GAN (H-GAN), instead of sharing raw patient data. This solution provides a framework for sharing hospital data without violating patients’ privacy, by providing a generator of data instead of the patients’ data records. In summary, this solution provides three major advantages. First and foremost, it preserves patients’ privacy. Second, it enables researchers to create an unlimited amount of data to train complex models that require huge amounts of data, such as deep learning classifiers, and it also mitigates the imbalanced dataset issue. Third, it reduces the storage and bandwidth required for storing and transferring the data, since models are shared instead of whole images. For example, a dataset consisting of 5000 mammography images requires around 100 GB, while the GAN model created from this dataset is around 100 MB, a compression ratio of about 1000:1. At the next level, H-GANs could theoretically be combined to create multi-hospital, national, regional, and even global GANs, and these models could include a comprehensive range of samples.
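The storage figures in this example can be checked with simple arithmetic. The sizes are the approximate values quoted in the text; the decimal unit definitions and the 100 Mbit/s link speed are assumptions added only to illustrate the bandwidth argument.

```python
GB = 10**9  # decimal gigabyte, assumed
MB = 10**6  # decimal megabyte, assumed

dataset_bytes = 100 * GB  # about 5000 mammography images
model_bytes = 100 * MB    # GAN model created from the dataset
n_images = 5000

compression_ratio = dataset_bytes / model_bytes  # dataset size per unit of model size
avg_image_mb = dataset_bytes / n_images / MB     # average size of one image in MB

# Transfer time over a hypothetical 100 Mbit/s link.
link_bits_per_second = 100 * 10**6
dataset_transfer_s = dataset_bytes * 8 / link_bits_per_second
model_transfer_s = model_bytes * 8 / link_bits_per_second
```

Under these assumptions, each image averages 20 MB, and transferring the model takes 8 seconds compared with 8000 seconds (over two hours) for the full dataset.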
Made substantial contributions to the conception and design of the study and performed data analysis, interpretation and data acquisition, as well as providing administrative, technical, and material support: Ayyoubzadeh SM (Seyed Mohammad Ayyoubzadeh), Ayyoubzadeh SM (Seyed Mehdi Ayyoubzadeh), Esmaeili M
Availability of data and materials
Not applicable.
Financial support and sponsorship
None.
Conflicts of interest
All authors declared that there are no conflicts of interest.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Ayyoubzadeh SM, Ayyoubzadeh SM, Esmaeili M. Clinical data sharing using Generative Adversarial Networks. Conn Health 2022;1:98-100. http://dx.doi.org/10.20517/ch.2022.15