The healthcare industry puts the “big” in “big data.” Healthcare accounts for approximately one-third of the world’s total data, and the amount of data is projected to grow faster than in any other industry. The data is complex and highly sensitive, presenting unique risks for providers and challenges for quality engineers and testers.
“Regulations like HIPAA and GDPR[1] have strict security-related guidelines and rules about protecting sensitive patient information,” says Naresh Kumar Nunna, Head of Test Data Management at Qualitest. “Noncompliance can put patient privacy at risk and cost millions in fines. But since the pandemic, digital data sharing has increased and interoperability has become the norm to exchange information across providers and patients. There are more smart devices and other wearables around now, and they do more things, like heart monitoring, observing patient behaviors, and blood-level checks for more personalized recommendations. More and more patients are using telemedicine and video consultations compared to before. So, there is more private health data, and more need to share it.”
The applications and systems that enable healthcare data sharing require continuous testing of functionality, security and user acceptance. Testing requires dedicated test data, which is often collected by duplicating live production data and transferring it in its entirety to various test environments. That’s where the problems can start.
For one thing, there are risks to business performance. Duplicating a live production database—which can contain many millions of records—and pushing the data down into lower environments slows the whole SDLC, including the release cadence. It’s incompatible with DevOps and Agile practices, and storage and compute can be expensive, especially in the cloud without cloud-optimized solutions.
Then there are the privacy-protection concerns. “HIPAA classifies 18 different attributes[2] as protected health information (PHI),” says Naresh. “Data that falls into those 18 categories must be de-identified—for example, under HIPAA’s Safe Harbor method[3]—before it can be used. HIPAA also imposes a minimum necessary standard, which limits the use of PHI to only what’s needed for an intended purpose. It’s important to have an optimized test data process that provides only the right set of data you need for your test. And at the same time as the data is provisioned, you have to obfuscate it.”
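The minimum necessary principle can be enforced at provisioning time: hand each test only the fields it needs, with PHI fields obfuscated on the way out. Here is a minimal sketch; the record layout, field names, and the `provision` helper are hypothetical illustrations, not part of any real tool.

```python
# Sketch of "minimum necessary" test data provisioning.
# The patient records, field names, and masking placeholder below are
# hypothetical examples for illustration only.

def provision(records, needed_fields, masked_fields):
    """Return only the fields a test needs, obfuscating PHI on the way out."""
    subset = []
    for rec in records:
        row = {}
        for field in needed_fields:
            value = rec[field]
            if field in masked_fields:
                value = "***MASKED***"  # stand-in for a real masking routine
            row[field] = value
        subset.append(row)
    return subset

patients = [
    {"name": "Jane Doe", "ssn": "123-45-6789", "age": 42, "diagnosis": "J45"},
    {"name": "John Roe", "ssn": "987-65-4321", "age": 58, "diagnosis": "E11"},
]

# A diagnostics test only needs age and diagnosis plus a masked identifier;
# the SSN never leaves the production dataset at all.
test_data = provision(patients, ["name", "age", "diagnosis"], {"name"})
print(test_data[0])  # {'name': '***MASKED***', 'age': 42, 'diagnosis': 'J45'}
```

The key point is that fields a test does not need (here, the SSN) are never copied into the lower environment in the first place.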
Test data management best practices call for de-identifying PHI with techniques such as data masking, where words or characters are scrambled, shuffled, or replaced with symbols to obfuscate the real information. Other techniques include replacing names with pseudonyms, swapping addresses and dates, and encryption.
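The masking techniques above—scrambling characters, substituting pseudonyms, and shifting dates—can be sketched in a few lines. This is a simplified illustration, not a production-grade de-identification tool; the pseudonym pool and offset bounds are arbitrary example values.

```python
import hashlib
import random
from datetime import date, timedelta

def scramble(text, seed=0):
    """Character shuffle: preserves length and alphabet, hides the content."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

def pseudonymize(name, pool=("Alex Smith", "Sam Jones", "Pat Lee")):
    """Map a real name to a stable pseudonym drawn from a fake-name pool."""
    digest = int(hashlib.sha256(name.encode()).hexdigest(), 16)
    return pool[digest % len(pool)]  # same input always yields same pseudonym

def shift_date(d, max_days=30, seed=0):
    """Swap a date for one within a bounded random offset of the original."""
    offset = random.Random(seed).randint(-max_days, max_days)
    return d + timedelta(days=offset)

masked_record = {
    "name": pseudonymize("Jane Doe"),
    "ssn": scramble("123-45-6789"),
    "dob": shift_date(date(1980, 4, 1)),
}
```

Deterministic pseudonyms (via hashing rather than random choice) matter in practice: the same patient must map to the same fake identity across tables, or referential integrity in the test data breaks.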
Of course, as solutions grow more sophisticated and algorithms evolve, threats are evolving too. “It’s not just masking names and social security numbers anymore,” notes Naresh. “It’s device IDs, biometric sensors, fingerprints—all the modern information being collected from customers needs to be protected too. We’ve engineered much better solutions that are easy to understand, easy to use and easy to consume, so it’s not as complicated as it used to be. But it’s important to remember that the quality of the testing depends on the quality of the data itself, and on maintaining a culture of quality within an organization.”
Now there is a faster, easier way to obtain reliable test data that bypasses privacy concerns entirely: Create exactly the dataset you need for each test with synthetic data generation. “Say you want patient data with different genders, different ethnicities, different locations, different ages,” Naresh illustrates. “When you keep applying various attributes, it can go into thousands and thousands of permutations. Automated synthetic data generation tools help quickly create all that permutational data in all the varieties and volumes you need. Imagine the risk of developing and evaluating AI/ML models that analyze patient diagnostics and medical history using incomplete or missing data. That can cause more harm than good, undermining the fundamental benefits AI/ML is meant to deliver.”
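The permutation growth Naresh describes is easy to see: the number of synthetic records is the product of the attribute pool sizes. A minimal sketch, with hypothetical attribute values (real generators draw from far larger pools and add realistic clinical fields):

```python
import itertools

# Hypothetical attribute pools for illustration; a real synthetic data
# generator would use much larger, more realistic value sets.
genders = ["F", "M", "X"]
ethnicities = ["A", "B", "C", "D"]
regions = ["Northeast", "South", "Midwest", "West"]
age_bands = ["0-17", "18-40", "41-65", "66+"]

# Every combination of attributes becomes one fully synthetic patient record,
# so no real patient data is involved and no masking is required.
synthetic_patients = [
    {"gender": g, "ethnicity": e, "region": r, "age_band": a}
    for g, e, r, a in itertools.product(genders, ethnicities, regions, age_bands)
]

print(len(synthetic_patients))  # 3 * 4 * 4 * 4 = 192 permutations
```

Add a few more attributes with a handful of values each and the count multiplies into the thousands, which is exactly why automated generation beats hand-built test datasets for coverage.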
Synthetic data safely removes roadblocks in current privacy protection practices and opens up new paths to innovation. Developers and data scientists can build models that are stronger and more predictive. They can access data on demand, receive quick feedback on their code and continuously refresh the data in the CI/CD pipeline.
For ongoing improvement in the digital customer experience, they can conduct continuous analysis of data that will always be up to date. “There’s no longer a need to wait for database administrators or DevOps teams to refresh the data and supply a copy,” Naresh says. “New tools can not only create data, but also deliver it in the EDI (electronic data interchange) format that has become standardized across the industry for interoperable data sharing. Test data is moving towards a self-service, on-demand model.”
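For readers unfamiliar with EDI, healthcare transactions (such as the HIPAA X12 837 claim) are encoded as segments of delimited elements rather than as JSON or XML. The toy sketch below only shows that segment/element shape; the field choices are illustrative and do not constitute a valid HIPAA transaction, which requires strict ISA/GS/ST envelopes and many mandated segments.

```python
def build_segments(patient):
    """Emit toy X12-style segments: elements joined by '*', segments by '~'.

    Illustrative only -- a real HIPAA EDI transaction (e.g. an 837 claim)
    has strict envelope and segment requirements not shown here.
    """
    segments = [
        ["NM1", "IL", "1", patient["last"], patient["first"]],  # name segment
        ["DMG", "D8", patient["dob"], patient["gender"]],       # demographics
    ]
    return "~".join("*".join(seg) for seg in segments) + "~"

edi = build_segments(
    {"last": "DOE", "first": "JANE", "dob": "19800401", "gender": "F"}
)
print(edi)  # NM1*IL*1*DOE*JANE~DMG*D8*19800401*F~
```

Delivering synthetic test data already shaped like these segments means it can flow straight through the same interoperability interfaces the production systems use.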
It’s also moving towards more AI-generated data. According to Gartner, 60% of all data used in the development of AI will be synthetic by 2024.[4] “Intelligent synthetic data takes away the continuous concern of protecting customer data and at the same time provides data for scenarios that don’t exist, or data for new products being developed,” Naresh explains. “This all brings agility to accelerate test execution and reduce defect leakage into production, with data covering all permutations and combinations.”
Industry-aligned quality engineering solutions can help healthcare organizations reach business goals while offering patients and customers a superior digital experience.
FOOTNOTES
1 HIPAA (Health Insurance Portability and Accountability Act) applies in the US. GDPR (General Data Protection Regulation) applies in the UK and the EU.
2 HIPAA 18 PHI identifiers: https://compliancy-group.com/18-hipaa-identifiers-for-phi/
3 HIPAA Safe Harbor law: https://www.hipaajournal.com/hipaa-safe-harbor-law/
4 Gartner: https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/