Cyber Bytes: What Is Synthetic Data?

From technology companies to health care institutions, everyone stands to benefit from synthetic data. Learn about synthetic data and its potential uses in real life.

What is synthetic data?

Synthetic data is data artificially created by computers. Computers can be trained to understand and replicate real-life statistical patterns without gathering and storing real-life information. Synthetic data provides research opportunities in areas previously off-limits due to a lack of data or privacy issues. This is helping to advance technology, health care, cybersecurity and other industries. Learn how synthetic data is transforming the handling and usability of data.

How is synthetic data created?

When real-life data is unavailable, difficult to obtain or contains sensitive information, synthetic data can be created using computer algorithms.

Synthetic data can be a cross-sampling from anonymized real-life data and synthesized information. Anonymized data is data from real-life studies with the identifiable information removed. Anonymizing data is a common way to handle privacy issues, especially when dealing with legally protected data like health care information.

Certain information is protected under privacy laws like the federal Health Insurance Portability and Accountability Act (HIPAA), the California Consumer Privacy Act (CCPA) and the European Union’s General Data Protection Regulation (GDPR). Some organizations are legally required to anonymize data before using it for research or artificial intelligence (AI) model training.

For instance, say a hospital is participating in a medical study using sensitive patient information. It would replace the personally identifiable information (PII) in the original records, like names and birthdates, with noise. Patient names would become randomly generated IDs, and exact ages would become age ranges, but the patients’ diagnoses and treatment information would stay. The noise makes it much harder for someone to reverse engineer the data and expose the participants.
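The hospital example can be sketched in a few lines of Python. This is a minimal illustration, not a compliance-ready process: the records, field names and the `anonymize` and `age_range` helpers are all invented for this example, and swapping names for IDs and bucketing ages does not by itself guarantee that nobody can be re-identified.

```python
import secrets

# Hypothetical patient records; every field name and value here is invented.
records = [
    {"name": "Alice Example", "age": 34, "diagnosis": "asthma", "treatment": "inhaler"},
    {"name": "Bob Example", "age": 61, "diagnosis": "diabetes", "treatment": "insulin"},
]

def age_range(age, width=10):
    """Map an exact age to a bucket such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def anonymize(record):
    """Drop the name, generalize the age and keep the clinical fields."""
    return {
        "patient_id": secrets.token_hex(8),  # random ID, not derived from the name
        "age_range": age_range(record["age"]),
        "diagnosis": record["diagnosis"],
        "treatment": record["treatment"],
    }

anonymized = [anonymize(r) for r in records]
```

Each anonymized record keeps the diagnosis and treatment but carries a random 16-character ID instead of a name and an age range instead of an exact age.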

Another example is a technology company that anonymizes the data it collects to train its AI assistants on speech patterns. They’d strip the identifying client information from their spoken commands, focusing only on language and accent recognition. They’d train their AI assistants to analyze different languages, conversational commands and regional accents without connecting customers to their conversations.

While anonymizing data is an effective way to safeguard PII, it’s an extra step. In some cases, you can synthesize data without including anonymized data. This makes synthetic data even more appealing to companies and individuals.

Synthetic data companies

Vendors now offer off-the-shelf solutions to help you analyze your data and plan how to synthesize it. A quick search for synthetic data companies returns dozens of results. When seeking assistance from a synthetic data company, know your business needs and data type.

For example, structured data is organized, easily accessible and stored in a fixed format, like a spreadsheet. In contrast, unstructured data lacks a specific form or organization, which makes it harder to collect and analyze. Most data is unstructured, including text files, emails, videos, social media posts and customer reviews.

Selecting a synthetic data company can be challenging, particularly when it comes to accuracy and relevance. If structured data is altered too much, it could become irrelevant.

Synthesized data strives to be statistically similar to actual customer data but doesn’t account for unpredictable human behaviors and emotions. You’ll need to strike a balance.

Evaluating your data to create usable synthetic data

You can use techniques related to data simulation and generative models to create synthetic data without anonymized real-life data. It still requires planning, including establishing the goal and limits of the data project and the data type needed to achieve your purpose. Some steps for creating and using synthetic data include:

Understand the characteristics of real-life data

Even though you’re not using real-life data, you should understand the type of data you want to synthesize. For example, does it aim to analyze customer retention, loan repayment probability, medical recovery or email click-through rates? What are the key features of the data you’ll need to conduct an assessment?
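For the customer retention case, data simulation without any real records might look like the sketch below. The feature names, the distributions and the rule linking tenure to churn are all assumptions made up for illustration; in practice they would come from what you know about your own customers.

```python
import random

def simulate_customers(n, seed=0):
    """Generate n hypothetical customer records from hand-picked distributions."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    rows = []
    for i in range(n):
        tenure_months = rng.randint(1, 60)
        monthly_spend = round(max(0.0, rng.gauss(50.0, 15.0)), 2)
        # Assumed rule: longer-tenured customers are less likely to leave.
        churn_probability = max(0.05, 0.5 - 0.007 * tenure_months)
        rows.append({
            "customer_id": i,
            "tenure_months": tenure_months,
            "monthly_spend": monthly_spend,
            "churned": rng.random() < churn_probability,
        })
    return rows

customers = simulate_customers(1000)
```

Writing down the key features this way makes the assumptions explicit, which helps when you later check whether the synthetic data is realistic enough for your analysis.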

Decide on a generative model to create the data

You can use different models, like a generative adversarial network (GAN) or a variational autoencoder (VAE):

  • GANs are machine learning models with two parts: a generator and a discriminator. The generator creates synthetic data, and the discriminator evaluates its authenticity. They interact until the generated data is so realistic that the discriminator can’t identify it as artificial.
  • VAEs are a type of deep learning model that can learn to create data similar to the data they’ve been trained on. They work like a translator program, first compressing input data into a summary and then rebuilding data from that summary with new variations. They continue to create variations in the data within specific parameters.
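To make the generator and discriminator interaction concrete, here is a deliberately tiny GAN written in plain Python. It is a toy under strong assumptions: the "real" data is a single number drawn from a bell curve, the generator can only shift and scale random noise, and the discriminator is a one-input logistic model. Real GANs use neural networks and a deep learning library, and even this toy can oscillate rather than settle. The function and parameter names are invented for this sketch.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(real_mean=4.0, real_std=1.0, steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator: fake = a * z + b
    w, c = 0.0, 0.0   # discriminator: P(real) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.gauss(real_mean, real_std)
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b

        # Discriminator step: minimize -[log D(real) + log(1 - D(fake))].
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= lr * ((d_real - 1.0) * x_real + d_fake * x_fake)
        c -= lr * ((d_real - 1.0) + d_fake)

        # Generator step: minimize -log D(fake), so fakes look more "real".
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (d_fake - 1.0) * w
        a -= lr * grad_x * z
        b -= lr * grad_x
    return (a, b), (w, c)

(gen_a, gen_b), (disc_w, disc_c) = train_toy_gan()
sample_rng = random.Random(1)
synthetic = [gen_a * sample_rng.gauss(0.0, 1.0) + gen_b for _ in range(5)]
```

The two update steps pull in opposite directions: the discriminator learns to score real values higher than generated ones, and the generator adjusts its output to raise its own score.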

Train the generative model

Input the data that matches your desired data type. If you’re creating a risk management analysis for recent college graduates, you’ll only sample information from a data set that covers that group. Much of the data for training generative models comes from real-life public sources, like Google Cloud’s BigQuery public datasets or Data.gov.

Generate your data

Once you’ve trained your AI model, you can use it to generate synthetic data. The synthetic data should have the same structure and statistical properties as the data it was trained to mimic. However, it shouldn’t have any sensitive PII.
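The train and generate steps can be shown with a model far simpler than a GAN or VAE. In the sketch below, "training" just estimates a mean and standard deviation from a handful of numbers, and "generating" draws new numbers from the fitted bell curve using Python's built-in `statistics.NormalDist`. The training values are invented, and assuming a normal distribution is a simplification that would not suit every data set.

```python
from statistics import NormalDist

# Hypothetical training values (for example, loan amounts in thousands).
training_values = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 16.3]

# "Training" here is just estimating a mean and standard deviation.
model = NormalDist.from_samples(training_values)

# "Generating" draws new values from the fitted distribution.
synthetic_values = model.samples(1000, seed=42)
```

None of the 1,000 generated values is copied from the training set, yet together they follow roughly the same average and spread.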

Run quality checks

Compare your synthetic data with real-life data to ensure it maintains a similar statistical distribution. Remember, if you put bad data in, you’ll get bad data out.
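One way to run such a comparison is sketched below. It reports the gap in mean and standard deviation and a two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two cumulative distributions (0 means identical samples, 1 means no overlap). The helper names and sample values are invented for illustration, and a real project would choose checks and thresholds that fit its own data.

```python
from bisect import bisect_right
from statistics import mean, stdev

def ks_statistic(real, synthetic):
    """Largest gap between the two empirical cumulative distributions (0 to 1)."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    gaps = (
        abs(bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted))
        for x in real_sorted + synth_sorted
    )
    return max(gaps)

def compare(real, synthetic):
    """Summarize how closely the synthetic sample tracks the real one."""
    return {
        "mean_gap": abs(mean(real) - mean(synthetic)),
        "stdev_gap": abs(stdev(real) - stdev(synthetic)),
        "ks": ks_statistic(real, synthetic),
    }

# Hypothetical samples for illustration only.
real = [1.0, 2.0, 3.0, 4.0, 5.0]
close = [1.1, 2.1, 2.9, 4.2, 4.8]
far = [10.0, 11.0, 12.0, 13.0, 14.0]
```

Calling `compare(real, close)` should give small gaps, while `compare(real, far)` should give a large mean gap and a KS statistic at its maximum.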

Once you’ve verified your data, you can use it in your project analysis. Synthetic data benefits individuals and organizations by opening up AI, machine learning and simulation possibilities.

Uses for synthetic data

Synthetic data sounds technical, but it’s not only for tech companies. Wherever you use data (or wish you could) is an area where you can apply synthetic data.

  • Human resources professionals use synthetic data to analyze employee training applications against employee performance and review data without exposing employee information. Synthetic data can also put employees at ease and make them more willing to give honest feedback.
  • Software development teams use synthetic data to sandbox test programs without compromising client data. They can stress test systems using data that simulates real-life data.
  • Marketing teams use synthetic data to refine email campaigns. They might use it to conduct advanced A/B testing or predict customer interactions with social media posts.
  • Transportation and logistics companies use synthetic data to run advanced “what if” risk management scenarios on shutdowns in different parts of their supply chains (like natural disasters, supplier bankruptcies or shipping route blockages). Businesses can use synthetic data to anticipate pitfalls and improve their backup plans.
  • Health care organizations replicate patient histories to produce drug trial data and explore alternative treatments. Medical professionals can use synthetic data for training simulations, practicing procedures or testing diagnostic imaging systems. Synthetic genomic data can replicate DNA sequences and be used in biological research to understand disease patterns.
  • Financial institutions use synthetic data for risk management and fraud detection. It can help them better understand client behavior, like credit repayment. Investment platforms might use time-series synthetic data to capitalize on trends analyzed over long-term data inputs.
  • Auto manufacturers are using image-based synthetic data to train autonomous vehicles. They can train vehicles to recognize real-life images, anticipate behaviors and make decisions based on photo and video inputs.
  • Weather services can use synthetic data to make longer-term predictions. They can anticipate weather patterns, allowing people to decide where to build structures or homes.
  • Gaming and virtual reality industries use synthetic data to create realistic, immersive environments. As education advances, synthetic data might be widely accessible for hands-on learning and skill-building.
  • Video, image and audio systems use synthetic data to replicate real-life videos with unique characters, test surveillance systems and improve video processing. On the other hand, scammers also use it to create deepfakes and other disinformation schemes.
  • Cybersecurity teams use synthetic data to simulate different types of cyberattacks and test their incident response effectiveness. This helps organizations identify vulnerabilities and improve their cybersecurity and incident response planning.
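As one illustration from the list above, the supply chain “what if” scenario can be sketched as a simple simulation that generates thousands of synthetic years and counts the delays. Every number here is made up: the disruption types come from the example above, but the probabilities and delay ranges are placeholders, not industry figures.

```python
import random

# Hypothetical yearly disruption probabilities and delay ranges (in days).
DISRUPTIONS = {
    "natural_disaster": (0.05, (10, 30)),
    "supplier_bankruptcy": (0.02, (30, 90)),
    "route_blockage": (0.10, (3, 14)),
}

def simulate_year(rng):
    """Return the total delay days for one simulated year."""
    total = 0
    for probability, (low, high) in DISRUPTIONS.values():
        if rng.random() < probability:
            total += rng.randint(low, high)
    return total

def simulate_years(n, seed=0):
    """Generate n synthetic years of delay totals, reproducibly."""
    rng = random.Random(seed)
    return [simulate_year(rng) for _ in range(n)]

delays = simulate_years(10_000)
share_disrupted = sum(d > 0 for d in delays) / len(delays)
```

Changing a probability or delay range and rerunning the simulation shows how sensitive a backup plan is to each kind of disruption.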

Remember, synthetic data is algorithmically created. Because it’s not directly linked to real data, it can be a cost-effective way to explore possible outcomes when real-world data is unavailable. It has tremendous potential for fields like space travel and rare disease treatments.

From autonomous vehicles to medical studies, synthetic data is becoming a pivotal player in business development.