Great applications cannot be built without equally great data. But there are situations in which getting access to the needed datasets proves very difficult.
Let’s discuss an example. Privacy matters to everyone. Laws enacted in the past decade require that access to personal data be safeguarded and limited to need-to-know cases. But this is also a challenge for anyone building tools in such restrictive environments. How can researchers test their models against medical data to investigate new vaccines for epidemics? It might be possible to remove personally identifiable information (PII) so the datasets can be used, but as time has proven, anonymized data is often easy to re-identify. And what happens if the data is completely off-limits, or does not exist at all?
This brings us to the question: how do we create tools without access to the data they need? Synthetic data is the answer. In short, synthetic data is generated digitally with simulations and algorithms, designed so that the resulting datasets have the same mathematical and statistical properties as real-world data.
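To make this definition concrete, here is a minimal sketch (a toy example with made-up numbers, not tied to any real dataset): it fits a simple statistical model to a stand-in “real” dataset and then samples brand-new synthetic records that share its statistical properties.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a sensitive real-world dataset (imagine two correlated
# medical measurements); in practice this would be the protected data.
real = rng.multivariate_normal(
    mean=[50.0, 120.0],
    cov=[[100.0, 30.0], [30.0, 80.0]],
    size=5000,
)

# Fit a simple statistical model to the real data...
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and draw brand-new synthetic records from that model.
# No synthetic row is a copy of a real row, yet the statistics match.
synthetic = rng.multivariate_normal(mu, cov, size=5000)

print("real means:     ", np.round(mu, 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```

Real synthetic-data generators use far richer models than a single Gaussian, but the principle is the same: learn the statistics, then sample fresh records.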
A brief history
The idea is not new. In 1993, Donald Rubin, a Harvard statistics professor, authored a paper on how to analyze data from a US census while preserving the anonymity of the respondents. His solution: generate anonymous census data that mimics the statistics of the real-world data. The idea proved popular among economic and medical researchers.
But it wasn’t until the mid-2010s that the idea gained traction in other fields such as machine learning. The automotive industry, and autonomous-vehicle research in particular, benefits widely from this concept. To build a truly safe autonomous vehicle, many driving scenarios need to be tested: rain, pedestrians, obstacles on the road, and so on. Gathering data for these scenarios in the real world might take too long or be virtually impossible. The industry’s solution is to generate the scenarios in simulation and train the machine learning models on that data.
Computer vision is another field that takes advantage of synthetic data. Labeling images has largely been a human job, and there is always a risk of mislabels, which in turn introduce errors into the data used to train the models. Synthetic data avoids this problem: because the labels are produced alongside the images, the generated data is always perfectly labeled and can be trusted to train deep-learning algorithms.
Moreover, generated data can include information that humans cannot easily label, such as depth and reflectance; machines have a much easier time producing these annotations. Models trained on this richer data give robots the ability to interact with the real world more effectively. Google demonstrated one such approach in its 2020 research on perceiving transparent objects.
Creating synthetic data
The importance of having tools to generate this kind of data has driven new algorithms and advances in recent years. Consider creating images to train deep-learning algorithms. One way is to use generative adversarial networks (GANs). This technique is a “game” between two players. The first player, a generator neural network, focuses on producing images that are mathematically similar to existing images. The second player, a discriminator neural network, is given a dataset composed of these newly generated images mixed with real images, with the mission of telling which are synthetic. As training progresses, the discriminator’s success rate eventually falls until it is no better than random guessing, which means the synthetic images can no longer be distinguished from the real ones. Creating images this way is far faster than gathering them in the real world.
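The adversarial game can be sketched in miniature. The example below is a toy illustration (not a full image GAN): the “generator” is a single parameter `theta` that shifts Gaussian noise, the “discriminator” is a one-feature logistic regression, and the two are updated in alternation until the generator’s output distribution approaches the real one. All numbers here are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

REAL_MEAN = 4.0            # the real distribution the generator must learn
BATCH, STEPS, LR = 128, 2000, 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Discriminator: D(x) = sigmoid(w * x + b), probability that x is real
w, b = 0.1, 0.0
# Generator: g(z) = theta + z, so theta is the mean it learns to fake
theta = 0.0

for _ in range(STEPS):
    x_real = REAL_MEAN + rng.normal(size=BATCH)
    x_fake = theta + rng.normal(size=BATCH)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * x_fake + b)
    w -= LR * np.mean(-(1 - d_real) * x_real + d_fake * x_fake)
    b -= LR * np.mean(-(1 - d_real) + d_fake)

    # Generator step: push D(fake) toward 1, i.e. fool the discriminator
    d_fake = sigmoid(w * x_fake + b)
    theta -= LR * np.mean(-(1 - d_fake) * w)   # chain rule: dg/dtheta = 1

print(f"learned mean: {theta:.2f} (real mean: {REAL_MEAN})")
```

As the generator’s mean approaches the real mean, the discriminator’s best strategy degrades toward a coin flip, which is exactly the equilibrium the paragraph above describes.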
Advantages and challenges of synthetic data
Synthetic data helps achieve the goals every researcher has for data: accuracy, high quality, balance, and freedom from bias. Built on the right mathematical models and created with perfect labeling, synthetic data effectively achieves the first three goals, while also being scalable and easy to use, since there is no need to clean the data by removing duplicates or inaccuracies.
Bias, however, remains a challenge. Society shows bias, and data collected from the real world tends to reflect it – sexism, racism, and ageism, to name a few. It is therefore important to audit models to understand whether they reproduce these patterns. Consider a recruitment platform that evaluates facial features and speech patterns: frequent and proper audits would be paramount to prevent ageism or gender bias from influencing how the system chooses candidates. For now, audit practices are still a work in progress, and no industry standard has been defined.
Realism and privacy are also concerns. The generated data cannot be so similar to the real-world data that it recreates the very privacy issues it was meant to avoid. But neither can it be unrealistic: training a model on data that does not mathematically reflect the original data is useless. Striking this balance is essential when the data is created.
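One simple way to picture this balance is to check both sides numerically. The sketch below (a toy example with made-up data, not an established industry metric) compares summary statistics for realism and nearest-neighbor distances for privacy.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy "real" dataset and a synthetic sample drawn from a fitted model.
real = rng.normal(loc=10.0, scale=2.0, size=(1000, 3))
synthetic = rng.normal(loc=real.mean(axis=0), scale=real.std(axis=0),
                       size=(1000, 3))

# Realism check: summary statistics should match the real data.
stat_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()

# Privacy check: no synthetic row should (near-)duplicate a real row.
# Distance from each synthetic row to its closest real row:
dists = np.linalg.norm(real[None, :, :] - synthetic[:, None, :], axis=2)
nearest = dists.min(axis=1)

print(f"max mean gap: {stat_gap:.3f}, "
      f"closest real/synthetic pair: {nearest.min():.4f}")
```

A small statistical gap together with a comfortably nonzero minimum distance suggests the generated data sits in the useful middle ground: realistic, but not a leak.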
Many companies are dedicated to creating synthetic data today, and Gartner estimates that by 2030 “synthetic data will completely overshadow real data in AI models.”
There are also projects like the Synthetic Data Vault, an ecosystem of libraries “that allows users to easily learn certain datasets to later generate new Synthetic Data that has the same format and statistical properties as the original dataset”. Tools like this help others build better models.
AI is here to stay, and the benefits synthetic data brings to tool development cannot be ignored. Challenges remain to be addressed, and new tools will appear in the coming years. Only time will tell how the ecosystem develops.
- Guide: Synthetic Data, https://datagen.tech/guides/synthetic-data/synthetic-data/
- How Synthetic Data Is Accelerating Computer Vision, https://hackernoon.com/how-synthetic-data-is-accelerating-computer-vision-xp153w6q
- Learning to See Transparent Objects, https://ai.googleblog.com/2020/02/learning-to-see-transparent-objects.html
- Is Synthetic Data the Future of AI?, https://www.gartner.com/en/newsroom/press-releases/2022-06-22-is-synthetic-data-the-future-of-ai
- Synthetic Data Is About to Transform Artificial Intelligence, https://www.forbes.com/sites/robtoews/2022/06/12/synthetic-data-is-about-to-transform-artificial-intelligence/
- The Real Promise of Synthetic Data, https://news.mit.edu/2020/real-promise-synthetic-data-1016
- The Synthetic Data Vault, https://sdv.dev/
- What Does an Audit for Bias in Automated Hiring Tech Really Mean?, https://www.hrbrew.com/stories/2022/04/01/what-does-an-audit-for-bias-in-automated-hiring-tech-really-mean