Use of Synthetic Data, in Early Stage, Seen as an Answer to Data Bias 

By AI Trends Staff 

Assuring that the huge volumes of data on which many AI applications rely are not biased and comply with restrictive data privacy regulations is a challenge that a new industry is positioning itself to address: synthetic data production. 

Gary Grossman, Senior VP of Technology Practice, Edelman

Synthetic data is computer-generated data that can be used as a substitute for data from the real world. Synthetic data does not explicitly represent real individuals. “Think of this as a digital mirror of real-world data that is statistically reflective of that world,” stated Gary Grossman, senior VP of technology practice at Edelman, a public relations and marketing consultancy, in a recent account in VentureBeat. “This enables training AI systems in a completely virtual realm.”  

The more data an AI algorithm can train on, the more accurate and effective the results will be. 

To help meet the demand for data, more than 50 software suppliers have developed synthetic data products, according to research published last June by StartUs Insights, a consultancy based in Vienna, Austria. 

One alternative for addressing privacy concerns is anonymization: masking or removing personal data, such as names and credit card numbers in ecommerce transactions, or stripping identifying content from healthcare records. “But there is growing evidence that even if data has been anonymized from one source, it can be correlated with consumer datasets exposed from security breaches,” Grossman states. Such correlation can even be performed with publicly available data, no security hack required.  

A primary tool for building synthetic data is the same one used to create deepfake videos: generative adversarial networks (GANs), a pair of neural networks. One network generates the synthetic data and the second tries to detect whether it is real. The AI improves over time, with the generator network raising the quality of the data until the discriminator can no longer tell the difference between real and synthetic.  
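
To make the mechanics concrete, here is a minimal sketch of that generator-versus-discriminator loop in PyTorch. The two-column “real” dataset is a stand-in for whatever tabular data is being mimicked; all names and sizes are illustrative, not drawn from the article.

    import torch
    import torch.nn as nn

    # Stand-in "real" data: 1,000 rows drawn from a fixed distribution.
    real_data = torch.randn(1000, 2) * torch.tensor([1.0, 0.5]) + torch.tensor([3.0, -1.0])

    # One network generates candidate rows from noise...
    generator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
    # ...and a second tries to detect which rows are real.
    discriminator = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

    loss_fn = nn.BCEWithLogitsLoss()
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

    for step in range(2000):
        # Train the discriminator to separate real rows from generated ones.
        fake = generator(torch.randn(64, 8)).detach()
        real = real_data[torch.randint(0, len(real_data), (64,))]
        d_loss = (loss_fn(discriminator(real), torch.ones(64, 1))
                  + loss_fn(discriminator(fake), torch.zeros(64, 1)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Train the generator to fool the discriminator.
        g_loss = loss_fn(discriminator(generator(torch.randn(64, 8))), torch.ones(64, 1))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()

    # After training, the generator emits new synthetic rows on demand.
    synthetic = generator(torch.randn(500, 8)).detach()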

A goal for synthetic data is to correct for bias found in real-world data. “By more completely anonymizing data and correcting for inherent biases, as well as creating data that would otherwise be difficult to obtain, synthetic data could become the saving grace for many big data applications,” Grossman states. 

Big tech companies including IBM, Amazon, and Microsoft are working on synthetic data generation. However, it is still early days and the developing market is being led by startups.  

A few examples: 

AiFi — Uses synthetically generated data to simulate retail stores and shopper behavior;  

AI.Reverie — Generates synthetic data to train computer vision algorithms for activity recognition, object detection, and segmentation;  

Anyverse — Simulates scenarios to create synthetic datasets using raw sensor data, image processing functions, and custom LiDAR settings for the automotive industry. 

Synthetic Data Can Be Used to Improve Even High-Quality Datasets  

Dawn Li, Data Scientist, Innovation Lab, Finastra

Even if you have a high-quality dataset, acquiring synthetic data to round it out often makes sense, suggests Dawn Li, a data scientist at the Innovation Lab of Finastra, a company providing enterprise software to banks, writing in InfoQ. 

For example, if the task is to predict whether a piece of fruit is an apple or an orange, and the dataset has 4,000 samples for apples and 200 samples for oranges, “Then any machine learning algorithm is likely to be biased towards apples due to the class imbalance,” Li stated. If synthetic data generation can supply 3,800 more examples of oranges, the model will have no bias toward either fruit and can make more accurate predictions. 
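
Li’s apples-and-oranges example can be illustrated with SMOTE from the imbalanced-learn package, one common technique for synthesizing minority-class samples (the article itself does not name a specific tool). The features here are random stand-ins:

    import numpy as np
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    # 4,000 "apple" rows and 200 "orange" rows of made-up features.
    X = np.vstack([rng.normal(0.0, 1.0, (4000, 3)),
                   rng.normal(2.0, 1.0, (200, 3))])
    y = np.array([0] * 4000 + [1] * 200)  # 0 = apple, 1 = orange

    # SMOTE interpolates between minority-class neighbors to create
    # the 3,800 synthetic oranges needed to balance the classes.
    X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X, y)
    print(np.bincount(y_balanced))  # [4000 4000]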

For data you wish to share that contains personally identifiable information (PII), where anonymization would be impractically time-consuming, synthetic samples modeled on the real dataset can preserve its important characteristics and be shared without the risk of invading privacy or leaking personal information.  

Privacy issues are paramount in financial services. “Financial services are at the top of the list when it comes to concerns around data privacy. The data is sensitive and highly regulated,” Li states. As a result, the use of synthetic data has grown rapidly in the industry. And while obtaining more real financial data is slow, since it accumulates only through real-world activity, synthetic data can be generated and put to use immediately.  

Another popular method for generating synthetic data, in addition to GANs, is the variational autoencoder, a neural network trained to reconstruct its own input. Traditional supervised machine learning tasks map an input to a separate output; with autoencoders, the input is also the target. The network has an encoder and a decoder: the encoder compresses the input into a smaller representation, and the decoder takes that compressed representation and tries to rebuild the original input. By scaling the data down through the encoder and building it back up through the decoder, the network learns how to represent the data. “If we can accurately rebuild the original input, then we can query the decoder to generate synthetic samples,” Li stated.  
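
A compact sketch of such a variational autoencoder, again in PyTorch, shows the encoder-decoder round trip and the final step of querying the decoder for synthetic rows. The data and layer sizes are placeholders, not Finastra’s setup:

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, n_features=4, n_latent=2):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU())
            self.to_mu = nn.Linear(16, n_latent)      # mean of the compressed code
            self.to_logvar = nn.Linear(16, n_latent)  # log-variance of the code
            self.decoder = nn.Sequential(nn.Linear(n_latent, 16), nn.ReLU(),
                                         nn.Linear(16, n_features))

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sample the code
            return self.decoder(z), mu, logvar

    data = torch.randn(1000, 4)  # placeholder for a real tabular dataset
    vae = VAE()
    opt = torch.optim.Adam(vae.parameters(), lr=1e-3)

    for epoch in range(200):
        recon, mu, logvar = vae(data)
        recon_loss = ((recon - data) ** 2).sum()  # how well we rebuild the input
        kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())
        loss = recon_loss + kl
        opt.zero_grad()
        loss.backward()
        opt.step()

    # "Query the decoder": decode random latent points into synthetic rows.
    with torch.no_grad():
        synthetic = vae.decoder(torch.randn(500, 2))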

To validate the synthetic data, Li suggested checking statistical similarity and machine learning efficacy. To assess similarity, view side-by-side histograms, scatterplots, and cumulative sums of each column to confirm the distributions look alike. Then examine correlations: plot correlation matrices for the real and synthetic datasets to see how closely the relationships between columns match.  
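
Assuming the real and synthetic datasets are pandas DataFrames with matching columns, those checks might look something like the sketch below (function names are illustrative):

    import pandas as pd
    import matplotlib.pyplot as plt

    def compare_histograms(real: pd.DataFrame, synth: pd.DataFrame, column: str):
        # Side-by-side histograms of one column, real vs. synthetic.
        fig, axes = plt.subplots(1, 2, sharex=True, sharey=True)
        real[column].hist(ax=axes[0])
        axes[0].set_title(f"real: {column}")
        synth[column].hist(ax=axes[1])
        axes[1].set_title(f"synthetic: {column}")
        plt.show()

    def correlation_gap(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
        # Element-wise distance between the two correlation matrices;
        # values near zero mean column relationships were preserved.
        return (real.corr(numeric_only=True) - synth.corr(numeric_only=True)).abs()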

To assess machine learning efficacy, pick a target variable or column, train a model on the synthetic data, and score it with appropriate evaluation metrics. “If it performs well upon evaluation on real data, then we have a good synthetic dataset,” Li stated. 
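
One way to operationalize that check, sometimes called train-on-synthetic, test-on-real, is sketched below with scikit-learn; the arrays are random stand-ins for a generator’s output and a held-out real set:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    # Stand-ins: in practice X_synth comes from the generator and
    # X_real is a held-out slice of the genuine dataset.
    X_real = rng.normal(size=(500, 4))
    y_real = (X_real[:, 0] > 0).astype(int)
    X_synth = rng.normal(size=(500, 4))
    y_synth = (X_synth[:, 0] > 0).astype(int)

    # Fit only on synthetic rows, then score against real rows.
    model = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)
    print("accuracy on real data:", accuracy_score(y_real, model.predict(X_real)))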

Best Practices for Working with Synthetic Data  

Best practices for working with synthetic data were suggested in a recent account in AIMultiple, written by Cem Dilmegani, founder of the firm, which seeks to “democratize” AI.   

First, work with clean data. “If you don’t clean and prepare data before synthesis, you can have a garbage in, garbage out situation,” he stated. He recommended following the principles of data cleaning and of data “harmonization,” in which the same attributes from different sources are mapped to the same columns.  
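
As an illustration of that harmonization step, the hypothetical pandas sketch below maps differently named columns from two sources onto one schema and applies basic cleaning before synthesis:

    import pandas as pd

    # Two sources that name the same attributes differently.
    source_a = pd.DataFrame({"cust_name": ["Ann"], "txn_amt": [12.5]})
    source_b = pd.DataFrame({"customer": ["Bob"], "amount": [30.0]})

    # Harmonize: map both sources onto one shared schema.
    schema = {"cust_name": "name", "customer": "name",
              "txn_amt": "amount_usd", "amount": "amount_usd"}
    combined = pd.concat([source_a.rename(columns=schema),
                          source_b.rename(columns=schema)], ignore_index=True)

    # Basic cleaning before synthesis: drop empty and duplicate rows.
    combined = combined.dropna().drop_duplicates()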

Also, assess whether synthetic data is similar enough to real data for its application area. Its usefulness will depend on the technique used to generate it. The AI development team should analyze the use case and decide whether the generated synthetic data is a good fit for it.  

Finally, outsource support if necessary. The team should identify the organization’s synthetic data capabilities and outsource to fill the gaps. Both data preparation and data synthesis can be automated by software suppliers, he suggests. 

Read the source articles and information in VentureBeat, in InfoQ and in AIMultiple. 
