When to Use Synthetic Data

In brief

Synthetic data is not a substitute for understanding your customers. It is a way to keep working when the real records are locked behind privacy rules, too rare to model, or too slow to share.
The forecast is not subtle. Gartner expected about 60% of the data used to develop AI and analytics to be synthetically generated by 2024, up from roughly 1% in 2021. The question for CX teams is no longer whether, but where.
The discipline that separates value from risk is validation. Use synthetic data where you can prove it behaves like the real thing, and keep real data for the decisions where being wrong is expensive.

Most customer experience teams do not have a data shortage. They have a data access problem. The records that would answer the question sit inside systems governed by privacy law, vendor contracts, or a security team that says no by default. So the model waits, the test environment stays empty, and the insight that could have shipped this quarter ships next year, if at all.

Synthetic data is the response to that bottleneck. It is generated by an algorithm that learns the statistical shape of real records, the distributions and the correlations between fields, and produces new records that look like the original without belonging to any actual person.
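As a minimal sketch of that idea, assuming just two numeric fields and a Gaussian shape (production generators typically use richer models), the Python below learns the means, spreads, and one correlation from a set of records and then samples new ones. The field names (tenure_months, monthly_spend), the "real" records, and the helper names are all hypothetical and invented for illustration; they do not come from any particular tool.

```python
import math
import random
import statistics

def fit_two_fields(rows):
    """Learn the mean, spread, and correlation of two numeric fields."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, computed by hand so the sketch needs only the stdlib.
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mx, my), "sd": (sx, sy), "rho": cov / (sx * sy)}

def sample_synthetic(model, n, rng):
    """Draw n new records that share the learned shape but copy no real row."""
    (mx, my), (sx, sy), rho = model["mean"], model["sd"], model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Mixing z1 into the second field reproduces the learned correlation.
        rows.append((mx + sx * z1, my + sy * (rho * z1 + residual * z2)))
    return rows

# Hypothetical stand-in for real customer records (made up for this sketch).
rng = random.Random(7)
real_rows = []
for _ in range(500):
    tenure_months = rng.uniform(1, 60)
    monthly_spend = 20 + 0.8 * tenure_months + rng.gauss(0, 10)
    real_rows.append((tenure_months, monthly_spend))

model = fit_two_fields(real_rows)
synthetic_rows = sample_synthetic(model, 2000, rng)
refit = fit_two_fields(synthetic_rows)

print(f"real correlation:      {model['rho']:.2f}")
print(f"synthetic correlation: {refit['rho']:.2f}")
```

The two printed correlations should land close together, which is the sense in which the synthetic records keep the "shape" of the originals. A Gaussian fit like this one can also emit impossible values (a negative tenure, for instance) and offers no formal privacy guarantee by itself, which is part of why validation matters.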

The shift is already underway

The trajectory is steep enough that ignoring it is a strategic choice, not a neutral one.

60%: the share of data used to develop AI and analytics projects that Gartner expected to be synthetically generated by 2024, up from roughly 1% in 2021. Source: Gartner, via Tech Monitor

Exhibit 1

Synthetic data goes from rounding error to majority in three years

2021 (actual): 1%

2024 (forecast): 60%

Source: Gartner forecast, reported by Tech Monitor

The market is moving with it. Grand View Research values the synthetic data generation market at roughly 218 million dollars in 2023, growing to nearly 1.8 billion by 2030 at a compound rate above 35%. That is not hype money. It is procurement teams treating synthetic data as infrastructure.
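The growth rate can be sanity-checked from the two endpoints. The short Python sketch below applies the standard compound annual growth rate formula to the figures quoted above; treating "nearly 1.8 billion" as exactly 1,800 million and counting seven compounding years from 2023 to 2030 are assumptions made for this check, not details taken from the Grand View Research report.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1

start_musd = 218        # 2023 market size in millions of dollars, as quoted above
end_musd = 1800         # "nearly 1.8 billion" by 2030, rounded for this check
years = 2030 - 2023     # assumption: seven compounding periods

rate = cagr(start_musd, end_musd, years)
print(f"Implied compound annual growth rate: {rate:.1%}")
```

Under those assumptions the implied rate comes out just above 35%, consistent with the figure in the text.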

When it earns its place

Synthetic data is worth the effort in a short list of situations, and a forced fit everywhere else.

The data is sensitive. Customer records carry personally identifiable information and regulated fields. A synthetic copy preserves the relationships analysts need without linking back to a real individual, which lets development and testing proceed without exposing live data.
The data is scarce. Rare but decisive moments, such as a hardship claim, a cancellation call, or a fraud edge case, barely appear in the logs. Synthetic generation gives a model enough examples of the long tail to learn from.
The data cannot move. When contracts or jurisdiction stop you from sharing real records with a partner or an offshore team, a synthetic version travels freely and unblocks collaboration.
The real thing does not exist yet. Testing a new product, journey, or channel before launch means there is no history to draw on. Synthetic scenarios fill the gap so the system is stress-tested before a customer ever touches it.

Synthetic data is the answer to a data access problem, not a data quality problem. If your real records are good and reachable, use them.

The reason any of this works is that a properly validated synthetic copy can carry the same signal. In a controlled MIT experiment, data scientists working from synthetic datasets produced results that, in most tests, showed no significant difference from those of teams working with the real data.

11 of 15: the tests in which teams using synthetic data showed no significant performance difference from those using real data in MIT's Synthetic Data Vault study (about 73%, reported by MIT as 70% of the time). Source: MIT News

Exhibit 2

Synthetic data matched real data in most predictive tests

No significant difference vs real data: 11 of 15

Real data outperformed: 4 of 15

Source: MIT, Synthetic Data Vault study

When to keep your hands off it

The failure mode is treating synthetic data as a free substitute for real data. It is not. The generator can miss the rare events that matter most, smooth over relationships that drive the outcome, or quietly inherit the bias of the data it learned from. For high-stakes decisions such as credit, eligibility, or anything else that touches a person’s livelihood, synthetic data belongs in testing, not in production scoring.

The privacy pressure that makes synthetic data attractive is also intensifying the risk of cutting corners. In Cisco’s 2025 benchmark, nearly half of organizations admitted to entering personal or non-public information into generative AI tools, and almost every respondent expects to shift budget toward AI. McKinsey, meanwhile, finds that 47% of organizations have already experienced at least one negative consequence from generative AI. The lesson is not to slow down. It is to validate.

A short framework for getting it right

Define what good enough means before you generate anything. Tie it to a downstream metric, such as model accuracy on real holdout data, not to how realistic the records look.
Match the method to the data. Simple perturbation, generative models, and rule-based scenarios solve different problems. Pick for the use case, not the brand name.
Validate against held-out real data. Check distributions, the key correlations, and whether a model trained on synthetic data performs on real cases.
Document the limits and keep humans on the high-stakes calls. Use synthetic data to complement real data, not to replace the judgment that should sit behind a decision that affects a customer.
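The validation step in the framework above can be sketched in a few dozen lines. The Python below is an illustration under stated assumptions, not a reference implementation: the records, the field meanings (tenure, spend), and the helper names are hypothetical, and the "good" synthetic set is simply a fresh draw from the same made-up process standing in for a well-behaved generator. It runs three checks: per-field distribution distance (a two-sample Kolmogorov-Smirnov statistic), the gap in a key correlation, and whether a model trained on synthetic data performs on real held-out cases.

```python
import bisect
import random
import statistics

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples (0 = identical)."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect.bisect_right(a, v) / len(a)
                   - bisect.bisect_right(b, v) / len(b)) for v in a + b)

def pearson(rows):
    """Pearson correlation between the two fields of each row."""
    mx = statistics.fmean(r[0] for r in rows)
    my = statistics.fmean(r[1] for r in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    syy = sum((y - my) ** 2 for _, y in rows)
    return sxy / (sxx * syy) ** 0.5

def fit_line(rows):
    """Least-squares line predicting field 1 from field 0."""
    mx = statistics.fmean(r[0] for r in rows)
    my = statistics.fmean(r[1] for r in rows)
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return slope, my - slope * mx

def mae(line, rows):
    """Mean absolute error of a fitted line on a set of rows."""
    slope, intercept = line
    return statistics.fmean(abs(y - (slope * x + intercept)) for x, y in rows)

def validate(synth, real_train, real_holdout):
    """Distribution, correlation, and train-on-synthetic/test-on-real checks."""
    return {
        "max_ks": max(ks_statistic([r[i] for r in synth],
                                   [r[i] for r in real_train]) for i in (0, 1)),
        "corr_gap": abs(pearson(synth) - pearson(real_train)),
        "tstr_mae": mae(fit_line(synth), real_holdout),       # synthetic -> real
        "trtr_mae": mae(fit_line(real_train), real_holdout),  # real -> real baseline
    }

# Hypothetical data: (tenure, spend) pairs from a made-up process.
rng = random.Random(11)
def make_rows(n):
    rows = []
    for _ in range(n):
        tenure = rng.uniform(1, 60)
        rows.append((tenure, 20 + 0.8 * tenure + rng.gauss(0, 10)))
    return rows

real_train, real_holdout = make_rows(400), make_rows(200)
good_synth = make_rows(400)  # stand-in for a generator that learned the shape

# A "bad" set: each field looks realistic, but the pairing is shuffled,
# so the relationship between the fields is lost.
shuffled_spend = [s for _, s in good_synth]
rng.shuffle(shuffled_spend)
bad_synth = [(t, s) for (t, _), s in zip(good_synth, shuffled_spend)]

good = validate(good_synth, real_train, real_holdout)
bad = validate(bad_synth, real_train, real_holdout)
for name, report in (("good", good), ("bad", bad)):
    print(name, {k: round(v, 3) for k, v in report.items()})
```

The shuffled set is the instructive case. Its per-field distributions are identical to the good set's, so it would pass a check based on how realistic the records look, yet its correlation gap and its error on real holdout data should both be clearly worse. Any pass or fail thresholds applied to these numbers are a project decision and should be fixed before generation, in line with the first step of the framework.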

Used this way, synthetic data does exactly one thing well: it removes the access bottleneck so your team can build, test, and learn at the speed the business actually needs. The teams that win are not the ones generating the most synthetic records. They are the ones who know precisely which decisions deserve the real thing.


Sources

Gartner forecast, reported in "Most AI training data could be synthetic by next year," techmonitor.ai.
Grand View Research, "Synthetic Data Generation Market Size & Share Report, 2030," grandviewresearch.com.
MIT News, "Artificial data give the same results as real data, without compromising privacy," news.mit.edu.
MIT Sloan, "What is synthetic data, and how can it help you competitively?," mitsloan.mit.edu.
Cisco, "2025 Data Privacy Benchmark Study," newsroom.cisco.com.
McKinsey & Company, "The state of AI," mckinsey.com.
