Synthetic data for leveling the tech playing field

March 21, 2019

Big data has been the preserve of the largest tech companies, but synthetic data may be about to level the playing field.

https://www.youtube.com/watch?v=42QZggpb–M

The advent of synthetic data is upon us with implications for a variety of industries as well as applications for government intelligence agencies.

On March 12, 2019 the venture capital arm of the CIA, In-Q-Tel (IQT), formed a strategic partnership and investment with AI.Reverie, a leading provider of computer vision algorithms leveraging its proprietary synthetic data platform.

US government agencies, along with the military, are looking to AI.Reverie because the company can produce photorealistic virtual worlds that closely mimic any true location where clients’ services are used.

Why would intelligence agencies be interested in synthetic data? Let us take a look at what synthetic data is, and then we’ll explore its use cases.

What is Synthetic Data?

Synthetic data is information which is manufactured artificially rather than data generated from real world events. It is used to model and validate production, operational and mathematical datasets and in the training of machine learning models. In layman’s terms, it’s computer generated data which mimics real world data.

This artificial dataset then aids a computer or artificial intelligence (AI) neural network in how to respond to specific criteria or situations in the absence of or as a replacement for real world data.

What are its Use Cases?

Santa Clara-based Nvidia is using synthetic data in Nvidia Constellation – its cloud-based autonomous simulation platform. David Nister, VP of Autonomous Driving Software at Nvidia explained to VentureBeat that “by removing human error from the driving equation, we can prevent the vast majority of collisions and minimize the impact of those that do occur”.

Synthetic data is implicated in achieving this. Its software follows the core principal of collision avoidance rather than a range of preset rules. Calculations are performed on-vehicle, frame by frame. Both real world and synthetic data has been used to validate the data used within the collision avoidance system which has been based on high risk highway and urban driving scenarios.

The vast majority of car crashes are caused by human error. However, whilst issues to date with autonomous driving systems have been few and far between, they have been highly publicized. For that reason, the work that Nvidia is doing is important in reducing what is already a lower risk further still. Synthetic data is playing a significant role in that process. Additionally, it’s speeding up the modelling process.

Raman Mehta, CIO at automotive company, Visteon, suggests that not a lot of labelled data is required whilst at the same time, the learning curve of models is speeded up when using synthetic data.

Synthetic Data Market

Companies such as Facebook have made billions off the back of data. That market is likely to undergo considerable change in the years ahead with people being empowered to take ownership of their own data. However, there is also a market for synthetic data.

So long as companies embrace regulatory compliance related to synthetic data, then there is every reason to believe that it will be a valuable commodity for use on a commercial basis by enterprises.

In fact, it’s in consideration of the rise in the assertion of data privacy that synthetic data may become of more importance and relevance. In the UK, synthetic data is being used for the first time by Health Data Insights in the health sector.

Marketed as Simulacrum, the company partnered with British pharmaceutical giant AstraZeneca and the U.S. health information technology company, IQVIA, to produce a data pool of synthetic cancer data.

This artificial data is modeled on actual patient data collected by the National Cancer Registration and Analysis Service (NCRAS), a part of the UK’s National Health Service (NHS).

On a pharmaphorum webinar last November, Jem Rashbass, Medical Director at Health Data Insight, explained the company’s product offering as it pertains to synthetic data:

“We set out to create a dataset that had the look and feel and the same structure as the data held by Public Health England but was in fact entirely synthetic – and that’s what the Simulacrum is.”

Prediction models that rely on AI are also leaning on synthetic data to train the AI. A team working under Juliet Biggs, a volcanologist at Bristol University in England, is using AI to produce a model that can predict volcanic eruptions. It’s training its neural network on data generated synthetically from simulated eruptions.

Data Democracy but not Without its Pitfalls

As highlighted above, use cases are emerging for the technology. It’s democratizing in that it allows tech startups to compete with tech giants such as Google and Facebook who have had ongoing access to the largest of datasets. Using synthetic data also means cost savings by comparison with the collection of real world data.

With the integrity of personal data usage now being a contentious subject, such data can be anonymized – stripped of all personal information – and then used as synthetic data. This is likely to become a solution for many companies and research organizations as they seek to progress their projects whilst adhering to increasing standards and regulations in the area of personal data.

However, no new technology is without its limitations. Whilst artificial data may mimic the properties of corresponding real world data, it’s not a direct copy. Therefore, the subtleties of the dataset are not likely to be the same. This may hamper the capability of synthetic data in system training situations, negatively impacting on its output accuracy.

Added to that, the quality of synthetic data is largely dependent upon the efficacy of the model that was used to produce it. Another step is also required with the use of a verification server as an intermediate step to interrogate the initial data.

Synthetic data will clearly have a part to play in research and development, modelling and the training of AI neural networks. It is an excellent source of cheap data. However, it is not a replacement for human interpreted real world data. Both will still be relevant as machine learning and the machine to machine (M2M) economy develop.

Pat Rabbitte

Pat is a writer from the West of Ireland - currently living and working in Medellín, Colombia. He has always had an inquiring mind when it comes to new technology. His discovery of Bitcoin back in 2013 slowly led to a realisation of the implications of the underlying tech. As a consequence, Pat’s passion for blockchain technology has led him to focus his writing on the subject.
VIEW ALL POSTS

< Next Post

Is edtech functioning as it should in the classroom?

Previous Post >

Who are the new beneficiaries of the sharing economy?

Business Technology

Why cost per token no longer reflects the true cost of AI

Cost per token is an infrastructure metric. Cost per successful task is a business...

July 22, 2026 HackerNoon

Technology

Own nothing, rent everything: tenets of the subscription economy

Our society is not new to paying for access. Renting homes, leasing office spaces, subscribing to...

June 30, 2026 Uche Nneoma

Big Tech Government and Policy Technology

‘World Models’ are needed to train AI robots in 3D: WEF ‘Summer Davos’

World Models to train AI robots cracks the WEF Top 10 Emerging Technologies at this year's Annual...

June 24, 2026 Tim Hinchliffe

Sociable's Podcast

Brains Byte Back

Brains Byte Back interviews startups, entrepreneurs, and industry leaders that tap into how our brains work. We explore how knowledge & technology intersect to build a better, more sustainable future for humanity. If you’re interested in ideas that push the needle, and future-proofing yourself for the new information age, join us every Friday. Brains Byte Back guests include founders, CEOs, and other influential individuals making a big difference in society, with past guest speakers such as New York Times journalists, MIT Professors, and C-suite executives of Fortune 500 companies.

Every hype cycle has a sales guy. Crypto had them. AI agents have them now, and most of what's being sold as an ”agent” is old automation with a new label. It's called agent washing, and Gartner projects 40%+ of agentic AI projects will be cancelled by 2027.

So how do you tell the real thing from a fresh coat of paint? And even if it is real, do you actually need a complex agent for what you're looking to achieve?

Host Erick Espinosa sits down with Mariano Jurich, Senior Product Leader at Making Sense, who helps mid-market and PE-backed companies separate real AI value from plausible noise, and who tells clients, more often than you'd expect, that the boring workflow automation is the better buy.

Inside this episode:

The four traits that define a real AI agent
The ”if-then-do” test that exposes a washed agent in one sentence
Why ”agentic” is overkill for most problems companies bring him
Why agentic AI breaks when you treat it like a cloud migration
What to put in writing before you sign
Find out more about Mariano Jurich here.
Learn more about Making Sense here.
Reach out to today's host, Erick Espinosa – [email protected]
Get the latest on tech news – https://sociable.co/
Leave an iTunes review – https://rb.gy/ampk26
Follow us on your favourite podcast platform – https://link.chtbl.com/rN3x4ecY