🔆 Data by Design: Shaping the Future with Synthetic Generation
The magic solution to AI's privacy and bias problem? Plus Star Trek prompts, AI trust levels dropping and more
🗞️ Issue 8 // ⏱️ Read Time: 6 min
Hello 👋
The benefits of AI depend on large amounts of data. But what happens when that data is unavailable? This week, we are turning our focus to synthetic data, an intriguing solution to some of the most pressing challenges facing AI development today: data availability, privacy, and bias. Let’s explore it together.
In this week’s newsletter
What are we talking about? The benefits, challenges, and emerging trends of synthetic data: artificially created information used as a substitute for real-world data in applications such as testing, training machine learning models, and validating mathematical models.
How is it relevant? It’s estimated that by 2030, synthetic data will completely overshadow real data in AI models.
Why does it matter? The rise of synthetic data represents a shift towards prioritizing privacy and ethics while enabling large-scale innovation in AI and machine learning. It could enable organizations to work with diverse, high-quality datasets while overcoming challenges related to cost, privacy, bias, and speed of production. As with real data, its successful and responsible use demands ethical, legal, and domain-specific considerations.
In data science, GIGO (garbage in, garbage out) refers to the idea that in any system, the quality of the input determines the quality of the output. AI holds big promises but is also notorious for its tendency to be biased and produce unfair results. Why? Simply put, the data used to train AI models often lacks quality. For example, one common way of gathering training data is to have crawlers scrape the internet for different types of information. However, not all groups are represented equally on the internet, and the amassed data reflects this inequality.
How can we integrate human expertise into the synthetic data generation process to ensure that the generated data reflects real-world scenarios accurately?
Correcting for present bias through modification of existing data
Let’s look at an example. A Nigerian engineering team aims to utilise computer vision to analyse and predict fashion trends. They look for publicly available data to train their model and find a wealth of data sets, but these feature only Western clothing. How do they solve this imbalance? They use AI to generate artificial images of African fashion: an entirely new data set, built from scratch, that fits their use case.
In short, synthetic data is data that has been artificially generated rather than collected from real-world events. Through sophisticated algorithms, such as Generative Adversarial Networks (GANs), we can create data that closely mimics authentic datasets without compromising individual privacy. By representing data with appropriate balance, density, distribution, and other crucial parameters, fair(er) AI systems can be created. This process not only protects privacy but also provides a fertile ground for AI to learn and grow in a controlled, bias-minimized environment.
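To make the idea of "appropriate balance" concrete, here is a minimal Python sketch. It is not a GAN: it swaps in a much simpler technique, fitting an independent Gaussian to each numeric column of an under-represented group and sampling new rows until the groups are the same size. The toy data, the function names (`fit_gaussian`, `sample_synthetic`, `balance`) and all numbers are illustrative assumptions, not taken from any real project.

```python
import random
import statistics

def fit_gaussian(rows):
    """Estimate a mean and standard deviation for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample_synthetic(params, n, rng):
    """Draw n synthetic rows, sampling each column independently."""
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

def balance(majority, minority, rng):
    """Top up the minority group with synthetic rows until group sizes match."""
    shortfall = max(0, len(majority) - len(minority))
    params = fit_gaussian(minority)
    return minority + sample_synthetic(params, shortfall, rng)

rng = random.Random(42)

# Hypothetical toy data: two numeric features per record.
majority = [[rng.gauss(50, 5), rng.gauss(10, 2)] for _ in range(200)]
minority = [[rng.gauss(30, 4), rng.gauss(20, 3)] for _ in range(20)]

balanced_minority = balance(majority, minority, rng)
print(len(majority), len(balanced_minority))  # both groups now have 200 rows
```

Sampling each column independently ignores correlations between features, which is one reason real projects reach for richer generators such as GANs. The sketch also shows the limit of the approach: the synthetic rows can only be as representative as the 20 real rows they were fitted on.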
The value of synthetic data extends across several dimensions:
Privacy Preservation: It allows for the rich exploration of data-driven insights without the ethical pitfalls of using real, personal data.
Bias Reduction: By controlling the data generation process, we can aim to reduce inherent biases present in real-world data, fostering fairer AI systems.
Innovation Acceleration: Synthetic data enables testing and development in scenarios where real data might be scarce or too sensitive to use. This enablement could boost innovation and allow for more sophisticated, strategic decision-making.
The Path Forward
Synthetic data is industry-agnostic: it can be used anywhere from healthcare to autonomous vehicles. Despite its benefits, it is not without its challenges. Ensuring the generated data accurately reflects real-world complexities, without inheriting existing biases or introducing new ones, requires careful management. One proposed solution is to set up multisectoral teams or AI excellence centres within organisations to provide this type of oversight. Additionally, navigating the balance between realism and privacy remains a critical concern.
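One way to picture the realism-versus-privacy tension is to measure both sides. The Python sketch below is a simplified illustration under stated assumptions, not an established audit procedure: `mean_gap` is a crude realism check (how far synthetic column averages drift from the real ones) and `closest_real_distance` is a crude privacy check (how close any synthetic row sits to a real record). Both function names, the two hypothetical "generators" and the toy data are made up for this example.

```python
import math
import random
import statistics

def mean_gap(real, synthetic):
    """Largest absolute difference between real and synthetic column means."""
    return max(
        abs(statistics.mean(r) - statistics.mean(s))
        for r, s in zip(zip(*real), zip(*synthetic))
    )

def closest_real_distance(real, synthetic):
    """Smallest Euclidean distance from any synthetic row to any real row."""
    return min(math.dist(s, r) for s in synthetic for r in real)

rng = random.Random(0)

# Hypothetical "real" records with two numeric features each.
real = [[rng.gauss(30, 4), rng.gauss(20, 3)] for _ in range(100)]

# Two hypothetical generators: one copies real rows, one samples fresh values.
copied = [list(row) for row in real[:50]]
sampled = [[rng.gauss(30, 4), rng.gauss(20, 3)] for _ in range(50)]

for name, synthetic in [("copied", copied), ("sampled", sampled)]:
    print(name, mean_gap(real, synthetic), closest_real_distance(real, synthetic))
```

The copied set has a closest-record distance of exactly zero: it is perfectly realistic and offers no privacy at all. The sampled set never reproduces a real record but drifts a little from the real averages. Real oversight needs far more than two numbers, but the trade-off has the same shape.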
What we are excited about:
A good (science fiction) prompt changes everything 👽 A recent study indicates that we might get better results if our prompts are Star Trek-inspired rather than objective and simple. 🖖🏽 To solve a set of 50 math problems, the most effective prompt was a weird one: “Command, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation. Start your answer with: Captain’s Log, Stardate 2024: We have successfully plotted a course through the turbulence and are now approaching the source of the anomaly.”
The need for responsible AI is gaining traction. Edelman, a global communications firm, warns that trust in AI, and in the companies building it, is dropping. "Those who prioritize responsible AI, who transparently partner with communities and governments, and who put control back into the hands of the users, will not only lead the industry but will rebuild the bridge of trust that technology has, somewhere along the way, lost." Team Lumiera’s conclusion: Trust is crucial for success. We can expect to see a push, both from businesses and governments, to integrate this into their AI strategies.
🎤 Big tech news of the week
Anthropic announced the Claude 3 model family, which includes three state-of-the-art multimodal models. In ascending order of capability, they are Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Opus, the most intelligent model, outperforms its peers on most of the common evaluation benchmarks for AI systems. Claude 3 models are trained on a proprietary mix of data, including synthetic data (hey, you know all about that now!) that Anthropic generated internally.
⚖️ Elon Musk is suing OpenAI, saying the company has diverged from its original, nonprofit mission by transforming “into a closed-source de facto subsidiary of the largest technology company in the world: Microsoft,” and keeping its code for its newest generative AI products a secret. Unsurprisingly, OpenAI “categorically disagrees” and has refused to comment further.
🐛 AI worms have been developed by researchers to demonstrate their ability to infect computers, steal data, and spread malware across systems using generative AI. While they have not yet been encountered in the wild, a door has been opened for cyberattacks never seen before.
🤖 Klarna announced its AI assistant powered by OpenAI. Some claim that a big part of the company’s Customer Ops has been outsourced to a US-based company operating in Malta, causing a massive backlog, and that this has created an opportunity for Klarna to play with numbers in relation to the AI assistant. In other words: The real effect of the AI assistant on Klarna’s operations? Not as big as it seems. Good PR? Yes. Well played, Klarna. Especially with the rumours of an IPO around the corner.
Until next time.
On behalf of Team Lumiera
Lumiera has gathered the brightest people from the technology and policy sectors to give you top-quality advice so you can navigate the new AI Era.
Follow the carefully curated Lumiera podcast playlist to stay informed and challenged on all things AI.