
🔆 Where Metrics Meet Meaning: Inside the AI Benchmarking Arena

From its origins to its profound impact on technology, learn how benchmarks shape the evolution of AI systems, driving progress, transparency, and trust.

🗞️ Issue 10 // ⏱️ Read Time: 5 min

Hello 👋

In this week's newsletter

What we’re talking about: The evolution and use of benchmarks to assess the capabilities of AI systems. Essentially, standardised targets for AI systems to hit.

How it’s relevant: AI benchmarking allows for a fair and objective comparison of different AI models. By using standardized benchmarks, we can compare different AI models and algorithms, make informed decisions, and work towards developing AI systems that align with ethical standards and societal values.

Why it matters: By providing standardized metrics, fostering competition, and pushing the boundaries of what's possible, benchmarks drive progress, transparency, and trust in AI technologies, propelling us toward innovative advancements and reliable AI systems.

Big tech news of the week…

🖥️ NVIDIA, a major player in the computer chip industry, unveiled “Blackwell,” its latest GPU architecture, during the NVIDIA GTC AI Conference. Featuring six transformative technologies for accelerated computing, including the B200 GPU and GB200 “superchip,” Blackwell represents a significant leap forward in computing power and efficiency compared to NVIDIA’s previous offerings. Read more about GPUs in our newsletter here!

🌍 Monoculture and bias: Large language models trained primarily on English data were found to use English internally, even when prompted in other languages. These models need broad knowledge of the world, and one way to encode it is to reason over concepts rather than individual words. Researchers theorize that this conceptual representation of the world is biased towards English.

⚖️ The US Securities and Exchange Commission (SEC) has fined two investment advisers for “AI Washing,” after they allegedly made false statements about their use of artificial intelligence technology. The two companies, Delphia (USA) and Global Predictions, settled with the SEC and agreed to pay $400,000 in civil penalties.

📱 Apple is in talks to let Google Gemini power iPhone AI features. This deal would further Google’s reach and highlight its upper hand in launching smartphone-related features. To date, Google has partnered with Samsung to add Gemini-powered AI features on the Galaxy devices and has also launched these features on its own Pixel series of phones. These negotiations are taking place while Apple develops its own generative AI models for future release.

🧱🙇‍♂️ LEGO apologizes to fans for using AI after the company released images that violated LEGO’s policy not to use generative AI to create LEGO content. “We fundamentally believe in the wonder and power of human creativity and will continue to encourage and celebrate the talented artists who help bring our brand and characters to life," LEGO said.

The Olympics for Artificial Intelligence: AI Benchmarking

This week we are looking at the pivotal role of benchmarks in shaping the future of AI technologies. Just as athletes compete in various standardised competitions, AI models strive to outperform each other.

The Lumiera Question of the Week

As AI continues to evolve, how will benchmarks adapt to capture the true potential and limitations of these systems?

Benchmarking has come a long way since its inception in the 1980s with the establishment of the System Performance Evaluation Cooperative (SPEC), which set the stage for measuring computer system performance. Originally tailored for central processing units (CPUs), SPEC benchmarks evolved to tackle real-world applications. This trend continued with initiatives like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and MLPerf, which standardised benchmarks for computer vision and machine learning. For language models, benchmarks such as the General Language Understanding Evaluation (GLUE) have become essential, ensuring models understand language across diverse tasks. As models approach human-level performance, newer benchmarks like the Holistic Evaluation of Language Models (HELM) and MLAgentBench have emerged, offering more nuanced and multifaceted evaluations of factors such as reasoning, fairness, and efficiency.

Some common benchmarks include:

  • MMLU (Undergraduate Level Knowledge)

  • GPQA (Graduate-Level Google-Proof Q&A)

  • GSM8K (Grade School Math)

  • HumanEval (Coding)

  • HellaSwag (Common Knowledge)
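To make the idea concrete, here is a minimal sketch, in Python, of the loop behind most of these benchmarks: a fixed set of question-answer pairs, a model under test, and a single standardised score that lets different models be compared. The sample items, the call_model placeholder, and the exact-match scoring are illustrative assumptions, not the actual protocol of any benchmark above.

```python
# Minimal, hypothetical benchmark-evaluation sketch: exact-match accuracy
# over a tiny set of GSM8K-style question/answer pairs.
from typing import Callable

# Illustrative sample; real benchmarks contain thousands of curated items.
SAMPLE_ITEMS = [
    {"question": "Lena has 3 boxes with 4 pens each. How many pens in total?", "answer": "12"},
    {"question": "A book costs 7 dollars. How much do 5 books cost?", "answer": "35"},
]

def call_model(prompt: str) -> str:
    """Placeholder: swap in a call to whatever model or API you are testing."""
    return "12"  # dummy response so the sketch runs end to end

def evaluate(model: Callable[[str], str], items: list[dict]) -> float:
    """Score the model by exact match on the final answer."""
    correct = sum(model(item["question"]).strip() == item["answer"] for item in items)
    return correct / len(items)

if __name__ == "__main__":
    print(f"Accuracy: {evaluate(call_model, SAMPLE_ITEMS):.0%}")
```

Real benchmark suites add prompt templates, much larger datasets, and more forgiving answer matching, but the shape of the loop is the same, and that shared shape is what makes scores comparable across models.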

Benchmarks have had a profound impact on both the computer hardware industry and AI research. In the hardware realm, they have influenced performance standards, guiding purchasing decisions and shaping the evolution of processing units and other components. By understanding hardware performance, developers can make informed choices about which hardware platforms are best suited for specific AI applications. Hardware manufacturers also use these benchmarks to identify areas for improvement, driving innovation in AI-specific chip designs.

In AI research, benchmarks have paved the way for advancements by providing objective, measurable targets to strive for. However, optimising models specifically for benchmark scores, a practice known as “benchmark engineering,” can lead to decreased performance on non-benchmark tasks (sometimes referred to as the “benchmark effect”).

🧗‍♀️ Challenges include:

  1. Considering whether benchmarks truly reflect real-world performance. As AI models are repositioned as general-purpose systems with weaker ties to older, task-specific benchmarks, researchers are encouraged to develop more comprehensive assessments that capture the true capabilities of AI systems and address issues like bias and interaction difficulties.

  2. Developing methods that lead to a more holistic assessment and promote trust among users. Evaluating AI models based on social factors and understanding how they handle uncertainty, ambiguity, or adversarial inputs is a good place to start.

  3. Increasing transparency. While there are efforts to promote transparency in AI development, comprehensive benchmarks for assessing transparency in training data and AI systems are still lacking.

  4. Finding the right AI model. Choosing the right model for a given use case requires careful consideration: the best model for your needs may not necessarily be the one at the top of a leaderboard (see the sketch after this list).
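To illustrate that last point, here is a hypothetical sketch of ranking candidate models against your own priorities (benchmark accuracy, latency, cost) instead of a single leaderboard number. The model names, metric values, and weights are made-up placeholders, not real measurements.

```python
# Hypothetical model-selection sketch: combine several criteria with weights
# that reflect your use case, rather than sorting by one benchmark score.
CANDIDATES = {
    "model_a": {"benchmark_score": 0.86, "latency_s": 2.0, "cost_per_1k_tokens": 0.030},
    "model_b": {"benchmark_score": 0.78, "latency_s": 0.6, "cost_per_1k_tokens": 0.002},
    "model_c": {"benchmark_score": 0.82, "latency_s": 1.0, "cost_per_1k_tokens": 0.010},
}

# Positive weights reward a metric, negative weights penalise it. A chat product
# might trade a few points of accuracy for lower latency and cost.
WEIGHTS = {"benchmark_score": 0.5, "latency_s": -0.2, "cost_per_1k_tokens": -0.3}

def normalise(values: dict[str, float]) -> dict[str, float]:
    """Scale one metric to 0..1 across candidates so the weights are comparable."""
    lo, hi = min(values.values()), max(values.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in values.items()}

def rank(candidates: dict) -> list[tuple[str, float]]:
    """Return candidates sorted by weighted sum of normalised metrics."""
    per_metric = {
        m: normalise({name: c[m] for name, c in candidates.items()}) for m in WEIGHTS
    }
    scores = {
        name: sum(WEIGHTS[m] * per_metric[m][name] for m in WEIGHTS)
        for name in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for name, score in rank(CANDIDATES):
        print(f"{name}: {score:+.2f}")
```

Changing the weights changes the winner, which is exactly the point: the “best” model depends on what your use case actually needs.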

As we continue to rely more on AI, it is essential to understand how we measure and compare its performance. Benchmarking provides one way to do this, and expanding this arena can help ensure that the AI systems we develop are reliable, fair, and beneficial to all. 

What we are excited about:

🪐 Have a friend who can never take a good photo? Finally, a solution for photos that look like they were taken on a potato!

Johns Hopkins researchers have developed an efficient new method to turn blurry images into clear, sharp ones.

Progressively Deblurring Radiance Field (PDRF) addresses various types of degradation, including camera shake, object movement, and out-of-focus scenarios.

🎨 The Algorithmic Frontiers Interactive Exhibit invites us to engage in a conversation on responsible AI.

It features 12 digital art pieces that counter gender, racial, and cultural biases in AI. The artist and curator, Valentine Goddard, designs and leads transdisciplinary programs that highlight the role of the arts and civil society in AI and digital governance.

Until next time.
On behalf of Team Lumiera

Emma - Business Strategist
Sarah - Policy Specialist
Allegra - Data Specialist

Lumiera has gathered the brightest people from the technology and policy sectors to give you top-quality advice so you can navigate the new AI Era.

Follow the carefully curated Lumiera podcast playlist to stay informed and challenged on all things AI.

What did you think of today's newsletter?
