AI Researcher Lun Wang Departs DeepMind, Spotlights Gaps in LLM Evaluation

May 19, 2026

12:16

Artificial intelligence keeps getting smarter. The systems used to measure that intelligence? Not so much.

That’s the warning from Lun Wang, a senior researcher who recently left Google DeepMind and used his departure to spotlight what he sees as one of the biggest blind spots in modern AI development: the way large language models are evaluated.

In a post shared on X, formerly Twitter, Wang argued that current AI benchmarks are no longer sufficient for measuring increasingly advanced systems. His proposed solution — “self-evolving evals” — could reshape how the industry tests safety, intelligence, and reliability in the next generation of AI models.

The concern lands at a critical moment for the AI industry. Companies are racing to build more capable systems, but researchers increasingly worry that existing evaluation methods are too static to keep up.

TL;DR

Lun Wang has left Google DeepMind.
He says AI evaluation methods are becoming outdated.
Current benchmarks fail when models develop new behaviors or learn to “game” tests.
Wang proposes “self-evolving evals,” adaptive testing systems that improve alongside AI models.
The issue matters because weak evaluations could lead to flawed safety decisions and misleading performance claims.

Why AI Evaluation Has Become a Major Problem

AI benchmarks once served a simple purpose: to compare one model against another.

Researchers would feed systems a standardized set of questions, coding tasks, or reasoning problems. Scores helped determine which models performed better. But as AI capabilities accelerated, those tests began losing their usefulness.

Today’s large language models can memorise benchmark datasets, exploit patterns in evaluation methods, or perform impressively in narrow tests while failing badly in real-world scenarios.

That creates a dangerous gap between what AI appears capable of and what it can actually do.

Wang described this mismatch as “the most important unsolved problem” in understanding LLMs. The statement reflects a growing concern among AI researchers that the industry’s measurement tools are lagging behind the technology itself.

The benchmark problem in plain English

Imagine giving students the same exam every year.

Eventually, students memorise the answers instead of learning the subject. Their scores rise, but their understanding may not.

Researchers say something similar is happening with AI models.

Static benchmarks can become predictable. Once models are trained on enough internet data, they may effectively “see” parts of the tests beforehand. That can inflate performance scores without reflecting genuine reasoning ability.
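
One rough way to picture this kind of leakage is to check how much of a test question already appears, word for word, in a body of training text. The sketch below is only an illustration under stated assumptions: the function names, the sample corpus, and the choice of four-word sequences are all hypothetical, and it is not a method attributed to Wang or to any lab.

```python
def ngrams(text, n=5):
    """Return the set of word-level n-grams in text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item, corpus_docs, n=5):
    """Fraction of the item's n-grams that also appear in any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical training text and two hypothetical benchmark questions.
corpus = [
    "forum post: what is the capital of france and why is it paris",
    "unrelated text about cooking pasta at home on a weeknight",
]
leaked = "what is the capital of france and why"
fresh = "which river flows through the largest city in bavaria"

print(overlap_ratio(leaked, corpus, n=4))  # 1.0, every 4-gram is in the corpus
print(overlap_ratio(fresh, corpus, n=4))   # 0.0, no 4-gram is in the corpus
```

A high ratio suggests the question may have been seen during training. Exact matching like this would miss paraphrased or translated copies, so a low ratio does not prove a test is clean.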

What Are ‘Self-Evolving Evals’?

Wang’s proposed solution is straightforward in concept but difficult in execution.

Instead of fixed benchmarks, AI systems would be tested using dynamic evaluations that continuously adapt as models improve.

These “self-evolving evals” would:

Generate new testing scenarios automatically
Detect emerging capabilities in AI systems
Identify hidden weaknesses or deceptive behaviors
Adjust difficulty levels over time
Prevent models from simply memorizing answers

The goal is to create evaluation systems that evolve at nearly the same pace as the models themselves.
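
The article gives no implementation details, so the following is only a toy sketch of two of the listed ideas: generating new test items automatically and adjusting difficulty over time. Everything in it is an assumption made for illustration, including the addition task, the thresholds, and the stand-in `toy_model`.

```python
import random


def make_item(difficulty, rng):
    """Generate a fresh addition problem; operands grow with difficulty."""
    hi = 10 ** difficulty
    a, b = rng.randrange(hi), rng.randrange(hi)
    return f"{a} + {b}", a + b


def run_adaptive_eval(model, rounds=6, items_per_round=20, seed=0,
                      raise_at=0.9, lower_at=0.5):
    """Run several rounds, regenerating items and adjusting difficulty each round."""
    rng = random.Random(seed)
    difficulty = 1
    history = []
    for _ in range(rounds):
        # New items every round, so memorized answers do not help.
        items = [make_item(difficulty, rng) for _ in range(items_per_round)]
        correct = sum(model(q) == answer for q, answer in items)
        accuracy = correct / items_per_round
        history.append((difficulty, accuracy))
        if accuracy >= raise_at:
            difficulty += 1  # the model is coasting: make the test harder
        elif accuracy <= lower_at and difficulty > 1:
            difficulty -= 1  # the model is failing: step back toward its limit
    return history


def toy_model(question):
    """Hypothetical stand-in for a model: correct only when both operands are below 1000."""
    a, b = (int(part) for part in question.split(" + "))
    return a + b if max(a, b) < 1000 else None


for difficulty, accuracy in run_adaptive_eval(toy_model):
    print(difficulty, accuracy)
```

In this toy run the difficulty climbs until the stand-in model starts failing, then hovers around that boundary. A real system would face the harder problems the list names, such as detecting new capabilities or deceptive behavior, which this sketch does not attempt.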

Why adaptive testing matters

Current AI evaluations often focus on narrow capabilities:

Solving math problems
Writing code
Summarizing text
Answering factual questions

But advanced AI systems can display unexpected behaviors outside those controlled settings.

For example:

A model may excel in benchmark tests but hallucinate dangerous misinformation in open-ended conversations.
It may follow instructions correctly most of the time while quietly failing in edge cases.
It may appear aligned during testing but behave differently under pressure or when given novel prompts.

Adaptive evaluations could help researchers catch those issues earlier.

The Bigger Concern: AI Safety and Trust

Wang’s warning is not just about technical accuracy. It is also about governance and public trust.

If companies rely on outdated testing methods, they could make poor decisions about:

Deploying new AI systems
Granting broader autonomy to models
Releasing products to the public
Assessing risks tied to misinformation or manipulation

In other words, weak evaluations can create false confidence.

That concern has become increasingly important as AI companies compete to release more powerful models at a faster pace. Many labs now emphasise “frontier AI,” meaning systems designed to handle increasingly complex reasoning and autonomous tasks.

But measuring those systems remains difficult.

Why current benchmarks may fail

One major issue is that benchmarks often measure performance snapshots instead of long-term behavior.

A model might pass:

A coding test
A logic challenge
A safety filter check

Yet still behave unpredictably in live environments.

Researchers sometimes refer to this as the “capability-evaluation gap”: the difference between benchmark success and real-world reliability.

AI Researchers Are Increasingly Questioning Benchmarks

Wang is not alone in raising concerns about AI evaluation.

Across the industry, researchers have started questioning whether benchmark culture has distorted AI progress.

Some critics argue that companies optimize models specifically to score well on popular public tests. That can create leaderboard-driven development instead of genuine advances in reasoning or safety.

Others warn that many benchmarks become obsolete too quickly.

For example:

A benchmark released in 2023 may already be saturated by models trained on data from 2025.
Public datasets can leak into model training pipelines.
Some evaluations fail to measure multimodal reasoning, long-term planning, or deceptive behavior.

This is partly why companies have started building private evaluation systems that are harder for models to anticipate.

Still, no universal standard exists.

What Happens Next?

The idea of self-evolving evaluations is still largely conceptual. Building them would require:

Automated test generation
Constant dataset refreshes
Human oversight
Adversarial testing systems
Stronger safety auditing frameworks

It would also require cooperation across the AI industry, something that has historically been difficult in competitive technology races.

Yet the push for better evaluations is likely to intensify.

As AI systems gain stronger reasoning abilities and broader autonomy, the industry may no longer be able to rely on old-style benchmarks designed for earlier generations of models.

Wang’s departure from Google DeepMind adds extra visibility to that debate. His comments highlight a growing realization inside the AI community: building smarter models is only half the challenge.

Understanding them may be even harder.

Why This Story Matters Beyond Silicon Valley

The benchmark debate may sound technical, but it affects everyday users more than most people realise.

AI evaluations influence:

Which tools companies release
How safe chatbots are considered to be
Whether governments trust AI systems
How businesses integrate AI into workplaces
What risks regulators prioritize

If evaluation systems fail, the consequences can spread quickly, from misinformation problems to flawed automated decision-making.

That is why researchers increasingly see evaluation not as a side task but as a core part of responsible AI development.

And according to Wang, the industry is running out of time to modernize it.
