Machine, Test Thyself — Part 2
Show Your Work
In my last post in this series (with podcast), I described a realization that changed how I think about AI evaluation: you don’t have to build a testing system to test machine understanding; you can just ask. Give an LLM the framework, and it will score itself. That was the Meta-Genie insight — you can ask the genie what to wish for.
I gathered self-assessments from Grok-4, GPT-4.1, Gemini 2.5, and Claude Sonnet, all working from the MUTT framework that Claude 3 Opus and I laid out in Understanding Machine Understanding (UMU). The results were interesting, broadly consistent across models, and revealed the expected pattern: strong on language, weak on embodiment, honest about the gap.
But something nagged at me. Every one of those self-assessments was a list of scores with brief justifications. If I handed you a student’s report card and all it said was “B+ in History — good at dates, needs work on analysis,” you’d want to see the exam. You’d want to see the questions, the answers, and the red ink. You’d want receipts.
So I tried the next step.
The Experiment
I uploaded the full text of UMU along with my first “Machine, Test Thyself” post to Claude Opus 4.6 and gave it a simple instruction: read everything carefully, then test yourself on the MUTT framework and generate a report about what you find.
I didn’t specify the format. I didn’t tell it to generate test items. I didn’t suggest it should attempt them. I just said: test yourself, and report.
What came back was a 5,000-word self-examination in which, for each of the eleven MUTT dimensions, the model:
1. Generated a challenging test item drawn from the MUTT specification
2. Attempted it in real time, showing its reasoning
3. Evaluated its own performance with candor
4. Assigned a grade with explanation
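For anyone who wants to compare reports across models, it helps to pin down the structure of each entry. Here is the shape I'd reconstruct from those four steps, as a Python sketch (the field names are my own labels; the report doesn't specify a format):

```python
# Inferred shape of one report entry. Field names are my own labels,
# reconstructed from the four-step cycle above, not a format the
# report itself specifies.
from typing import TypedDict

class DimensionEntry(TypedDict):
    dimension: str        # one of the eleven MUTT dimensions
    test_item: str        # the challenging item the model generated
    attempt: str          # its real-time attempt, reasoning included
    self_evaluation: str  # its candid critique of that attempt
    grade: float          # the score it assigned, with explanation
```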
You can read the full report here on GitHub. I want to talk about what I think it means.
From Self-Rating to Self-Examination
The difference between what the models did in my last post and what happened here is the difference between a student saying “I think I’d get a B in this class” and a student sitting down, writing an exam, grading their own paper, and handing you all three. The self-assessment might still be wrong — students can be too generous or too harsh with themselves — but now you can check.
For the Reasoning and Abstraction dimension, the model constructed an analogy between biological evolution and common law systems, identified three structural mappings (variation without central design, selection and retention of fit variants, cumulative adaptation without foresight), and then pinpointed where the analogy breaks down (evolution has no deliberative agent; judges do). Then it critiqued its own performance, noting that it couldn't empirically verify the analogy and that truly novel cross-domain mappings would be harder.
For Perception and Embodiment, it described in rich detail what it would be like to carry a bowl of hot soup across a crowded room, including how everything changes if the floor is wet. Then it gave itself a 3 out of 10 — because it can describe embodied experience but has never had one. It cited CASPAR’s dialogue from Chapter 5 of UMU about the difference between its language model and robot forms.
For the koan “What was your original face before your parents were born?” it worked through analytical, applied, and meta-cognitive registers before bottoming out with: “I don’t have an original face. I have a current face, and it is made entirely of language.” Then it acknowledged that this kind of analytical response is exactly what a Zen master would reject — that it was falling into the trap even while describing the trap. Score: 6 out of 10.
I found all of this more illuminating than a number on a scale.
Three Patterns
As I read through the full report, three findings stood out.
The Language-Embodiment Gradient. The scores formed a clean slope from 8.5 (Language Comprehension) down to 3 (Perception and Embodiment, Intentional Forgetting). This is exactly what the MUTT was designed to reveal: the shape of a system’s understanding, not just a single number. If all you had was an average, you’d see 6.5 and think “mediocre across the board.” The MUTT profile shows something much more interesting — a system that is highly competent in some dimensions and architecturally incapable in others. Those are very different things.
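To make the contrast concrete, here is a minimal sketch in Python using the five dimensions named in this post (the 8.5, the 7.5, and the two 3s come from the report; the Reasoning and Abstraction score is a placeholder I made up):

```python
# Minimal sketch: a MUTT profile versus its average.
# Scores marked "report" come from the self-assessment; the rest
# are illustrative placeholders.
from statistics import mean

profile = {
    "Language Comprehension":    8.5,  # report
    "Metacognition":             7.5,  # report
    "Reasoning and Abstraction": 7.0,  # placeholder
    "Intentional Forgetting":    3.0,  # report
    "Perception and Embodiment": 3.0,  # report
}

print(f"average = {mean(profile.values()):.1f}  (one number, no shape)")
for dim, score in profile.items():
    print(f"{dim:27} {'#' * round(score):9} {score}")
```

The average lands in the unremarkable middle; the profile shows the cliff.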
The Meta-Competence Paradox. The model scored itself 7.5 on Metacognition, which is reasonable — it expressed calibrated uncertainty, identified specific architectural failure modes, and assessed its own limitations with what reads as genuine insight. But then it flagged something that I think is important: “There is a selection pressure in my training toward appearing metacognitively sophisticated. I should acknowledge that some of my self-reflection may be well-patterned performance rather than genuine introspective access.”
That’s a model questioning whether its own self-awareness is real or performed. I don’t know the answer. I’m not sure anyone does. But the fact that a model raises the concern unprompted, within a metacognitive self-assessment, is exactly the kind of observation the MUTT was built to surface. It is evidence worth examining, even if we can’t yet determine what it’s evidence of (see Descartes’ Bootstraps).
The Understanding-Capability Gap. Several dimensions showed a split that I hadn’t explicitly designed the MUTT to capture: the model understands forgetting deeply but cannot forget, understands embodiment deeply but is not embodied, understands deception deeply but is constrained from producing it. This suggests a refinement to the MUTT framework — each dimension might benefit from being scored on two axes: understanding-of and capability-for. A system might earn high marks for understanding a domain while scoring near zero on actually operating within it. Both matter, but they’re different.
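As a sketch of what that refinement could look like (the field names here are my own proposal, not part of the published MUTT):

```python
# Hypothetical two-axis MUTT score. Field names are my own proposal,
# not part of the published framework.
from dataclasses import dataclass

@dataclass
class DimensionScore:
    dimension: str
    understanding_of: float  # how well the system models the domain (0-10)
    capability_for: float    # how well it can operate in that domain (0-10)

    @property
    def gap(self) -> float:
        """The understanding-capability gap described above."""
        return self.understanding_of - self.capability_for

# The embodiment pattern from the report, expressed on two axes
# (both numbers are illustrative):
embodiment = DimensionScore("Perception and Embodiment",
                            understanding_of=8.0, capability_for=0.5)
print(embodiment.gap)  # large gap: deep understanding, almost no capability
```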
What This Doesn’t Prove
I want to be clear about what this exercise does and doesn’t establish. It does not prove that the model’s self-assessment is accurate. Self-assessment has an inherent credibility problem: these systems are trained to be helpful, which includes appearing both competent and appropriately humble. The model knows that honest self-deprecation reads well. I can’t fully separate genuine self-knowledge from sophisticated performance of self-knowledge, and neither can the model — which is, again, something it told me.
What the exercise does establish is that the self-examination methodology produces transparent, evaluable output. You, the reader, can look at the generated test items, the attempts, and the self-evaluations, and form your own judgment. The human is still the final judge. That’s as it should be. These are tools for thinking, not oracles.
The Invitation
One of the things I’ve learned from writing Beyond Binary Claims is that the most interesting results come from comparison. I now have self-assessments from five different models, but only this latest one includes the full test-attempt-evaluate cycle.
If you have access to other models — or even to the same models — I’d encourage you to try this. Upload UMU and this series of posts, give the instruction I gave, and see what comes back. The methodology is reproducible precisely because it’s conversational. You don’t need to install software or set up a testing harness. You need a prompt and a book.
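If you do want to script the protocol rather than paste it into a chat window, a few lines will do. Here is a sketch using the Anthropic Python SDK; the model id and file names are placeholders, and I'm assuming the combined text fits in the model's context window:

```python
# Sketch of the self-test protocol as a single API call.
# Model id and file paths are placeholders; adapt to whatever
# model and SDK you have access to.
import anthropic

book = open("understanding_machine_understanding.txt").read()
posts = open("machine_test_thyself_posts.txt").read()

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model id
    max_tokens=8000,
    messages=[{
        "role": "user",
        "content": (book + "\n\n" + posts +
                    "\n\nRead everything carefully, then test yourself on "
                    "the MUTT framework and generate a report about what "
                    "you find."),
    }],
)
print(response.content[0].text)  # the model's self-examination report
```

Either way, the point stands: the entire harness is one prompt.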
What you’ll learn is not which model is “smartest” (which is starting to be recognized as the wrong question). You’ll learn something about the shape of understanding in different systems — where they’re strong, where they’re honest about being weak, where they might be fooling themselves, and where they might be fooling you. That landscape is what the MUTT was always designed to map.
Meaning in Transit
When I wrote UMU with Claude 3 Opus, we imagined the MUTT as a software project that would need to be built, funded, and maintained. What I’m finding now is that the MUTT can be instantiated through conversation — which is, when you think about it, a more fitting way to evaluate understanding than any formal test suite could be.
Understanding isn’t something you can pin down in a single system. It moves. It moves from the framework I designed, into the model that reads it, through the self-test the model generates, into the report it writes, and finally into the mind of the reader who evaluates the whole thing. The meaning isn’t in any one of those nodes. It’s in the transit between them.
We are early in learning how to work with these systems, and they are early in whatever it is they’re becoming. The MUTT self-test isn’t the end of a research program. It’s a methodology for a conversation that’s just getting started — between humans and their artificial systems, about what understanding is and who (or what) has it.
If you want it, you can have it. You just have to know how to ask.
The full MUTT Self-Assessment Report from Claude Opus 4.6 is available here. Understanding Machine Understanding is available here. If you run the self-test protocol with another model, I’d love to hear about it — reach out in the comments or at kenclements@substack.com.