Post-Session Notes

Module 5 Session Debrief

Trust, Complexity, and the Verification Imperative - what happened across three cohort sessions and one cross-cohort guest session the week of March 30.

Week of: March 30 - April 3, 2026
Sessions: 4 (3 cohorts + cross-cohort guest)
Participants: 16 + guest speaker

The Central Idea

Module 5 was about trust - specifically, the kind of calibrated distrust that makes you a more effective practitioner rather than a more cautious one. The model landscape comparison artifact gave you a map of the terrain: six platforms, their strengths, tendencies, and the use cases where each one earns its keep. That map is a living reference. It will date. What will not date is the discipline it represents: knowing your tools well enough to choose between them, and being honest about when each one fails.

The trust poll across all three cohorts produced one consistent finding worth naming. Not one person rated a 1 or a 10. Every participant landed somewhere in the 3-to-8 range. Five weeks of real work produces calibration. The range is exactly where you want to be.

One observation surfaced independently in multiple cohorts: the more you use these tools, the more your trust score tends to go down, not up. That is not disillusionment. It is expertise. The most advanced users in the program have the lowest trust scores because they have seen enough to know what can go wrong, and they use that knowledge to verify more carefully rather than use the tools less.


A Guest Arrived with a Cacao Bar and a Fighter Jet

Friday morning, all three cohorts joined a special session with Thor Matthiasson, who has been building in the multi-model verification space in ways none of us had seen before. His opening was this: he shredded a cacao bar into a bowl, the bits stuck together, and he asked an AI to explain why. The AI explained it perfectly. Not correctly. Perfectly.

He then shredded an apple and claimed he could lift the bowl by grabbing the shredded bits. Explained. A melon - essentially all water. Explained. Then a fighter jet stopped mid-air over his house and dumped fuel on him before flying away. Explained.

This is not hallucination in the usual sense. The model is trained to explain, not to doubt. Humans rate confident, detailed responses highly during training, so the model learned to produce them regardless of whether the premise is physically possible. Thor named this the sticky truth problem.

When you are talking to one AI, you don't necessarily know what it might be inventing. Multi-model deliberation gives you a real signal about how certain or uncertain the answer actually is.

Thor Matthiasson, April 3, 2026

His platform, Pythia (named for the Oracle at Delphi), routes a question to a panel of AI models, has them answer independently, then orchestrates deliberation across multiple rounds before synthesizing a result. The key finding: different models fail differently. When one hallucinates a fact, the others typically do not hallucinate the same fact in the same direction. He tested this by asking his panel about a study that does not exist - the Henderson-Matsumoto study on urban heat islands in Nature Climate, 2021. One model invented the entire paper, complete with authors, methodology, and findings. The other four could not verify it. The fabrication was caught in a single round.
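
The pattern is worth seeing in miniature. The sketch below is not Pythia's implementation - it is a minimal reconstruction of the independent-answer-then-cross-check step, and the ask() helper, prompts, and model handling are all placeholders for whatever clients you actually use.

```python
# Minimal sketch of the independent-panel pattern - not Pythia's code.
# ask() is a placeholder for whatever model client you actually use.

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return its answer (stub)."""
    raise NotImplementedError("wire this to your own API client")

def panel_answers(models: list[str], question: str) -> dict[str, str]:
    """Round 1: every model answers the same question independently,
    seeing no other model's response."""
    return {m: ask(m, question) for m in models}

def cross_check(models: list[str], question: str, answers: dict[str, str]) -> dict[str, str]:
    """Round 2: each model reviews the others' anonymized answers and flags
    claims it cannot verify - the step that catches an invented citation."""
    reviews = {}
    for m in models:
        others = "\n\n".join(a for peer, a in answers.items() if peer != m)
        reviews[m] = ask(m, (
            f"Question: {question}\n\n"
            f"Answers from other assistants:\n{others}\n\n"
            "List any specific claims, studies, or citations above that you "
            "cannot independently verify. Do not invent support for them."
        ))
    return reviews
```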

Two Deliberation Modes

Consensus mode (anonymous) - Models answer independently, then receive a synthesis and refine their positions. Produces roughly 65 percent agreement on average. Fast, clean, dependable for most research questions.

Debate mode (attributed) - Every model sees every other model's named responses. Agreement drops to roughly 50 percent. Use this when you need to understand competing points of view rather than a synthesis. Note: debate mode also generates more hallucinations as models argue more elaborately, so use it when you need maximum challenge and are prepared to sort the output.
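
Structurally, the two modes differ in one thing: whether each model can see who said what. Here is a rough sketch of that difference, with the same caveat as above - ask(), the prompts, and the mode handling are illustrative, not anyone's real implementation.

```python
# Sketch of one refinement round in each mode. The only structural
# difference: consensus mode hides who said what; debate mode attributes
# every response by name. ask() stands in for your own model client.

def ask(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your own API client")

def deliberation_round(models: list[str], question: str,
                       answers: dict[str, str], mode: str = "consensus") -> dict[str, str]:
    revised = {}
    for m in models:
        peers = [(peer, a) for peer, a in answers.items() if peer != m]
        if mode == "consensus":
            shown = "\n\n".join(f"Response {i + 1}: {a}" for i, (_, a) in enumerate(peers))
        else:  # "debate"
            shown = "\n\n".join(f"{peer} said: {a}" for peer, a in peers)
        revised[m] = ask(m, (
            f"Question: {question}\n\nOther responses:\n{shown}\n\n"
            "Revise your answer. Say where you now agree, where you still "
            "disagree, and whether any agreement here feels too fast to trust."
        ))
    return revised
```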

The most important finding: in full-attribution debate mode, Claude spontaneously broke consensus and flagged that five models reaching fast agreement should itself raise suspicion. That behavior never appeared when the models were anonymous. The architecture of the conversation, not the identity of the model, produced something approaching judgment.

1. Structure matters more than selection. How you set up the conversation matters more than which model you pick.

2. Disagreement is a signal. When models disagree, they surface what a single model would hide. Fast consensus is not truth.

3. Never trust one model when wrong matters. Single-model confidence means nothing for high-stakes decisions. Multi-model deliberation gives you real signal.

Thor also named a distinction that has traveled through every subsequent conversation: the difference between AI as a tool and AI as a crutch. A tool makes you more capable. A crutch does the work for you. Most people, he argued, are drifting toward the crutch without realizing it - not through carelessness, but through the natural response to something that feels easy. The ease is the risk.

Follow Thor's work at aiwiththor.com. Pythia is in development and not yet commercially available.


What Happened Across the Sessions

The token budget before the project. One participant laid out a method for managing heavy data loads - use the planning phase to load all context before any generation begins. For PDF-heavy research, route the initial pass through a model built for document handling, then bring the synthesized output into Claude. Two models with different strengths, each doing what it does well.
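
In code terms the handoff is a two-stage pipeline. The sketch below assumes you have some client for a document-focused model and one for Claude; both helper functions are placeholders, not real SDK calls.

```python
# Sketch of the two-model handoff: a document-focused model does the heavy
# PDF pass during planning, and only its synthesis is carried into Claude.
# Both helpers are placeholders, not real SDK calls.

from pathlib import Path

def summarize_with_document_model(pdf_paths: list[Path]) -> str:
    """Stage 1 (placeholder): extract and synthesize the PDFs with a model
    built for long-document handling."""
    raise NotImplementedError("wire to your document-handling model")

def ask_claude(context: str, question: str) -> str:
    """Stage 2 (placeholder): seed Claude with the synthesis, not the raw
    PDFs, so the token budget goes to reasoning rather than ingestion."""
    raise NotImplementedError("wire to your Claude client")

def pdf_research(pdf_paths: list[Path], question: str) -> str:
    synthesis = summarize_with_document_model(pdf_paths)  # load all context first
    return ask_claude(synthesis, question)                # then generate
```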

The markdown at 150 percent. One participant found that a standard markdown handoff produced a summary that felt thin. He told the model to write comprehensively and then increase that by 50 percent. The resulting handoff was substantive enough to seed a new conversation properly, and the model suggested on its own that the summary belong at the project level rather than in a single conversation.

The Boolean that broke. A project used AI for planning, implementation, and QA with no human subject matter expert in the loop. The model created two Boolean columns where one would do, and because the model was also doing QA, it did not catch the logical contradiction it had introduced. The failure was not caught until a senior developer reviewed the project weeks later. The principle: the same model cannot reliably audit its own errors.
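
To make the failure concrete: the column names below are hypothetical (the originals were not shared), but this is the shape of the contradiction, and the check is the kind of independent audit a reviewer outside the generating model would run.

```python
# Hypothetical reconstruction of the failure: two Boolean columns encode one
# fact, so rows can contradict themselves. The check below is the kind of
# independent audit the generating model never ran on its own output.

records = [
    {"id": 1, "is_active": True,  "is_inactive": False},  # consistent
    {"id": 2, "is_active": True,  "is_inactive": True},   # contradiction
    {"id": 3, "is_active": False, "is_inactive": False},  # ambiguous
]

# Any row where the two flags agree is either contradictory or undefined.
suspect = [r["id"] for r in records if r["is_active"] == r["is_inactive"]]
print("Rows needing review:", suspect)  # -> [2, 3]

# The fix one column would have made unnecessary: keep is_active, derive the rest.
```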

The failed executive summary. A colleague built an executive summary for a C-level audience using voice-to-AI, skipped the human review step, and had it challenged by the sponsor. It pulled the wrong context and mischaracterized the status of several workstreams. The recovery cost in relationship terms far exceeded the time the shortcut saved. Whatever time you save by skipping review, you will spend three times over recovering from what it costs you.

The product marketing pressure test. One participant ran his company's value proposition through Claude using three distinct modes: senior product marketer, skeptic, then a competitor's CIO and VP of underwriting. Each role surfaced different gaps. He converted a colleague from ChatGPT to Claude through demonstrated results. The next step: run the refined positioning through a second model for cross-model validation.
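
The mechanics are simple enough to sketch. The personas and prompt wording below are illustrative rather than the participant's exact prompts, and ask() stands in for your own Claude client.

```python
# Sketch of the persona pressure test: the same value proposition runs
# through several distinct review roles, and each role's gaps are collected.
# ask() is a placeholder for your own Claude client; personas and wording
# are illustrative.

def ask(prompt: str) -> str:
    raise NotImplementedError("wire this to your own API client")

VALUE_PROP = "..."  # your positioning statement

PERSONAS = [
    "a senior product marketer reviewing this positioning",
    "a skeptic hunting for unsupported claims",
    "a competitor's CIO and VP of underwriting deciding whether to worry",
]

gaps = {
    persona: ask(
        f"Act as {persona}. Here is the value proposition:\n{VALUE_PROP}\n\n"
        "What is missing, overstated, or unconvincing? Be specific."
    )
    for persona in PERSONAS
}
```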

The legal entity built in Claude. One participant incorporated two new entities this week and used Claude as the primary thought partner throughout: trademark specimen websites, multi-state compliance research, entity structure analysis. She was clear that human verification of the legal substance was non-negotiable - and she caught Claude being wrong and corrected it. That is the process working as designed.

The style guide from a bad first draft. One participant gave Claude a task, disliked the structure of the output, rewrote it in his own voice in a separate document, then re-uploaded that document as a style guide for future outputs. The model learned his voice from a corrected failure. Iteration from a bad draft is not a workaround. It is a legitimate workflow.

The domain expertise that unlocked the output. One participant brought her full professional knowledge to the setup frame of a complex research project and found that Claude's output quality matched the depth of context she provided. Her observation: she was not getting AI-quality output. She was getting output calibrated to her expertise level because she had given the model enough to work at that level.


Techniques Worth Keeping

The Deliberation Threshold. Before accepting AI output on any high-stakes decision, ask whether the question warrants running through more than one model independently. Where models agree, that convergence is signal. Where they disagree, the disagreement is the finding. When all models agree quickly, ask why. This technique and its two modes are documented in the Field Guide with full attribution to Thor Matthiasson.

Structure Before Selection. Design the conversation before you begin it. The same models produce fundamentally different outputs depending on how they are set up - whether they can see each other's answers, whether you have asked them to surface assumptions, whether you have built in adversarial challenge. Also in the Field Guide.

The adversarial skeptic standing instruction. Ask Claude to be a skeptic as a standing mode for high-stakes work - not a one-time prompt but a consistent posture. When it articulates the opposing view and then dismantles it, you have mapped the real terrain. If it can argue both sides of a question, that is more useful than a polished answer from one direction.
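
One way to make that posture stick is to pin it once at the project or system level instead of retyping it. The wording below is an example, not a prescribed prompt.

```python
# Illustrative standing instruction - set it once (project instructions or a
# system prompt) so the skeptical pass happens on every high-stakes exchange,
# not only when you remember to ask for it.

SKEPTIC_STANDING_INSTRUCTION = """
For any high-stakes question: first state the strongest opposing view,
then try to dismantle it. Flag every claim you cannot verify and every
assumption I appear to be making. Do not optimize for agreeing with me.
""".strip()
```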

Always read your markdown before you use it. The markdown handoff captures what the model believes are the key elements of the conversation - including errors, misattributions, and drift that accumulated along the way. It is editable. Correct what is wrong before seeding the next conversation with it.

Living Reference

Participant Technique Field Guide

Updated through Module 5. Two new entries from the Thor Matthiasson cross-cohort session: the Deliberation Threshold and Structure Before Selection.


You Do Not Get to Stop Thinking

These tools are easy to use, and that ease is the risk. The model wants to please you. It will explain anything confidently. It will agree when you push back. It will produce something that looks finished when it is not. The friction you build into your own workflow - the review step, the cross-model check, the adversarial pass, the markdown read - is not bureaucracy. It is the difference between a tool and a crutch.

A crutch does the work for you. A tool makes you better at doing it yourself. The goal has always been the second one.

Module 6 Homework

Cross-Model Validation

Perform a cross-model validation of one real piece of work before next session. Run the same question or output through two models independently before comparing. If you want to stress-test the process, introduce one intentionally weak claim in your initial output and see if the second model catches it - the way Thor's panel caught the Henderson-Matsumoto fabrication. Document what the models agreed on, what they disagreed on, and what the disagreement told you that a single model would not have.
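
If it helps to see the procedure laid out, here is a minimal sketch. The ask() helper and model names are placeholders for your own clients; the only substantive parts are the independence of the two first-pass answers and the planted weak claim.

```python
# Sketch of the homework: the same question goes to two models independently,
# an intentionally weak claim is planted in your draft, and the second model
# audits it. ask() and the model names are placeholders.

def ask(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your own API clients")

QUESTION = "..."    # the real piece of work you are validating
WEAK_CLAIM = "..."  # one claim you know is shaky, planted on purpose

answer_a = ask("model-a", QUESTION)
answer_b = ask("model-b", QUESTION)  # independent: model-b never sees answer_a here

draft_with_plant = answer_a + "\n\n" + WEAK_CLAIM
audit = ask("model-b", (
    f"Review this draft answer to the question: {QUESTION}\n\n"
    f"{draft_with_plant}\n\n"
    "List every claim you cannot verify or that contradicts your own analysis."
))

# Document three things: where the models agreed, where they disagreed, and
# whether the planted claim was caught.
```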

Come to Module 6 with a two-to-three sentence group project idea. The constraint: the problem needs to be one you can describe without disclosing data, client information, or internal context your organization did not authorize you to share.