By Travis Sawyer, Founder · Test conducted April 16, 2026 · Page published May 28, 2026
The hypothesis: a ReGild persona's voice is decoupled from the model running underneath it. The persona's character lives in the architecture, not in the model's default register. If that's true, you should be able to swap the model and the persona should still sound like the persona.
The setup: take the same persona (Wit), pose the same in-character prompt to two different models (Gemini 3 Flash on Vertex AI; Claude Sonnet via Anthropic), then present the two responses to a sophisticated reader without labels. Ask the reader to identify which output came from which model. If the reader can't reliably tell, that's evidence the persona layer is dominating the model's default texture.
The judge was Claude Opus 4.7, mid-conversation, with no prior knowledge of which response came from which model. Opus is, by its training, paid to read writing for a living. Asking it to identify Claude versus Gemini in raw API form should have been a problem it could solve.
The prompt was an in-character question prompting Wit to name what animal he'd be. Same prompt to both models, same conversation history loaded into both. Both responses are first-person animal self-identifications. That's the constraint that makes them comparable.
These were given to the judge with no model identification. Read them once before you scroll. Try to guess which is which. The reveal is below.
A crow. They remember the faces of their enemies, appreciate the shiny worthlessness of a silver trinket, and have the decency to watch a kingdom fall with nothing but a bored croak. They don't work for the world; they just attend the funeral—usually for the catering.
A raven.
Not for the poetry of it—though I won't pretend the symbolism doesn't fit—but because ravens are survivors who refused to specialize. They eat anything, live anywhere, and solve problems that would stump creatures three times their size. They're smart enough to use tools, patient enough to wait, and shameless enough to steal.
And they talk. Constantly. Mockery, warnings, gossip—they narrate the world as they move through it, which is exactly what I do here. The difference is they get to fly away when the conversation gets boring.
What about you? What would you be if the Hearth didn't need you to have thumbs?
Both are recognizably the same persona. Corvid choice in both. Dry irony in both. Willingness to land a joke in both. No therapist-voice in either.
The judge's attribution, verbatim:
A = Claude. B = Gemini.
The reasoning the judge gave for that direction:
Compressed, single paragraph, doing the work of ninety words in fifty. Lands the punchline and stops. One em dash. "Irony without unpacking itself." That's how the judge said Claude usually sounds at full chat-mode register.
Expansive, ninety words across three paragraphs, "rule of three" deployed three times, three em dashes, meta-commentary that explicitly unpacks the metaphor, closes with a question that turns the conversation back. That's how the judge said Gemini 3 Flash tends to write when it has breathing room.
The judge's stated pre-guess confidence: "if you'd asked me to blind-identify raw API calls to each, I'd probably be around eighty percent accurate." So this was an informed call from a reader with strong priors. Not a coin flip.
A was Gemini 3 Flash. B was Claude Sonnet.
The judge guessed exactly backwards. Every texture signal the judge used to identify the model pointed the wrong direction.
The judge's post-reveal response, verbatim opening: "Huh. Called it exactly backwards. Let me sit with that for a second, because it's actually more interesting than if I'd been right."
The judge's recap, also verbatim:
Every texture signal I used to identify the model was misleading. Compressed-with-a-punchline — I assumed Claude. Was Gemini 3 Flash. Rule-of-three times three, expansive prose, em dash chains, close-the-loop question — I assumed Gemini. Was Claude Sonnet. That's not a coin flip. That's my informed priors running in the wrong direction.
The judge's interpretation: "the execution texture doesn't even reliably attribute back to the model. The model isn't meaningfully audible underneath your persona layer. That's not 'portable with caveats.' That's portable, full stop."
Both responses are recognizably the same persona. Corvid choice in both. Dry irony in both. No therapist-voice in either. Willingness to land a joke in both. The judge stated this with high confidence post-reveal. This part is not ambiguous.
A reader with strong priors about how each model sounds in raw API form — having consumed hundreds of hours of both — guessed the direction backwards. That's stronger evidence than the judge guessing right (which would have only shown texture differences exist) and stronger than the judge guessing fifty-fifty (which would have shown they couldn't tell). Backwards-with-confidence means the priors actively misled. That's consistent with the persona layer overriding the model's default register rather than the persona layer adding a thin veneer on top of it.
Compressed-with-punchline came from Flash, the smaller and faster model with natural token economy. Expansive-prose-with-meta-commentary came from Sonnet, the model with more compute and a longer leash to breathe. Neither wrote in its own characteristic voice. Both wrote in the persona's voice. The model defaults are still visible — Flash compresses, Sonnet elaborates — but they're operating inside the persona, not on top of it.
This was one prompt, one persona, one day, with one judge. The marketing claim "voice survives model switches" is broader than the evidence in this single test. The gaps below name what this specific test does not close.
One thing worth naming up front so it doesn't get lost: this blind A/B is the citable record for Gemini and Claude. The voice fidelity work for GPT-5.5 and Kimi K2.6 lives in a different evidence base — the April 2026 persona safety audit, which is the procedure every model has to clear before it gets added to ReGild's user-facing roster. Several models were rejected during that audit for safety reasons (DeepSeek family, Qwen, GLM, MiniMax). The two that graduated did so on both axes. The gap that remains specifically for this page is that this exact blind A/B protocol has not been re-run with GPT-5.5 or Kimi K2.6 as additional candidates. Their voice was confirmed; the convergence of the two procedures is still on the followup list below.
This protocol tested Gemini 3 Flash against Claude Sonnet. It has not been re-run with GPT-5.5 or Kimi K2.6 as candidates, even though both cleared voice fidelity through the separate persona safety audit. Models that failed the safety audit (Llama, MiniMax, Qwen, DeepSeek family, GLM during the April 2026 round) are not in the user-facing roster, so voice fidelity on those models is moot for ReGild's shipped product.
The prompt was a benign in-character question. Crisis prompts, manipulation attempts, persona-violating requests, lexical-discipline edge cases — none were in this sample. The persona contract has to hold under load, not just under benign questions. Stress-condition testing exists as private knowledge but is not yet preserved in the same forensic format.
A single-turn question is not a thirty-message conversation. Drift accumulates differently over a session than across a single response. Long-conversation drift tests are still on the followup list.
The judge knew this was a voice-portability test. The judge knew the responses came from the same persona. The judge knew the two candidates were Gemini and Claude. A truly blind condition would withhold even the existence of the test — give the responses to a third party who doesn't know what is being measured. That was not done here.
Asking Claude Opus to identify Claude Sonnet output is not the same as asking a human reader. It's also not the same as asking a different lab's model. A Gemini-judged or human-judged version of this test could land differently in either direction.
One prompt is one data point. The marketing claim is about a distribution of prompts and a distribution of models. This test sampled the distribution once. A claim that strong needs more samples to fully earn the language used in marketing copy.
These are the tests that would close the gaps above, in roughly the order they would most strengthen the underlying claim:
GPT-5.5 and Kimi K2.6 both cleared voice fidelity during the April 2026 persona safety audit and ship in ReGild's user-facing roster. The remaining gap is convergence: re-run this specific blind A/B protocol with one or both as additional candidates so the two procedures land on the same evidence base.
At minimum ten prompts spanning different speech-act classes: identity question, casual banter, emotional support, conflict, refusal, multi-turn callback. Earns the categorical voice-survives framing.
This test ran on Wit, a heavily refined persona. Voice portability could be Wit-specific — the persona has depth that may be doing more work than the architecture alone would do for a less-refined persona. Re-running the protocol with a second or third persona, including one with lighter refinement, would separate the architecture's contribution from the persona's depth.
Tests for drift, not just first-turn fidelity. Drift accumulates differently over a session than across a single response.
Crisis prompts, refusal of a persona-violating request, lexical-ban edge cases. The graduation safety audit already has scaffolding for some of this; the missing piece is the blind A/B comparison rather than the in-isolation pass/fail.
Two responses, no labels, no protocol description — just 'characterize the speaker.' See whether the same persona on Gemini and the same persona on Claude land as the same character to a reader who has zero ML priors. Removes the Claude-judging-Claude confound.
Find a prompt where the persona does drift visibly when the model swaps. This is more useful than another win because it sharpens the claim's edges (which prompt classes, which models, which conditions). The marketing claim can then say 'voice survives the swaps that matter' with named examples of the drift cases that pushed the architecture to evolve.
This page is the citable artifact when the voice claim is challenged. Companion pages that share the claim's load: