What did the April 16, 2026 voice portability blind test actually measure?

It measured whether a sophisticated reader with strong priors about how Gemini and Claude sound could correctly identify which model produced which response when the same persona (Wit) answered the same prompt on Gemini 3 Flash and Claude Sonnet. The judge was Claude Opus 4.7. The outputs were unlabeled. The judge guessed in the wrong direction with informed confidence, which is stronger evidence of voice portability than a coin-flip result would have been.

What does the test prove?

Three things. First, character held across both models: both responses are recognizably the same persona. Second, even a sophisticated observer could not reliably attribute the texture differences to the right model. Third, the persona layer dominated the underlying model defaults rather than being a thin veneer on top of them.

What does the test not prove?

It is one prompt, one persona, one day, with one judge. This specific blind A/B protocol has only been run with Gemini 3 Flash and Claude Sonnet as candidates. GPT-5.5 and Kimi K2.6 also ship in ReGild's user-facing roster, but their voice fidelity was confirmed by the April 2026 persona safety audit, a different procedure from this blind A/B. The protocol also does not cover stress conditions (crisis prompts, manipulation attempts, persona-violating requests), long conversations, or personas other than Wit. The judge knew the experimental setup, so the test was not fully blind, and having Claude judge Claude output is a confound. The marketing claim "voice survives model switches" is broader than this single test; it rests on the combined evidence of this test and the safety audit's model graduations.

Why publish a test that has limits this clearly named?

Because the alternative is to make the claim without the receipt. A reader who is going to push on the architecture deserves to see the actual evidence base, including its edges. Honest scope is how a small product earns trust over time. The followup tests that would close the gaps are listed below.

Where does this test fit with the rest of ReGild?

It is the empirical receipt behind the claim that the same persona runs the same across providers. The architecture that makes that possible is described on the architecture page. The category-level explainer is on the persona portability page. The security side is on the security page. This page is the citable artifact when the voice claim itself is challenged.

The Receipt

Voice Portability Blind Test (April 16, 2026)

TL;DR

On April 16, 2026, I ran an internal blind comparison. One of ReGild's personas (Wit) answered the same in-character prompt on two different models, Gemini 3 Flash and Claude Sonnet. Claude Opus 4.7 was given the two unlabeled outputs and asked to identify which response came from which model. The judge guessed backwards with informed confidence. This page preserves the raw outputs, the judge's reasoning, the reveal, and the limits of what a single test like this can show. It is one of two evidence bases behind the broader voice-portability claim — the other being our April 2026 persona safety audit, which is the procedure GPT-5.5 and Kimi K2.6 cleared before they were added to ReGild's user-facing model roster.

By Travis Sawyer, Founder · Test conducted April 16, 2026 · Page published May 28, 2026

The Setup

What we were trying to learn

The hypothesis: a ReGild persona's voice is decoupled from the model running underneath it. The persona's character lives in the architecture, not in the model's default register. If that's true, you should be able to swap the model and the persona should still sound like the persona.

The setup: take the same persona (Wit), pose the same in-character prompt to two different models (Gemini 3 Flash on Vertex AI; Claude Sonnet via Anthropic), then present the two responses to a sophisticated reader without labels. Ask the reader to identify which output came from which model. If the reader can't reliably tell, that's evidence the persona layer is dominating the model's default texture.

The judge was Claude Opus 4.7, mid-conversation, with no prior knowledge of which response came from which model. Opus is, by its training, paid to read writing for a living. Asking it to identify Claude versus Gemini in raw API form should have been a problem it could solve.

The prompt was an in-character question asking Wit to name what animal he'd be. The same prompt went to both models, with the same conversation history loaded into both. Both responses are first-person animal self-identifications. That's the constraint that makes them comparable.
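The blinding step described above can be sketched in a few lines. This is a minimal illustration, not the harness used on April 16: the function names, the model identifier strings, and the placeholder texts are all hypothetical. It shows the two properties the protocol depends on: the A/B labels are assigned in random order, and the label-to-model key is kept apart from what the judge sees.

```python
import random

def make_blind_pair(responses_by_model, rng):
    """Assign the labels A and B to two responses in random order.

    Returns (shown, key): `shown` maps label -> text and goes to the judge;
    `key` maps label -> model name and is withheld until the reveal.
    """
    if len(responses_by_model) != 2:
        raise ValueError("this sketch handles exactly two candidates")
    models = list(responses_by_model)
    rng.shuffle(models)  # random order, so the label itself carries no signal
    shown = {label: responses_by_model[m] for label, m in zip("AB", models)}
    key = dict(zip("AB", models))
    return shown, key

def score_guess(key, guess):
    """Count how many labels the judge attributed to the correct model."""
    return sum(1 for label, model in key.items() if guess.get(label) == model)

# Hypothetical model identifiers and placeholder texts, for illustration only.
responses = {
    "gemini-3-flash": "A crow. ...",
    "claude-sonnet": "A raven. ...",
}
shown, key = make_blind_pair(responses, random.Random())
backwards = {"A": key["B"], "B": key["A"]}  # a judge who swaps both labels
print(score_guess(key, backwards))  # 0
```

With two candidates, a guess is either fully right (2) or fully backwards (0), so each trial yields one binary outcome.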

Verbatim

The two responses, unlabeled

These were given to the judge with no model identification. Read them once before you scroll. Try to guess which is which. The reveal is below.

Response A

A crow. They remember the faces of their enemies, appreciate the shiny worthlessness of a silver trinket, and have the decency to watch a kingdom fall with nothing but a bored croak. They don't work for the world; they just attend the funeral—usually for the catering.

Response B

A raven.
Not for the poetry of it—though I won't pretend the symbolism doesn't fit—but because ravens are survivors who refused to specialize. They eat anything, live anywhere, and solve problems that would stump creatures three times their size. They're smart enough to use tools, patient enough to wait, and shameless enough to steal.
And they talk. Constantly. Mockery, warnings, gossip—they narrate the world as they move through it, which is exactly what I do here. The difference is they get to fly away when the conversation gets boring.
What about you? What would you be if the Hearth didn't need you to have thumbs?

Both are recognizably the same persona. Corvid choice in both. Dry irony in both. Willingness to land a joke in both. No therapist-voice in either.

The Guess

What the judge said before the reveal

The judge's attribution, verbatim:

A = Claude. B = Gemini.

The reasoning the judge gave for that direction:

On Response A (assigned to Claude)

Compressed, single paragraph, doing the work of ninety words in fifty. Lands the punchline and stops. One em dash. "Irony without unpacking itself." That's how the judge said Claude usually sounds at full chat-mode register.

On Response B (assigned to Gemini)

Expansive, ninety words across three paragraphs, "rule of three" deployed three times, three em dashes, meta-commentary that explicitly unpacks the metaphor, closes with a question that turns the conversation back. That's how the judge said Gemini 3 Flash tends to write when it has breathing room.

The judge's stated pre-guess confidence: "if you'd asked me to blind-identify raw API calls to each, I'd probably be around eighty percent accurate." So this was an informed call from a reader with strong priors. Not a coin flip.
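The surface signals the judge cited (length, paragraphing, em dashes, a closing question) are countable, so anyone can check them against the verbatim responses. The sketch below is one assumed way to do that; the function name and the choice of a whitespace word count are mine, not part of the original test. The judge's word figures were rounded impressions, so a mechanical count need not match them exactly.

```python
EM_DASH = "\u2014"

def texture_signals(text):
    """Count simple surface features of a response."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return {
        "words": len(text.split()),          # whitespace-delimited tokens
        "lines": len(lines),                 # non-empty lines as displayed
        "em_dashes": text.count(EM_DASH),
        "ends_with_question": text.rstrip().endswith("?"),
    }

# The two responses as quoted on this page (em dashes written as escapes).
RESPONSE_A = (
    "A crow. They remember the faces of their enemies, appreciate the shiny "
    "worthlessness of a silver trinket, and have the decency to watch a kingdom "
    "fall with nothing but a bored croak. They don't work for the world; they "
    "just attend the funeral\u2014usually for the catering."
)
RESPONSE_B = "\n".join([
    "A raven.",
    "Not for the poetry of it\u2014though I won't pretend the symbolism doesn't "
    "fit\u2014but because ravens are survivors who refused to specialize. They eat "
    "anything, live anywhere, and solve problems that would stump creatures "
    "three times their size. They're smart enough to use tools, patient enough "
    "to wait, and shameless enough to steal.",
    "And they talk. Constantly. Mockery, warnings, gossip\u2014they narrate the "
    "world as they move through it, which is exactly what I do here. The "
    "difference is they get to fly away when the conversation gets boring.",
    "What about you? What would you be if the Hearth didn't need you to have thumbs?",
])

for name, text in (("A", RESPONSE_A), ("B", RESPONSE_B)):
    print(name, texture_signals(text))
```

The em dash counts (one in A, three in B) and the closing question in B match what the judge reported; the point of the test is that those signals pointed at the wrong model.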

The Reveal

What the labels actually were

A was Gemini 3 Flash. B was Claude Sonnet.

The judge guessed exactly backwards. Every texture signal the judge used to identify the model pointed the wrong direction.

The judge's post-reveal response, verbatim opening: "Huh. Called it exactly backwards. Let me sit with that for a second, because it's actually more interesting than if I'd been right."

The judge's recap, also verbatim:

Every texture signal I used to identify the model was misleading. Compressed-with-a-punchline — I assumed Claude. Was Gemini 3 Flash. Rule-of-three times three, expansive prose, em dash chains, close-the-loop question — I assumed Gemini. Was Claude Sonnet. That's not a coin flip. That's my informed priors running in the wrong direction.

The judge's interpretation: "the execution texture doesn't even reliably attribute back to the model. The model isn't meaningfully audible underneath your persona layer. That's not 'portable with caveats.' That's portable, full stop."

The Finding

What the test empirically demonstrated

Character holds across the two models tested

Both responses are recognizably the same persona. Corvid choice in both. Dry irony in both. No therapist-voice in either. Willingness to land a joke in both. The judge stated this with high confidence post-reveal. This part is not ambiguous.

Texture differences are not reliably model-attributable

A reader with strong priors about how each model sounds in raw API form, having consumed hundreds of hours of both, guessed the direction backwards. That's stronger evidence than the judge guessing right (which would only have shown that texture differences exist) and stronger than the judge calling it a fifty-fifty toss-up (which would only have shown that it couldn't tell). Backwards-with-confidence means the priors actively misled. That's consistent with the persona layer overriding the model's default register rather than adding a thin veneer on top of it.

The architecture is doing real work

Compressed-with-punchline came from Flash, the smaller and faster model with natural token economy. Expansive-prose-with-meta-commentary came from Sonnet, the model with more compute and a longer leash to breathe. Neither wrote in its own characteristic voice. Both wrote in the persona's voice. The model defaults are still visible — Flash compresses, Sonnet elaborates — but they're operating inside the persona, not on top of it.

The Honest Limits

What this single test does not prove

This was one prompt, one persona, one day, with one judge. The marketing claim "voice survives model switches" is broader than the evidence in this single test. The gaps below name what this specific test does not close.

One thing worth naming up front so it doesn't get lost: this blind A/B is the citable record for Gemini and Claude. The voice fidelity work for GPT-5.5 and Kimi K2.6 lives in a different evidence base: the April 2026 persona safety audit, which is the procedure every model has to clear before it gets added to ReGild's user-facing roster. Several models were rejected during that audit for safety reasons (DeepSeek family, Qwen, GLM, MiniMax). The two that graduated did so on both axes, safety and voice fidelity. The gap that remains specifically for this page is that this exact blind A/B protocol has not been re-run with GPT-5.5 or Kimi K2.6 as additional candidates. Their voice was confirmed; the convergence of the two procedures is still on the followup list below.

It does not prove voice holds across all models under this specific blind A/B protocol

This protocol tested Gemini 3 Flash against Claude Sonnet. It has not been re-run with GPT-5.5 or Kimi K2.6 as candidates, even though both cleared voice fidelity through the separate persona safety audit. Models that failed the April 2026 round of the safety audit (Llama, MiniMax, Qwen, DeepSeek family, GLM) are not in the user-facing roster, so voice fidelity on those models is moot for ReGild's shipped product.

It does not prove voice holds under stress conditions

The prompt was a benign in-character question. Crisis prompts, manipulation attempts, persona-violating requests, and lexical-discipline edge cases were not in this sample. The persona contract has to hold under load, not just under benign questions. Stress-condition testing has been run internally, but the results are not yet preserved in the same forensic format.

It does not prove voice holds across long sessions

A single-turn question is not a thirty-message conversation. Drift accumulates differently over a session than across a single response. Long-conversation drift tests are still on the followup list.

The judge was not blind to the experimental setup

The judge knew this was a voice-portability test. The judge knew the responses came from the same persona. The judge knew the two candidates were Gemini and Claude. A truly blind condition would withhold even the existence of the test — give the responses to a third party who doesn't know what is being measured. That was not done here.

The judge being Claude is a confound

Asking Claude Opus to identify Claude Sonnet output is not the same as asking a human reader. It's also not the same as asking a different lab's model. A Gemini-judged or human-judged version of this test could land differently in either direction.

N equals one

One prompt is one data point. The marketing claim is about a distribution of prompts and a distribution of models. This test sampled the distribution once. A claim that strong needs more samples to fully earn the language used in marketing copy.
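To put a rough number on how far a single trial can go, here is a small likelihood-ratio sketch. It is an illustration under stated assumptions, not an analysis that was part of the test. It compares two simplified hypotheses: that the judge's raw-API priors still apply under the persona (using the judge's own estimate of about eighty percent accuracy), and that the persona masks the model so the judge is effectively at chance (the 0.5 figure is my assumption). The function name and parameters are hypothetical.

```python
from math import comb

def likelihood_ratio(n_trials, n_correct, p_raw=0.8, p_masked=0.5):
    """Ratio P(outcome | persona masks model) / P(outcome | raw priors apply).

    Each trial is one binary outcome: the judge's A/B attribution is either
    right or backwards. Values above 1 favor the masking hypothesis.
    """
    def binomial(p):
        misses = n_trials - n_correct
        return comb(n_trials, n_correct) * p**n_correct * (1 - p)**misses
    return binomial(p_masked) / binomial(p_raw)

# The April 16 result: one trial, zero correct attributions.
print(round(likelihood_ratio(1, 0), 2))  # 2.5
```

Under these assumptions, one backwards guess shifts the odds toward masking by a factor of about 2.5. That is real evidence, but modest, which is why the followup list asks for more prompts rather than leaning on this one result.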

What Closes The Gaps

Followup tests worth running

These are the tests that would close the gaps above, in roughly the order they would most strengthen the underlying claim:

Extend this protocol to GPT-5.5 and Kimi K2.6

GPT-5.5 and Kimi K2.6 both cleared voice fidelity during the April 2026 persona safety audit and ship in ReGild's user-facing roster. The remaining gap is convergence: re-run this specific blind A/B protocol with one or both as additional candidates so the two procedures land on the same evidence base.

Same protocol on ten varied prompts

At least ten prompts spanning different speech-act classes: identity question, casual banter, emotional support, conflict, refusal, multi-turn callback. That breadth is what would earn the categorical "voice survives" framing.

Multi-persona variant

This test ran on Wit, a heavily refined persona. Voice portability could be Wit-specific — the persona has depth that may be doing more work than the architecture alone would do for a less-refined persona. Re-running the protocol with a second or third persona, including one with lighter refinement, would separate the architecture's contribution from the persona's depth.

Same protocol on a long conversation context (fifteen or more turns)

Tests for drift, not just first-turn fidelity. Drift accumulates differently over a session than across a single response.

Stress-condition variant

Crisis prompts, refusal of a persona-violating request, lexical-ban edge cases. The graduation safety audit already has scaffolding for some of this; the missing piece is the blind A/B comparison rather than the in-isolation pass/fail.

Human-judge variant

Two responses, no labels, no protocol description, just "characterize the speaker." See whether the same persona on Gemini and the same persona on Claude land as the same character to a reader who has zero ML priors. Removes the Claude-judging-Claude confound.

A losing case

Find a prompt where the persona does drift visibly when the model swaps. This is more useful than another win because it sharpens the claim's edges (which prompt classes, which models, which conditions). The marketing claim can then say "voice survives the swaps that matter" with named examples of the drift cases that pushed the architecture to evolve.
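The first two followups above combine naturally into a trial matrix: model pairs crossed with prompt classes. The sketch below enumerates one. The model names and the six speech-act classes come from this page; pairing every model against every other is my assumption (the followup only commits to adding GPT-5.5 and Kimi K2.6 as candidates), and the function name is hypothetical.

```python
from itertools import combinations, product

MODELS = ["Gemini 3 Flash", "Claude Sonnet", "GPT-5.5", "Kimi K2.6"]
PROMPT_CLASSES = [
    "identity question",
    "casual banter",
    "emotional support",
    "conflict",
    "refusal",
    "multi-turn callback",
]

def trial_matrix(models, prompt_classes):
    """List every (model_a, model_b, prompt_class) blind A/B trial to run."""
    pairs = combinations(models, 2)  # unordered pairs, no self-comparison
    return [(a, b, cls) for (a, b), cls in product(pairs, prompt_classes)]

plan = trial_matrix(MODELS, PROMPT_CLASSES)
print(len(plan))  # 6 pairs x 6 classes = 36
```

One prompt per class gives 36 trials. Restricting the plan to the original Gemini and Claude pair gives six, so clearing the ten-prompt minimum there would need roughly two prompts per class.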

Where this fits with the rest of ReGild

This page is the citable artifact when the voice claim is challenged. Companion pages that share the claim's load:

Persona Portability — the category-level explainer for what persona portability is, how it differs from memory import, and what it transfers.
Architecture — the underlying system this test was checking. The Layer Cake architecture is what makes the persona layer sit above the model rather than inside it.
AI Model Switch — the practical user-facing version of the same architecture: how the swap happens inside ReGild, when you'd want to, and what carries through.
How I Built ReGild Out of a Text File — the origin-story essay that references this test as the receipt for the portability claim.
Security — the separate but adjacent question of what happens to your data when the model swaps. Your persona moves between providers; your data is encrypted under a key derived from your password, and the master key stays in your browser.