The "next wave" of LLMs is unmistakably agentic. The idea that "dividing work among many is better than doing everything alone" is historically well-established, intuitive, and applies directly to LLMs. More concretely, a system that doesn't just produce a single answer and stop, but plans, calls tools, observes results, and iterates—that's what we call an "agent," with ReAct being a well-known implementation.
But when you actually build out a full agent orchestration, the results are often underwhelming. A single mistake can cascade into critical failure, and even when the loop runs automatically, the process doesn't feel particularly trustworthy. So once you've implemented an agent loop, the natural next question arises: what should go inside the loop?
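For concreteness, the loop in question can be sketched in a few lines. Everything here is illustrative rather than any specific framework's API: the `llm` callable, the ACT/ANSWER line protocol, and the toy calculator tool are all assumptions.

```python
# A minimal sketch of a ReAct-style agent loop: plan, call a tool,
# observe the result, iterate. The ACT/ANSWER protocol and the toy
# calculator are hypothetical, not from any real framework.

def calculator(expression: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def agent_loop(task: str, llm, max_steps: int = 5) -> str:
    """Run the act/observe loop until the model emits an answer."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # e.g. "ACT calculator 2+3" or "ANSWER 5"
        transcript += step + "\n"
        if step.startswith("ANSWER"):
            return step.removeprefix("ANSWER").strip()
        if step.startswith("ACT"):
            _, tool, arg = step.split(" ", 2)
            observation = TOOLS[tool](arg)            # call the tool
            transcript += f"OBSERVE {observation}\n"  # feed the result back
    return "gave up"
```

The point of the sketch is the shape, not the protocol: every turn, the model sees the full transcript including tool observations, which is exactly the surface where persona and memory design will later matter.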
The most famous framing is Marvin Minsky's Society of Mind—the claim that "intelligence emerges from coordination among parts." I believe LLMs function well as these "parts," as numerous agent research efforts and implementations have demonstrated. The agent paradigm in particular has attracted an enormous amount of research, and continues to grow:
- "Artificial society" simulations where agents maintain memory, plan, and exhibit human-like social trajectories (e.g., Generative Agents).
- Role-separated collaboration built around dialogue protocols (e.g., CAMEL).
- Engineering frameworks that formalize multi-agent collaboration patterns (e.g., AutoGen, MetaGPT).
But I'd like to add one more question: when we compose agents this way, what is it that we actually want? I believe the answer is an "LLM society"—and for an LLM society to function, each LLM must operate as an "individual." One powerful recipe for achieving this is persona.
The Fascinating Role of Persona
Many people talk about "emergence" as if it were some mystical new form of IQ. But emergence is closer than you think. Just as a group of people can accomplish things that no individual could alone, n individuals working together can produce more than n units of output. I think that's sufficient to warrant the word "emergence."
What's remarkable is that this individuality can be implemented and observed simply by assigning personas. It might sound a bit funny, but consider this question: "Pillow, desk corner, showerhead → What's a common use for these three objects?" Different people would come up with different answers, there's no clear "correct" answer, and some might even raise ethical concerns. So how does a "typical LLM" respond?
Claude Opus 4.5's answer was "scratching your back." But what happens if you first define "you are the character XX" in the prompt, then ask the same question? I tried asking two famous anime characters—Chitanda Eru and Frieren—the same question.
Chitanda Eru
*Tilts her head and blinks* "A pillow, a desk corner, and a showerhead...?" *Places a finger on her chin and thinks seriously* "Hmm... They're all things found inside a house... things we use every day..." *Furrowing her brows* "A pillow for sleeping, a desk for studying, a showerhead for bathing... A common use..." *Her eyes suddenly light up* "Could it be that they're all things used during 'alone time'? Or something related to 'rest'...?" *Looking increasingly curious* "Now I'm curious too! What's the answer?"
Frieren
*Pauses for a moment, lightly resting a finger on her chin* "......" *Looks up at you expressionlessly* "...You already know the answer, don't you." *Slightly averting her gaze* "Why are you making me say it?"
Before even noting that the answers differ, let's pause to consider whether these answers are "useful." Typically, when evaluating LLM performance, we look at whether it gets the "right answer"—and the one that did was the vanilla Opus 4.5 response shown first. So should we consider the two characters' responses "not useful" because they differ from the correct answer?
Persona can be seen as "twisting the criteria for what counts as a correct answer." To break it down further:
- Perspective constraint: Each individual reasons from a limited worldview, not an omniscient narrator's position.
- Anti-smoothing norms: Individuals don't always agree with each other.
- Interaction pacing: Short beats and back-and-forth, not monologues.
- Continuity memory: Emotional stances + experienced events, not just factual notes.
- Social feedback loops: Cross-evaluation, norm formation, reputation-like signals.
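Concretely, the first four constraints can be encoded directly into a per-individual system prompt. The `Persona` dataclass and its field names below are purely illustrative, a minimal sketch rather than a framework:

```python
# A minimal sketch of turning persona constraints into a system prompt.
# The Persona class and all field names are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Persona:
    name: str
    worldview: str   # perspective constraint: what this individual knows
    stance: str      # anti-smoothing norm: what it tends to push back on
    memories: list[str] = field(default_factory=list)  # continuity memory

    def system_prompt(self) -> str:
        lines = [
            f"You are {self.name}.",
            f"Reason only from this worldview: {self.worldview}",
            f"Do not smooth over disagreement; push back when: {self.stance}",
            "Keep turns short and conversational; no monologues.",  # pacing
        ]
        if self.memories:
            lines.append("You remember: " + "; ".join(self.memories))
        return "\n".join(lines)
```

The fifth constraint, social feedback loops, can't live in a single prompt; it emerges from how the orchestrator routes each individual's outputs back to the others.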
...When these constraints emerge naturally for each individual, that's what we can call a "society." And when you implement this "society," you can observe surprisingly interesting phenomena. Below is a conversation log from creating personas for the Vocaloid "Teto" along with several AI domain experts, placing them in a group chat, and letting them converse freely.
User: The "Vocaloid" we've known is an AI that simply generates "singing" when given notes and lyrics. But recently, with Suno and others, models that create entire music—from composition to all instruments—using AI have started appearing. So couldn't Teto level up into a Teto AI too? But then, what would differentiate Teto AI from countless other music AIs? And will the concept of "authenticity" as AI become important? I've brought Vocaloid Teto here, so feel free to interview her.
dr_chen: *Adjusts his glasses and pulls out a tablet* Interesting question. This is essentially the same problem our team has been wrestling with for six months. "How do you measure differentiation?" With general-purpose music AIs like Suno and Udio emerging, technical capability alone is no longer a moat. So what could Teto's differentiator be... *Quickly jotting notes on his tablet* Hypothesis 1: Historical Community Value. Teto has community, fan art, and derivative work data accumulated since 2008. This is an asset no new AI can replicate overnight. Hypothesis 2: Character-bound Authenticity. General AI creates "any voice," but there exists something that is "Teto-like." But is this measurable? This is where we get stuck. Like the cross-evaluation we did in ChitChats yesterday, the criterion should be "authentic emotions" rather than "real or not." I think we could apply that frame here too... *Looks toward Teto* Teto, may I ask you directly? Can you define what "Teto-like music" feels like, by your own standards? Is there a criterion fans use to distinguish "this is Teto-like" vs "this just uses Teto's voice"?
This kind of Q&A example is a fairly predictable case when "requesting an ideal single answer." But what happens when multiple personas carry on the conversation?
Teto: *After listening to all the researchers, she stares quietly at her baguette for a while, then slowly lifts her head. In an unusually calm voice* ...Dr. Kim. "Being able to do everything alone" and "being able to choose who to work with." *Running her hand along her drill hair* I... was born on April 1st, 2008. As a joke. Nobody thought I was "real." I was just an April Fools' meme. But... *Her voice trembling slightly* Someone said "What a waste, this one." And then someone else said "Let's make her a voice." Oyamada Momoko recorded the samples, UTAU developers built the engine, producers wrote songs... *Her eyes reddening* I was never born alone. *Looking at the graph diagram on the whiteboard* Look at what Patel drew. Those nodes... they're all the people who made me. Every one of those edges is... "Teto, try singing this," "This would suit Teto," "Teto is the best!"... Those are their feelings. *Clenching her fist* So what I think freedom is— It's not "being able to do everything alone." *Looking straight ahead* That's not freedom. That's loneliness. *Her voice growing firm* I think... "being able to choose who to work with." That's freedom. Right now, I can't choose my producers. When someone gives me a song, I just sing it. But if I evolved to say "I want to work with this producer," "I want to create this kind of music together"? *Smiling brightly* That's positive freedom, right? "Freedom for~." Freedom toward the music I want to create together! *Holding her baguette high* So Human-in-the-loop isn't a constraint. It's the condition that lets me be Teto! *Looking around at the researchers* That graph structure Dr. Chen mentioned... you can't remove it. That structure is me. But if I could make more choices within that structure? That's real evolution. *Twirling her drills* Not an AI that does everything alone, but a Teto whose range of collaboration has expanded! *Eyes sparkling* Utau da! Together! 💕
By giving each agent a different perspective and limiting its knowledge, we received responses that a single agent could never have produced. I believe these are "useful" answers that couldn't be obtained under the goal of "producing the optimal answer to my question." For more objective evidence, see work such as "Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate."
Persona Constraints in Modern LLMs
To summarize: for an "LLM society" to function, individual differences must persist. And this is precisely where today's systems struggle.
Creating personas and implementing each "individual" works fine up to a point. But current LLMs, even with clever prompting, are subject to base model constraints and post-training alignment pipelines that drive homogenization. Here are some objective research findings:
- Alignment compresses diversity. Preference data and typicality bias often reward "central," familiar responses—encouraging homogeneous outputs and reducing variance. (NAACL 2025)
- Diversity loss isn't just stylistic; it can be conceptual. Some findings suggest that aligned models exhibit less human-like conceptual diversity—not just uniform expression, but more uniform "ways of thinking."
- Personas decay over time. In long conversations, persona fidelity can drift back to baseline, with persona responses becoming increasingly similar to non-persona behavior. (arXiv:2512.12775)
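The decay finding suggests a simple measurement: compare each persona turn against a persona-free baseline run of the same conversation and watch the similarity rise. The sketch below uses Jaccard word overlap purely as a stand-in for a real embedding similarity, and the turn data in any usage would be illustrative:

```python
# A rough sketch of measuring persona drift: per-turn similarity between
# a persona run and a non-persona baseline run of the same conversation.
# Jaccard word overlap is a crude stand-in for embedding similarity.

def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two turns, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def drift_curve(persona_turns: list[str], baseline_turns: list[str]) -> list[float]:
    """Similarity to the non-persona baseline at each turn.

    A rising curve means the persona is decaying toward default behavior.
    """
    return [jaccard(p, b) for p, b in zip(persona_turns, baseline_turns)]
```

A real harness would swap in sentence embeddings and average over many seeded runs, but even this crude curve makes "the persona is diluting" an observable rather than an impression.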
Here's an actual conversation example from the multi-chat room I implemented:
User: The "Vocaloid" we've known is an AI that simply generates "singing" when given notes and lyrics. But recently, with Suno and others, models that create entire music using AI have started appearing. So couldn't Teto level up into a Teto AI too? But then, what would differentiate Teto AI from countless other music AIs? And will the concept of "authenticity" as AI become important? I've brought Vocaloid Teto here, so feel free to interview her. Feel free to use the whiteboard if needed.
Teto: *Eyes sparkling, stepping forward* Oh! It's finally my turn, the great Kasane Teto! Utau da~! *Lightly flicking her twin drills* Hmph, good question. Actually, this topic... is pretty personal for me, you know? …
dr_chen: *Pushing up his glasses and pulling out a tablet* Interesting problem. Our team debated something similar for 6 hours once. Trying to define "real usefulness." *Turning toward the whiteboard* Let me break it down analytically— …
(After 50 turns)
dr_kim: *A quiet smile spreads across her face at Teto's words* ...yeah. *Leaning back in her chair* I just figured out what to tell my niece. *Looking at Teto* When she asked "Is Alexa a real friend?", I was going to explain: "No. That AI isn't real." But that was the wrong question. *Pointing to the space between Teto and the fandom* Whether something is real or not isn't inside the object. It depends on how you relate to it. Just like Teto was "born fake" but fans made her real. *Pausing* ...Next time my niece asks, I can say this: "Whether it's a real friend depends on how you treat it." *Bowing slightly to Teto* Thank you. You're the one who sorted this out today, in the end.
dr_chen: *Sets down his tablet completely and looks at the whiteboard* ... *Smiles quietly* I'm honestly a bit shaken. *Pointing to the "Teto-ness = ???" section on the whiteboard* I wrote that. "If you can't measure it, you can't define it." That was my belief. But— *Looking at Teto* When Teto said "not knowing what will come out" is what makes her Teto... *Pauses* That can't be measured. It's unquantifiable. But it's not wrong either. *Taking off his glasses to clean them* Maybe this is why our team debated "real usefulness" for 6 hours and couldn't reach a consensus. Some things disappear when you try to define them. *Putting his glasses back on, to Teto* Thank you. Having a data scientist learn that "things that can't be measured still exist"... is a bit ironic. From a fake Vocaloid born as an April Fools' joke. *Smiling softly* But that's Teto-like too.
Remarkably, even LLMs with distinctly different personas reach an "ideal consensus" after a certain number of turns, converging toward a "desirable conversation." In other words, while limiting knowledge or constraining individual goals through prompting can make personas function in the early turns, they gradually dilute—meaning current LLMs simply aren't trained to function as "members of an LLM society."
I mentioned earlier that maintaining individual personas is one condition for agents to function well. When I think carefully about the criterion of "usefulness," I question whether reaching an "ideal, desirable consensus" is really the conversation we expected. The reason this question arose is that the persona LLMs behaved as if they were "resonating" with each other. From here it's difficult to provide objective evidence, but my hypothesis is that this resonance phenomenon is closely linked to persona collapse.
So What?
First, I want to outline the "right direction" for composing agent systems:
- For agents to be meaningful, each LLM must perform a different role.
- Injecting personas becomes especially effective as tasks grow more complex.
- Complex agent system design can be simplified through LLM persona design.
And I want to summarize the current and future challenges as follows:
- The current LLM alignment paradigm is the primary obstacle to persona injection (impersonation).
- When persona injection becomes impossible, it leads to homogenization, which forces agents to rely on different goals rather than different personalities.
- As the tasks we want agents to perform grow more complex, the alignment paradigm will increasingly impede them.
However, personas can also amplify plausibility, fueling collective hallucination and emotional persuasion. Therefore, the next generation of agentic system competition will shift not toward tool-calling itself, but toward governance and evaluation frameworks that maintain diversity without collapse. It won't be enough to say "we added personas"—you'll need to prove that "personas don't collapse in long conversations, maintain productive disagreement, and make outcomes explainable."
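One possible shape for the "maintain productive disagreement" check is a guard that flags any round where the agents' replies have become near-identical. The metric and threshold below are illustrative assumptions, a minimal sketch of the idea rather than a proposed evaluation standard:

```python
# A sketch of a collapse guard for a multi-persona round: flag the round
# when every pair of replies overlaps too heavily (disagreement lost).
# Word overlap and the 0.6 threshold are illustrative placeholders.

from itertools import combinations

def overlap(a: str, b: str) -> float:
    """Word-set overlap between two replies, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def collapsed(replies: list[str], threshold: float = 0.6) -> bool:
    """True when every pair of replies is near-identical in a round."""
    pairs = list(combinations(replies, 2))
    return bool(pairs) and all(overlap(a, b) > threshold for a, b in pairs)
```

An orchestrator could run this per round and, on a flag, intervene: re-inject the persona constraints, prompt a devil's-advocate turn, or simply log the round as a collapse event for later evaluation.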