What Happens When AI Reads Our Life Stories? 😈
What it means for the future of UX research and design
Let me introduce you to a study that surprised me. Not because it made grand claims, or because it used a brand-new model. Quite the opposite.
It surprised me because it took something we often overlook, the full story of a person’s life, told in their own words, and showed just how much power that narrative has when it comes to modeling human behavior.
The paper is titled “Generative Agent Simulations of 1,000 People.” And what these researchers did was deceptively simple:
They sat down with people across the United States, over a thousand of them, and conducted long, open-ended interviews. Not quick surveys. These were rich, two-hour conversations. What people believe. Where they come from. How they think about work, family, politics, identity. All of it.
Now, here’s the twist: they took those interviews, ran them through GPT-4 (yes, the very same language model you might use to summarize documents or write emails), and asked:
Based on this person’s story, how would they answer this question about religion? Or this one about economic fairness? Or this classic psychology item?

In other words, they built simulated versions of real people, using nothing but their words and a large language model as the engine.
Now pause there.
They didn’t build generic chatbots. They didn’t generate personas. They didn’t train a model on demographic clusters.
They used raw language, real stories, and unstructured interviews. And from that, they created something that could participate in surveys, personality tests, even behavioral economics experiments.
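To make that loop concrete, here is a minimal sketch of the idea, written against the current OpenAI Python SDK. The prompt wording, the model name, the hypothetical transcript file, and the simulate_answer helper are my own illustrative stand-ins, not the authors’ actual pipeline.

```python
# A minimal sketch of the core loop (not the authors' actual pipeline):
# condition a language model on the full interview transcript, then ask it
# to answer a structured survey item the way that person would.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def simulate_answer(interview_transcript: str, survey_question: str, options: list[str]) -> str:
    """Ask the model to answer one survey item in the voice of the interviewee."""
    prompt = (
        "Below is a full interview with a study participant.\n\n"
        f"{interview_transcript}\n\n"
        "Answer the following survey question the way this person would, "
        "choosing exactly one of the listed options.\n\n"
        f"Question: {survey_question}\n"
        f"Options: {', '.join(options)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4-class model the study used
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


# Example: one GSS-style trust item (file name is hypothetical)
answer = simulate_answer(
    interview_transcript=open("participant_042_interview.txt").read(),
    survey_question="Generally speaking, would you say that most people can be trusted?",
    options=["Most people can be trusted", "You can't be too careful"],
)
print(answer)
```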
And here’s the part that really caught my eye: when they compared what the simulated agent said to what the actual person had said in a follow-up session two weeks later, the results were remarkably close.
Testing AI Against Human Self-Consistency
Now, let me clarify, because I don’t want you to misunderstand what this paper is or isn’t doing.
This is not about building artificial consciousness. These agents don’t know they’re simulating someone. They don’t feel. They don’t plan. What they do is more specific and testable: they mirror the kind of answer a real person might give, when asked a structured question, given enough context from that person’s life.
And that’s exactly what the researchers tested. They didn’t just run one or two questions, they tested these agents across four major types of assessments:
The General Social Survey (GSS): a set of classic sociological questions that researchers have used for decades
The Big Five Personality Inventory: a well-established way to model human traits
Behavioral games: like the Dictator Game and the Trust Game, where you have to make decisions about fairness and reciprocity
Framed experiments: where people’s answers change depending on how the question is phrased or ordered
In each of these domains, they looked at how the agent’s responses compared to the real person’s. And not in absolute terms, but relative to how consistent the person was with themselves over time.
That was a brilliant move. Because let’s be honest: we humans are not perfectly consistent either. So the question becomes:
Can an AI model create a representation of a person that is just as consistent, believable, or accurate as the person is when they try to be true to themselves?
That’s a humble, testable question—and the answer, in many cases, was yes.

What I admire most about this study is that it doesn’t claim to solve everything. It doesn’t say these agents are perfect. But it shows—carefully, methodically—that if you give an AI model the narrative richness of a person’s voice, it can do more than guess a preference or predict a click.
It can start to reflect something deeper: the continuity of a person’s inner logic. How their answers fit together. How their beliefs and experiences shape decisions across different contexts.
And that, I believe, is worth paying attention to—not because it's flashy, but because it's foundational.
Measuring AI Against Human Variability
Let’s start with the headline result, because it’s quietly extraordinary:
In most of the structured tasks, the generative agents’ responses matched the real person’s answers almost as closely as that person matched themselves when asked again later.
To put it plainly, if I asked you how you felt about government spending today, and then again in two weeks, your answers might shift a little. These simulated agents, built only from your interview, were often as consistent with your first answer as you were.
That’s not a trick. That’s a benchmark. And it’s a realistic one, because people are not stable robots, and the authors respected that.
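If you prefer to see that benchmark as arithmetic, here is a toy sketch of the normalization idea: score the agent against the person’s first session, score the person’s second session against their first, and take the ratio. The responses below are invented; only the shape of the calculation mirrors the paper’s approach.

```python
# Toy sketch of the benchmark: score the agent against the person's first
# session, score the person's second session against their first, and take
# the ratio. All responses below are invented.

def agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which two response lists match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

person_week0 = ["agree", "disagree", "agree", "agree", "neutral"]    # first session
person_week2 = ["agree", "disagree", "neutral", "agree", "neutral"]  # two weeks later
agent        = ["agree", "disagree", "agree", "neutral", "neutral"]  # simulated from the interview

test_retest    = agreement(person_week0, person_week2)  # how consistent the person is with themselves
agent_vs_human = agreement(person_week0, agent)         # how well the agent matches the person

normalized_accuracy = agent_vs_human / test_retest
print(f"normalized accuracy: {normalized_accuracy:.2f}")  # 1.0 would mean "as consistent as the person"
```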

Different Kinds of Consistency
Now, consistency isn’t one thing. The study breaks it down across four assessment domains, and each one shows something different about what the agents are capable of.
Let me walk you through those, one by one.
1. Social Attitudes: General Social Survey (GSS)
Here we’re talking about belief-oriented questions—things like trust in others, opinions on income redistribution, or views on religion.
The agents were able to match the person’s real responses with 85% of the consistency that the person showed with themselves. And this is across 51 questions—not just cherry-picked items.
That matters because beliefs aren’t just data points. They’re often tied up with identity, culture, and memory. The fact that the AI could approximate that person’s response pattern suggests that the narrative input carried over into belief-consistent language behavior.
2. Personality Traits: The Big Five Inventory
This one surprised me a bit, and I imagine it surprised the authors too.
Using just the interview, the generative agent produced responses to 60 personality items (things like “I see myself as someone who is talkative,” “I tend to be disorganized,” etc.).
When these answers were aggregated into the standard five dimensions—Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism—the result was:
80% correlation with the person’s own Big Five profile!
That’s... striking. Because we usually think of personality inventories as requiring direct self-report. But here, the model inferred it, just from how the person spoke.
And this was better than what could be inferred from demographic data or a short bio. The depth of the interview mattered.
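For intuition, here is a small sketch of the kind of comparison behind that number: average the items into five trait scores for both the person and the agent, then correlate the two profiles. The values below are invented, and the paper’s exact scoring and aggregation may differ.

```python
# Toy sketch: average the BFI items into five trait scores for both the
# person and the agent, then correlate the two profiles. Values are invented.
import numpy as np

traits = ["Openness", "Conscientiousness", "Extraversion", "Agreeableness", "Neuroticism"]

self_report  = np.array([4.2, 3.1, 2.5, 4.0, 2.8])  # averaged from the person's own item responses
agent_report = np.array([3.9, 3.4, 2.2, 4.1, 3.2])  # averaged from the agent's responses to the same items

r = np.corrcoef(self_report, agent_report)[0, 1]
print(f"Big Five profile correlation: r = {r:.2f}")
```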
3. Economic Behavior: Incentivized Games
This is where the paper moves from beliefs and personality into decision-making under pressure.
They simulated how each person played five classic behavioral economics games—things like:
How much money would you give to another person?
Would you reciprocate trust?
Do you cooperate or defect in a dilemma?
These aren’t quizzes. They’re choice-based experiments. And the agents, again based only on the interview, matched the real players’ decisions about two-thirds of the time.
That might sound lower than before—but remember, these games are noisy even with real humans. People don’t play consistently. They second-guess. Sometimes they guess randomly.
So 66% consistency here isn’t weak—it shows the agent is picking up on behavioral tendencies, not just verbal cues.
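Here is a rough sketch of how one of those choices might be posed to an interview-grounded agent, reusing the simulate_answer helper from the earlier sketch. The prompt wording and the $100 stake are my own illustration, not the study’s actual game instructions.

```python
# Sketch: posing one incentivized choice (a Dictator Game allocation) to an
# interview-conditioned agent, reusing the simulate_answer helper sketched
# earlier. The prompt wording and the $100 stake are illustrative.

DICTATOR_PROMPT = (
    "You have been given $100. You may share any portion of it with another "
    "participant you will never meet, and you keep the rest. "
    "How many dollars do you give? Answer with a single number."
)

def simulate_dictator_game(interview_transcript: str) -> int:
    reply = simulate_answer(
        interview_transcript=interview_transcript,
        survey_question=DICTATOR_PROMPT,
        options=["any whole number of dollars from 0 to 100"],
    )
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else 0
```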
4. Experimental Treatment Effects: Group-Level Predictions
This part is subtle, but incredibly important.
The researchers also tested whether the agents could reflect population-level responses to framing effects. That is: if you reword a question or reverse the order of options, does it nudge people’s behavior? And does the same manipulation nudge the agents in the same way?
Yes. It did. In fact, the effect sizes across treatment and control groups of agents were almost identical to those observed in real human groups.
The correlation between treatment effects in agents and in people was r = 0.99—essentially perfect alignment.
This tells us something new: not only do the agents simulate individual cognition, they also allow you to replicate controlled experiments virtually, without losing the group-level patterns.
That could have significant implications for early-stage research and A/B testing.
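As a sketch of what that group-level check looks like: compute one effect size per experiment in the human data and in the agent data, then correlate the two lists across experiments. Everything below is invented toy data; only the structure of the comparison reflects what the paper describes.

```python
# Toy sketch of the group-level check: one effect size per experiment
# (treatment mean minus control mean) for humans and for agents, then a
# correlation across experiments. All numbers are invented.
import numpy as np

def effect_size(treatment: list[float], control: list[float]) -> float:
    return float(np.mean(treatment) - np.mean(control))

# One entry per experiment:
# (human treatment, human control, agent treatment, agent control)
experiments = [
    ([0.62, 0.71, 0.58], [0.40, 0.45, 0.38], [0.60, 0.69, 0.57], [0.42, 0.44, 0.40]),
    ([0.30, 0.28, 0.35], [0.33, 0.31, 0.36], [0.29, 0.27, 0.34], [0.32, 0.30, 0.35]),
    ([0.80, 0.75, 0.82], [0.55, 0.58, 0.52], [0.78, 0.74, 0.80], [0.57, 0.56, 0.54]),
]

human_effects = [effect_size(ht, hc) for ht, hc, _, _ in experiments]
agent_effects = [effect_size(at, ac) for _, _, at, ac in experiments]

r = np.corrcoef(human_effects, agent_effects)[0, 1]
print(f"treatment-effect correlation: r = {r:.2f}")
```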
Performance Against Baselines
The researchers didn’t just test the full interview agents. They tested two reduced forms:
A model with only demographic info
A model with a short, self-authored bio
Both were dramatically less accurate.
This reinforces a central message of the paper: narrative depth matters. The more of a person’s story you include, the more precisely you can simulate their reasoning. If you flatten them to demographics, you lose the nuance.
This isn’t about overfitting. The authors even did an ablation test—removing up to 80% of the interview—and found that performance degraded gracefully. That means the model isn’t latching onto a few surface traits. It’s integrating distributed cues from the entire narrative.
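If it helps to picture that comparison, here is a minimal sketch of the three conditioning regimes as prompt-context builders. The field names (interview_transcript, self_written_bio, and so on) are my own assumptions, not the study’s data schema.

```python
# Sketch of the three conditioning regimes compared in the study, expressed as
# prompt-context builders. Field names are assumptions, not the study's schema.

def build_context(participant: dict, condition: str) -> str:
    if condition == "interview":
        return participant["interview_transcript"]  # the full two-hour conversation
    if condition == "demographics":
        return (f"Age: {participant['age']}. Gender: {participant['gender']}. "
                f"Race: {participant['race']}. Political ideology: {participant['ideology']}.")
    if condition == "bio":
        return participant["self_written_bio"]  # a short paragraph in their own words
    raise ValueError(f"unknown condition: {condition}")
```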
Bias and Fairness
Now here’s an ethical and methodological bright spot.
In most predictive models, bias creeps in—especially around race, gender, or political ideology. But in this study, agents conditioned on interviews showed:
Lower predictive bias across social groups
Less ideological distortion in political belief questions
In other words, giving the model more context made it fairer. That’s a very important result for those of us working in UX, policy design, or responsible AI.
It suggests that fairness and accuracy are not in tension—if you’re willing to include more of the person’s voice.
So What Does This All Add Up To?
This is where we stop and take a breath.
The agents weren’t perfect. They didn’t capture everything. But they consistently reflected:
What a person believes
How a person thinks about themselves
How they behave under incentive
And how they might shift when questions are framed differently
And they did this not by learning general rules, but by reading that person’s story and running it through a language model designed to make sense of words.
This is not the future of intelligence. But it may be a new tool for simulating how humans use language to express who they are, and how that can inform design, research, and policy.

Five Lessons for UX Researchers and Designers
1. Narrative data carries more behavioral insight than we often assume
The study shows that long-form, qualitative data—what we often consider “anecdotal” or “too messy”—contains stable patterns that can be modeled with impressive fidelity. This should give new weight to open-ended interviews and life stories in UX research.
2. Simulations can act as early mirrors for user thinking
Before a prototype is tested in real life, a designer could use generative agents to test how different types of people (based on prior interviews) might interpret wording, framing, or choices. It’s not a replacement for user testing, but it’s a diagnostic lens.
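As one sketch of what that diagnostic lens could look like in practice: a designer might run two framings of the same dialog copy past a pool of interview-grounded agents, reusing the simulate_answer helper sketched earlier. The framings, the Yes/No options, and the agent pool are illustrative; this is a pre-check, not a substitute for testing with real users.

```python
# Sketch: run two framings of the same dialog copy past a pool of
# interview-grounded agents, reusing the simulate_answer helper sketched
# earlier. The framings and the Yes/No options are illustrative.

framings = {
    "loss": "If you don't enable backups, you could lose your files. Enable backups?",
    "gain": "Enable backups to keep your files safe on every device. Enable backups?",
}

def compare_framings(interview_transcripts: list[str]) -> dict[str, float]:
    """Share of simulated agents answering 'Yes' under each framing."""
    results = {}
    for name, wording in framings.items():
        answers = [
            simulate_answer(t, wording, options=["Yes", "No"])
            for t in interview_transcripts
        ]
        results[name] = sum(a.lower().startswith("yes") for a in answers) / len(answers)
    return results
```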
3. Richer context improves both accuracy and fairness
The models performed best and showed the least bias when they were grounded in long, personal transcripts. For design systems that use AI to adapt or personalize, this suggests that shallow inputs increase distortion.
4. Behavioral tendencies can be approximated from language
Designers who rely on attitudinal data often ignore behavioral patterns. This study shows a link between language and behavior strong enough to simulate decisions in economic games. We should be looking for ways to bridge the gap between what users say and what they do, using models that reflect both.
5. Personas may need to evolve into evidence-based simulations
If you’re still working with flat personas like “Dana, 34, busy mom, cares about privacy,” it might be time to rethink. This paper introduces the possibility of personas backed by inferred behavioral patterns, not just demographic sketches or empathy maps.
Three Things That Could Go Wrong (and Likely Will, Without Care)
1. Misuse as a substitute for actual user research
There’s a real risk that companies will treat these agent simulations as “close enough,” skipping actual contact with users. But the paper itself shows that the power of the simulation depends on real, rich interviews. Without that, the simulation is shallow and potentially misleading.
2. Overtrust in apparent intelligence
Just because an agent responds fluently doesn’t mean it understands. There’s a danger that teams might treat simulated user responses as ground truth, when in fact, they are statistical echoes of language, not lived experience. Designers must retain their interpretive role.
3. Unethical simulation without consent
This study was conducted ethically, using interviews from participants who consented. But it opens a door: simulating people without permission, using scraped content or inferred profiles. That risks creating systems that mimic individuals without respecting their agency. Design ethics must keep pace with these capabilities.

In Summary
This paper gives us no ready-made tools, but it gives us something more valuable: a clear demonstration that the way a person speaks, when taken seriously and in full, contains enough structure to model their beliefs, behaviors, and biases with surprising consistency.
For UX, this doesn’t replace research. But it does challenge us to:
Treat qualitative language as behavioral evidence, not just anecdote
Use simulation with integrity
And always ask: Are we designing with people’s real complexity in mind, or are we flattening them into assumptions?
👉🏻 Our first AI master class filled up quickly. Thank you for your strong interest and support. If you weren’t able to reserve a seat, we’ll be hosting it again on Monday, June 30, 2025, at 7 PM EDT. We hope to see you there.
References:
[1] Park, J. S., Zou, C. Q., Shaw, A., Hill, B. M., Cai, C., Morris, M. R., Willer, R., Liang, P., & Bernstein, M. S. (2024). Generative agent simulations of 1,000 people. arXiv. https://doi.org/10.48550/arXiv.2411.10109
[2] Park, J. S., Zou, C. Q., Shaw, A., Hill, B. M., Cai, C. J., Morris, M. R., Willer, R., Liang, P., & Bernstein, M. S. (2025, May 20). Simulating human behavior with AI agents. Stanford Institute for Human-Centered Artificial Intelligence. https://hai.stanford.edu/policy/simulating-human-behavior-with-ai-agents