Japan Synthetic Consumer Personas — open dataset (N=3,000) + real-answer calibration

The approach

Synthetic is fast. Real is the ground truth.

A synthetic panel is a statistically grounded model of Japanese consumers — free, instant, ideal for wide early exploration. But a model is not the people themselves: the consumer-behavior layer is aligned with the direction of official statistics, not yet calibrated against real category-level answers. So this is an open project with two layers, run in a loop.

Layer 1 · Free

Synthetic panel

3,000 personas grounded in Japan's demographics and household-income distributions. Query, segment, and simulate in seconds.

Instant, unlimited, zero cost
Wide early exploration & pre-testing
First-person narratives for high-fidelity prompting
Directional — not yet calibrated to real purchases

⇌

Calibrate

real answers sharpen the synthetic

Layer 2 · Real answers

Real Japanese respondents

Put the same question to real people when a decision rides on it — no full research project to stand up.

Ground truth for the decisions that matter
Individual answers + segment-level aggregates
Target by age, gender, region, occupation, income
Every answer calibrates the open dataset

from $0.30 / answer

How it's built

A reproducible, 3-layer cascade.

Every persona is grounded in public data with fixed seeds, so the whole pipeline reproduces. Base personas come from NVIDIA's census-grounded set; income and consumer behavior are conditioned on Japanese government statistics.

L0

Population

Stratified sample by age-band × sex to match national population proportions → 3,000 personas.

Base · NVIDIA Nemotron-Personas-Japan
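
The L0 step can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline code: the function name `stratified_sample`, the target shares, and the toy pool are all hypothetical, and the real targets would come from national population figures.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(pool, targets, n, seed=0):
    """Draw n records so each (age_band, sex) stratum matches its target share."""
    rng = random.Random(seed)  # fixed seed, so the same draw reproduces
    by_stratum = defaultdict(list)
    for rec in pool:
        by_stratum[(rec["age_band"], rec["sex"])].append(rec)

    # Largest-remainder rounding: integer quotas that sum to exactly n.
    raw = {s: share * n for s, share in targets.items()}
    quotas = {s: int(v) for s, v in raw.items()}
    leftover = n - sum(quotas.values())
    for s in sorted(raw, key=lambda s: raw[s] - quotas[s], reverse=True)[:leftover]:
        quotas[s] += 1

    sample = []
    for s in sorted(quotas):
        sample.extend(rng.sample(by_stratum[s], quotas[s]))
    return sample

# Hypothetical shares and a toy pool (NOT real census figures).
targets = {("30代", "女"): 0.30, ("30代", "男"): 0.30,
           ("60代", "女"): 0.25, ("60代", "男"): 0.15}
pool = [{"id": i, "age_band": a, "sex": s}
        for (a, s) in targets for i in range(100)]

sample = stratified_sample(pool, targets, n=20, seed=42)
counts = Counter((r["age_band"], r["sex"]) for r in sample)
print(counts)
```

Calling the function again with the same seed returns the same records, which is the property the fixed seeds are meant to give.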

L1

Income grounding

P(income | head-of-household age) combined with a prefecture income index — joint age × region conditioning.

e-Stat · MHLW & MIC surveys
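
One way to picture the L1 step is shown below. How the two sources are actually combined is not specified here, so this sketch assumes the simplest option: sample an income midpoint from a distribution conditioned on age band, then scale it by a prefecture index. Every number and the names `INCOME_GIVEN_AGE`, `PREFECTURE_INDEX`, and `sample_household_income` are placeholders, not e-Stat values or project code.

```python
import random

# Placeholder numbers for illustration only; these are NOT e-Stat values.
# Midpoints are in 万円 (units of 10,000 yen).
INCOME_GIVEN_AGE = {
    "30代": {200: 0.2, 500: 0.5, 900: 0.3},
    "60代": {200: 0.4, 500: 0.4, 900: 0.2},
}
# Hypothetical index where 1.0 stands for the national average.
PREFECTURE_INDEX = {"東京都": 1.2, "愛知県": 1.0, "沖縄県": 0.8}

def sample_household_income(age_band, prefecture, rng):
    """Sample P(income | age band), then apply the prefecture adjustment."""
    dist = INCOME_GIVEN_AGE[age_band]
    midpoint = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    return round(midpoint * PREFECTURE_INDEX[prefecture])

rng = random.Random(0)  # fixed seed for reproducibility
incomes = [sample_household_income("30代", "東京都", rng) for _ in range(5)]
print(incomes)
```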

L2

Consumer layer

Income-tier-conditioned price sensitivity, brand orientation, channels & EC — both poles always kept, no homogenization.

This dataset · statistics-aligned
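
A rough sketch of what "both poles always kept" could mean in practice: each income tier shifts the odds of a trait level, but no level is ever given zero probability, so price-insensitive low-income personas and price-sensitive high-income personas both still appear. The weights, the level labels, and the function name are assumptions made for illustration.

```python
import random
from collections import Counter

LEVELS = ["低", "中", "高"]  # low / mid / high; label values are assumed

# Hypothetical weights per income tier, in the same order as LEVELS.
PRICE_SENSITIVITY_GIVEN_TIER = {
    "low":  [0.10, 0.30, 0.60],
    "mid":  [0.25, 0.50, 0.25],
    "high": [0.55, 0.35, 0.10],
}

def sample_price_sensitivity(tier, rng):
    """Draw a level conditioned on income tier, refusing zero-mass poles."""
    weights = PRICE_SENSITIVITY_GIVEN_TIER[tier]
    if min(weights) <= 0:
        raise ValueError("every level needs non-zero probability")
    return rng.choices(LEVELS, weights=weights, k=1)[0]

rng = random.Random(0)
draws = Counter(sample_price_sensitivity("high", rng) for _ in range(2000))
print(draws)  # high-income personas skew to low sensitivity, but "高" still occurs
```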

L3

First-person narrative

A name-based, first-person life story per persona — attributes dissolved into a life, not a label list.

This dataset · backstory_250w

A · B

Persona & demographics
(Nemotron-Personas-Japan)

uuid · sex · age · age_band · marital_status · education_level · occupation · prefecture · region · persona · hobbies_and_interests · skills_and_expertise …and more

C

Household-income grounding
(e-Stat official statistics)

household_income_bracket · household_income_midpoint_manyen · income_tier · household_income_source

D

Consumer-behavior layer
(this dataset)

price_sensitivity · brand_orientation · promotion_responsiveness · bulk_buy_tendency · ec_adoption · primary_purchase_channels · media_contact · disposable_income_feel

E

First-person narrative
(this dataset)

backstory_250w — a ~220–260 char first-person account of daily life (avg 278.8 chars).

Why it works

First-person narratives, not attribute lists.

Condition an LLM on a demographic list alone and it drifts toward the population average and reproduces stereotypes, collapsing the very diversity that simulation depends on. A concrete life story conditions the model far more richly. This is a design choice with a research basis.

01

Algorithmic fidelity

Conditioning a model on detailed real backstories lets it emulate the response distributions of many human subgroups — the basis of “silicon sampling.”

Argyle et al. (2023) · Political Analysis 31(3), 337–351 · doi.org/10.1017/pan.2023.2

02

Backstories beat attribute lists

Open-ended, naturalistic backstories yield more consistent and representative virtual personas — up to +18% representativeness and +27% consistency on Pew benchmarks.

Moon et al., “Anthology” · EMNLP 2024 · aclanthology.org/2024.emnlp-main.1110

03

Grounded agents predict real answers

Grounding an agent in a person's own first-person interview predicts that individual's real survey answers at ~85% of their own test–retest reliability.

Park et al. (2024) · arXiv:2411.10109 · arxiv.org/abs/2411.10109
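
To make the contrast concrete, here is a small sketch of the two prompting styles. The prompt wording and both function names are illustrative assumptions, and the persona record is made up; only the column names come from the dataset.

```python
def attribute_prompt(persona, question):
    """Baseline: condition on a flat list of demographic attributes."""
    keys = ("sex", "age", "prefecture", "occupation")
    attrs = ", ".join(f"{k}: {persona[k]}" for k in keys)
    return (f"You are a consumer with these attributes: {attrs}.\n\n"
            f"Question: {question}")

def backstory_prompt(persona, question):
    """Narrative: lead with the first-person life story instead."""
    return (f"{persona['backstory_250w']}\n\n"
            "Answer in the first person, as the person described above.\n"
            f"Question: {question}")

# Made-up record for illustration; real rows come from the dataset.
persona = {
    "sex": "女", "age": 34, "prefecture": "愛知県", "occupation": "会社員",
    "backstory_250w": "(first-person narrative text from the dataset goes here)",
}
question = "How do you feel about spending money?"

print(attribute_prompt(persona, question))
print(backstory_prompt(persona, question))
```

Either string can be sent to any chat model; the narrative version is the one the research above favors.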

Quickstart

Load it in one line.

Column names are English; values are native Japanese (with a full JA→EN reference in the data card, so it's usable without reading Japanese).

quickstart.py

from datasets import load_dataset

ds = load_dataset("furuchanchan/japan-synthetic-personas", split="train")
print(len(ds), "personas")          # 3000
print(ds[0]["backstory_250w"])      # first-person narrative (Japanese)

# segment: women in their 30s, high income, high EC adoption
seg = ds.filter(lambda r: r["sex"]=="女" and r["age_band"]=="30代"
                 and r["income_tier"]=="high" and r["ec_adoption"]=="高")

examples/quickstart.py

Load the dataset and pull a demographic segment in 30 seconds.

View on GitHub ↗

examples/synthetic_survey.py

Run an LLM-driven concept test over the personas — the core use case.

View on GitHub ↗
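
For orientation, a concept test over personas might look something like the sketch below. It is not the contents of examples/synthetic_survey.py: `run_concept_test`, the prompt wording, and `stub_llm` are hypothetical, the personas are made up, and the stub stands in for a real model call so the example runs offline.

```python
from collections import Counter, defaultdict

def run_concept_test(personas, concept, ask_llm, segment_key="income_tier"):
    """Ask every persona about a concept and tally answers per segment."""
    tallies = defaultdict(Counter)
    for p in personas:
        prompt = (f"{p['backstory_250w']}\n\n"
                  f"Concept: {concept}\n"
                  "Would you buy this? Answer with one word: yes or no.")
        answer = ask_llm(prompt).strip().lower()
        tallies[p[segment_key]][answer] += 1
    return {seg: dict(c) for seg, c in tallies.items()}

def stub_llm(prompt):
    """Stand-in for a real LLM client; swap in an actual API call."""
    return "Yes" if "コスパ" in prompt else "No"

# Made-up personas; real ones would be rows of the loaded dataset.
personas = [
    {"income_tier": "high", "backstory_250w": "コスパを重視して買い物をします。"},
    {"income_tier": "high", "backstory_250w": "週末は家族と過ごします。"},
    {"income_tier": "low", "backstory_250w": "コスパの良い店を探すのが好きです。"},
]
results = run_concept_test(personas, "A refillable detergent subscription", stub_llm)
print(results)
```

Passing the model call in as `ask_llm` keeps the loop independent of any particular LLM provider.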

Concept testing · Survey pre-simulation · Agent-based marketing simulation · LLM evaluation personas

Real answers

See what real answers look like.

The synthetic panel is free and instant. When a decision rides on it, you ask real Japanese respondents — and this is what comes back: anonymized, structured, with English translations. Below is a live sample from an actual survey (N=115).

Q1: お金を使うこと（消費）に対してどう思いますか？ (How do you feel about spending money?)

Q2: 消費をするのはどの時間帯が多いですか？ (What time of day do you usually spend?)

Q3: 日常生活における消費活動の特徴を教えてください (Describe your consumption habits in daily life.)

60 · Female · Aichi · Under ¥2M · Positive

Part-time worker · shops in the daytime

基本は節約、衣服には流行があるし、家電も当たり外れがあるので安さ優先。ただ、友人とランチに行ったり、ご近所づきあいには、必要以上にケチりたくない。話題のお店に一度は行く、どうしても食べたいものは少々高くても食べる、くらいの贅沢はする

Basically saving, prioritizing cheapness for clothes due to trends and for appliances due to hit or miss. However, I don't want to be stingy with lunches with friends or neighborhood relations. I allow myself small luxuries like going to trendy restaurants once or eating what I really want even if it's a bit expensive.

35 · Male · Aichi · ¥10–12M · Positive

Company employee (technical) · mornings & daytime

コスパを重視し、事前に口コミを調べてから購入します。

Focus on cost performance and check reviews before purchasing.

25 · Male · Shiga · ¥2–4M · Neutral

Company employee (other) · daytime & night

推し活

Supporting my favorite idols/creators (oshi-katsu)

60 · Female · Hiroshima · ¥4–6M · Negative

Part-time worker · shops in the morning

常にお金の計算をしながら買う

Always buy while calculating money

Download the sample ↓ 115 real respondents · anonymized · Japanese + English · demographics + 3 questions (CSV)
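
Once downloaded, the sample can be summarized with a few lines of standard-library Python. The sketch below uses the four respondents shown above as inline data so it runs on its own; the column names (`age`, `gender`, `prefecture`, `income`, `sentiment`) are assumptions and may not match the headers in the actual CSV.

```python
import csv
import io
from collections import Counter, defaultdict

# Inline stand-in for the CSV, built from the four respondents shown above.
# Column names are assumed; check the real file's header row.
SAMPLE_CSV = """age,gender,prefecture,income,sentiment
60,Female,Aichi,Under ¥2M,Positive
35,Male,Aichi,¥10–12M,Positive
25,Male,Shiga,¥2–4M,Neutral
60,Female,Hiroshima,¥4–6M,Negative
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Overall sentiment split, then the same split per gender segment.
sentiment_counts = Counter(r["sentiment"] for r in rows)
by_gender = defaultdict(Counter)
for r in rows:
    by_gender[r["gender"]][r["sentiment"]] += 1

print(dict(sentiment_counts))
print({g: dict(c) for g, c in by_gender.items()})
```

For the real file, replace `io.StringIO(SAMPLE_CSV)` with `open(path, newline="", encoding="utf-8")`.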

Participate

Use it, ask real people, or help calibrate.

This is an open initiative. Take the free data and build with it, ask real Japanese respondents when it counts, or contribute answers and expertise that make the open dataset more accurate for everyone.

Free · Open

Use the open data

Download 3,000 personas under CC BY 4.0 — commercial and research use welcome. Reproduction code and the full data card are on GitHub.

Hugging Face ↗ GitHub repo ↗

Real respondents

Ask real Japanese people

Put your question to real Japanese respondents. Target by age, gender, region, occupation, education and income; get individual answers plus segment aggregates. Every answer also calibrates the open dataset.

$0.30 per answer, where answers = questions × respondents. Orders start at 3,000 answers: for example, 10 questions × 300 people = 3,000 answers = $900.
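
The pricing arithmetic can be checked with a short helper. This is an illustrative sketch: the function name `quote` is hypothetical, and treating 3,000 answers as a minimum order is an assumption about what "from 3,000 answers" means.

```python
from decimal import Decimal

PRICE_PER_ANSWER = Decimal("0.30")  # USD, as listed above
MIN_ANSWERS = 3000                  # assumed to be a minimum order size

def quote(questions, respondents):
    """Estimate study cost: answers = questions x respondents."""
    answers = questions * respondents
    return {
        "answers": answers,
        "cost_usd": answers * PRICE_PER_ANSWER,
        "meets_minimum": answers >= MIN_ANSWERS,
    }

print(quote(10, 300))  # the worked example: 3,000 answers
print(quote(5, 200))   # 1,000 answers, below the assumed minimum
```

`Decimal` is used instead of floats so that currency amounts stay exact.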

Open a request on GitHub → Email [email protected]

No charge to ask — we confirm scope & price first, and send a secure link only if you proceed.

Contribute · Discuss

Help calibrate

Have real survey data, a panel, or research to contribute? Or a question about the method? Bring it to the community — contributions that sharpen calibration are credited.

HF Discussions ↗ Contribute on GitHub ↗ Join Discord ↗

How requests are handled today: real-respondent studies and data contributions are run per request — open a GitHub issue or email us with your question, target, and rough sample size, and we'll confirm feasibility and reply. Calibration is ongoing work; the consumer layer is directionally grounded and not yet validated against real category-level purchase data, and we publish results as they land.

Start here

Start with the free dataset.

3,000 grounded synthetic Japanese consumers, open under CC BY 4.0. Pre-test on synthetic — then ask real people when the decision rides on it.

Get it on Hugging Face → Browse the code Join Discord

Facing a real decision? Ask real Japanese people →

Model Japanese consumers. Then check them against real people.