Open dataset · CC BY 4.0

Model Japanese consumers.
Then check them against real people.

3,000 statistically grounded synthetic Japanese personas — free and open on Hugging Face. Pre-test any question on the synthetic panel in seconds, then put the same question to real Japanese respondents. Every real answer is ground truth that sharpens the open dataset.

No signup — load_dataset() and go · Free under CC BY 4.0

N=3,000 personas · 47 prefectures · 36 columns · e-Stat grounded
CC BY 4.0 · Reproducible (fixed seeds) · First-person narratives · Built by TechWorker
Personas
3,000
Synthetic, statistically grounded
Geography
47/47
All prefectures, population-weighted
Mean household income
¥5.37M
e-Stat grounded, reproduces the age curve
License
CC BY 4.0
Free for commercial & research use
The approach

Synthetic is fast. Real is the ground truth.

A synthetic panel is a statistically grounded model of Japanese consumers — free, instant, ideal for wide early exploration. But a model is not the people themselves: the consumer-behavior layer is aligned with the direction of official statistics, not yet calibrated against real category-level answers. So this is an open project with two layers, run in a loop.

Layer 1 · Free

Synthetic panel

3,000 personas grounded in Japan's demographics and household-income distributions. Query, segment, and simulate in seconds.

  • Instant, unlimited, zero cost
  • Wide early exploration & pre-testing
  • First-person narratives for high-fidelity prompting
  • Directional — not yet calibrated to real purchases
Calibrate
real answers sharpen the synthetic
Layer 2 · Real answers

Real Japanese respondents

Put the same question to real people when a decision rides on it — no full research project to stand up.

  • Ground truth for the decisions that matter
  • Individual answers + segment-level aggregates
  • Target by age, gender, region, occupation, income
  • Every answer calibrates the open dataset
from $0.30 / answer
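The page does not say how real answers feed back into the synthetic panel. As one possible first step, the gap between the two layers on a single question could be measured before any adjustment is made. The sketch below assumes categorical answers (the Positive / Neutral / Negative labels seen in the real-answer sample) and uses total variation distance as the gap metric; the metric choice, the function names, and the toy answers are all illustrative assumptions, not the project's method.

```python
from collections import Counter


def share(answers):
    """Turn a list of categorical answers into a {category: share} dict."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p, q):
    """Total variation distance between two share dicts (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical answers to one question, for illustration only.
synthetic = ["Positive"] * 6 + ["Neutral"] * 3 + ["Negative"] * 1
real = ["Positive"] * 4 + ["Neutral"] * 4 + ["Negative"] * 2

gap = total_variation(share(synthetic), share(real))
print(f"calibration gap: {gap:.2f}")
```

A gap near 0 would suggest the synthetic segment already tracks real respondents on that question; a large gap marks where real answers add the most information.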
How it's built

A reproducible, 3-layer cascade.

Every persona is grounded in public data with fixed seeds, so the whole pipeline reproduces. Base personas come from NVIDIA's census-grounded set; income and consumer behavior are conditioned on Japanese government statistics.

L0

Population

Stratified sample by age-band × sex to match national population proportions → 3,000 personas.

Base · NVIDIA Nemotron-Personas-Japan
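A minimal sketch of what a stratified draw by age-band × sex with a fixed seed could look like. The function name, the toy pool, the target shares, and the "男" and "60代" values are assumptions for illustration; the actual reproduction code is on GitHub and may differ.

```python
import random
from collections import defaultdict


def stratified_sample(pool, target_share, n, seed=0):
    """Draw about n rows so each (age_band, sex) stratum matches target_share."""
    rng = random.Random(seed)  # fixed seed so the draw reproduces
    by_stratum = defaultdict(list)
    for row in pool:
        by_stratum[(row["age_band"], row["sex"])].append(row)

    sample = []
    for stratum, share in sorted(target_share.items()):
        quota = round(n * share)
        # rng.sample raises ValueError if the stratum has fewer rows than its quota
        sample.extend(rng.sample(by_stratum[stratum], quota))
    return sample


# Toy pool: 100 rows per stratum (hypothetical values).
pool = []
for band in ("30代", "60代"):
    for sex in ("女", "男"):
        pool += [{"age_band": band, "sex": sex, "id": len(pool) + k} for k in range(100)]

# Hypothetical population shares; real ones would come from official statistics.
target = {("30代", "女"): 0.3, ("30代", "男"): 0.2, ("60代", "女"): 0.3, ("60代", "男"): 0.2}
sample = stratified_sample(pool, target, n=20, seed=42)
print(len(sample))
```

Because quotas are rounded per stratum, the total can differ slightly from n when shares do not divide evenly.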
L1

Income grounding

P(income | head-of-household age) combined with a prefecture income index — joint age × region conditioning.

e-Stat · MHLW & MIC surveys
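The text does not specify how the age-conditional distribution and the prefecture index are combined. One simple reading, sketched here under that assumption, is to draw an income midpoint from P(income | age band) and scale it by the prefecture index. Every number and name below is hypothetical.

```python
import random

# Hypothetical P(income midpoint | age band); midpoints are in 万円 (10,000 yen).
INCOME_GIVEN_AGE = {
    "30代": {300: 0.3, 500: 0.5, 800: 0.2},
    "60代": {300: 0.5, 500: 0.3, 800: 0.2},
}
# Hypothetical prefecture income index (national average = 1.0).
PREFECTURE_INDEX = {"東京都": 1.15, "愛知県": 1.05, "沖縄県": 0.85}


def draw_income(age_band, prefecture, rng):
    """Sample a midpoint given age band, then scale by the prefecture index (assumed multiplicative)."""
    dist = INCOME_GIVEN_AGE[age_band]
    midpoint = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    return round(midpoint * PREFECTURE_INDEX[prefecture])


rng = random.Random(42)  # fixed seed for reproducibility
income = draw_income("30代", "東京都", rng)
print(income, "万円")
```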
L2

Consumer layer

Income-tier-conditioned price sensitivity, brand orientation, channels & EC — both poles always kept, no homogenization.

This dataset · statistics-aligned
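One way to read "both poles always kept" is that every income tier retains a non-trivial share of both price-sensitive and price-insensitive personas. The sketch below illustrates that reading with a conditional table and a minimum-share check. The probabilities, the floor, and the tier and level names (other than the "high" tier used in the quickstart) are assumptions, not values from the dataset.

```python
import random
from collections import Counter

# Hypothetical P(price_sensitivity | income_tier).
PRICE_SENSITIVITY = {
    "low":  {"high": 0.60, "mid": 0.30, "low": 0.10},
    "mid":  {"high": 0.35, "mid": 0.40, "low": 0.25},
    "high": {"high": 0.15, "mid": 0.35, "low": 0.50},
}
MIN_POLE_SHARE = 0.10  # assumed floor so neither pole vanishes in any tier


def poles_kept(table, floor):
    """True if every tier keeps at least `floor` probability on both extremes."""
    return all(d["high"] >= floor and d["low"] >= floor for d in table.values())


def draw_price_sensitivity(income_tier, rng):
    """Sample a price-sensitivity level conditioned on income tier."""
    dist = PRICE_SENSITIVITY[income_tier]
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


rng = random.Random(0)  # fixed seed for reproducibility
draws = Counter(draw_price_sensitivity("high", rng) for _ in range(1000))
print(poles_kept(PRICE_SENSITIVITY, MIN_POLE_SHARE), dict(draws))
```

Even in the high-income tier, some personas remain highly price-sensitive, which is the diversity a purely income-driven rule would erase.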
L3

First-person narrative

A name-based, first-person life story per persona — attributes dissolved into a life, not a label list.

This dataset · backstory_250w
A · B
Persona & demographics
(Nemotron-Personas-Japan)
uuid · sex · age · age_band · marital_status · education_level · occupation · prefecture · region · persona · hobbies_and_interests · skills_and_expertise · …and more
C
Household-income grounding
(e-Stat official statistics)
household_income_bracket · household_income_midpoint_manyen · income_tier · household_income_source
D
Consumer-behavior layer
(this dataset)
price_sensitivity · brand_orientation · promotion_responsiveness · bulk_buy_tendency · ec_adoption · primary_purchase_channels · media_contact · disposable_income_feel
E
First-person narrative
(this dataset)
backstory_250w — a ~220–260 char first-person account of daily life (avg 278.8 chars).
Why it works

First-person narratives, not attribute lists.

Condition an LLM on a demographic list alone and it drifts to the population average and reproduces stereotypes — collapsing the diversity simulation depends on. A concrete life story conditions the model far more richly. This is a design choice with a research basis.

01

Algorithmic fidelity

Conditioning a model on detailed real backstories lets it emulate the response distributions of many human subgroups — the basis of “silicon sampling.”

Argyle et al. (2023) · Political Analysis 31(3), 337–351 · doi.org/10.1017/pan.2023.2
02

Backstories beat attribute lists

Open-ended, naturalistic backstories yield more consistent and representative virtual personas — up to +18% representativeness and +27% consistency on Pew benchmarks.

Moon et al., “Anthology” · EMNLP 2024 · aclanthology.org/2024.emnlp-main.1110
03

Grounded agents predict real answers

Grounding an agent in a person's own first-person interview predicts that individual's real survey answers at ~85% of their own test–retest reliability.

Park et al. (2024) · arXiv:2411.10109 · arxiv.org/abs/2411.10109
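To make the contrast concrete, here is a sketch of the two prompting styles: a flat attribute list versus a prompt that leads with the persona's first-person narrative. The prompt wording, the function names, and the English toy persona are illustrative assumptions (real backstories are in Japanese), and no particular LLM API is implied.

```python
def attribute_prompt(persona, question):
    """Baseline: condition the model on a flat attribute list."""
    keys = ("age", "sex", "prefecture", "occupation")
    attrs = ", ".join(f"{k}={persona[k]}" for k in keys)
    return f"Respondent attributes: {attrs}\nQuestion: {question}\nAnswer:"


def backstory_prompt(persona, question):
    """Backstory-first: lead with the first-person narrative, then ask in character."""
    return (
        f"{persona['backstory_250w']}\n\n"
        "Answer the following question in the first person, as this person.\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical persona row for illustration.
persona = {
    "age": 34,
    "sex": "女",
    "prefecture": "愛知県",
    "occupation": "会社員",
    "backstory_250w": "I work near Nagoya Station and do most of my shopping online after my kids are asleep.",
}
question = "How do you feel about spending money?"
print(backstory_prompt(persona, question))
```

Either string can be sent to any chat or completion model; only the conditioning text differs.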
Quickstart

Load it in one line.

Column names are English; values are native Japanese (with a full JA→EN reference in the data card, so it's usable without reading Japanese).

quickstart.py
from datasets import load_dataset

ds = load_dataset("furuchanchan/japan-synthetic-personas", split="train")
print(len(ds), "personas")          # 3000
print(ds[0]["backstory_250w"])      # first-person narrative (Japanese)

# segment: women in their 30s, high income, high EC adoption
seg = ds.filter(
    lambda r: r["sex"] == "女" and r["age_band"] == "30代"
    and r["income_tier"] == "high" and r["ec_adoption"] == "高"
)
Concept testing · Survey pre-simulation · Agent-based marketing simulation · LLM evaluation personas
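Once a segment is filtered, a natural next step is to summarize a consumer-layer column within it. The helper below is a sketch that works on any iterable of dict-like rows, so it is shown on toy rows rather than the real dataset. The helper name and the "高" / "中" / "低" scale for price_sensitivity are assumptions, inferred from the ec_adoption value in the quickstart.

```python
from collections import Counter


def segment_shares(rows, column):
    """Share of each value of `column` across the rows of a segment."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Toy rows standing in for a filtered segment (hypothetical values).
seg_rows = [
    {"price_sensitivity": "高"},
    {"price_sensitivity": "高"},
    {"price_sensitivity": "中"},
    {"price_sensitivity": "低"},
]
shares = segment_shares(seg_rows, "price_sensitivity")
print(shares)
```

Iterating a Hugging Face Dataset yields one dict per row, so the filtered seg from the quickstart could be passed in place of seg_rows.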
Real answers

See what real answers look like.

The synthetic panel is free and instant. When a decision rides on it, you ask real Japanese respondents — and this is what comes back: anonymized, structured, with English translations. Below is a live sample from an actual survey (N=115).

Q1 · お金を使うこと(消費)に対してどう思いますか? (How do you feel about spending money?)
Q2 · 消費をするのはどの時間帯が多いですか? (What time of day do you usually spend?)
Q3 · 日常生活における消費活動の特徴を教えてください (Describe your consumption habits in daily life.)
60 · Female · Aichi · Under ¥2M · Positive
Part-time worker · shops in the daytime

基本は節約、衣服には流行があるし、家電も当たり外れがあるので安さ優先。ただ、友人とランチに行ったり、ご近所づきあいには、必要以上にケチりたくない。話題のお店に一度は行く、どうしても食べたいものは少々高くても食べる、くらいの贅沢はする

Basically saving, prioritizing cheapness for clothes due to trends and for appliances due to hit or miss. However, I don't want to be stingy with lunches with friends or neighborhood relations. I allow myself small luxuries like going to trendy restaurants once or eating what I really want even if it's a bit expensive.

35 · Male · Aichi · ¥10–12M · Positive
Company employee (technical) · mornings & daytime

コスパを重視し、事前に口コミを調べてから購入します。

Focus on cost performance and check reviews before purchasing.

25 · Male · Shiga · ¥2–4M · Neutral
Company employee (other) · daytime & night

推し活

Supporting my favorite idols/creators (oshi-katsu)

60 · Female · Hiroshima · ¥4–6M · Negative
Part-time worker · shops in the morning

常にお金の計算をしながら買う

Always buy while calculating money

Download the sample · 115 real respondents · anonymized · Japanese + English · demographics + 3 questions (CSV)
Participate

Use it, ask real people, or help calibrate.

This is an open initiative. Take the free data and build with it, ask real Japanese respondents when it counts, or contribute answers and expertise that make the open dataset more accurate for everyone.

Free · Open

Use the open data

Download 3,000 personas under CC BY 4.0 — commercial and research use welcome. Reproduction code and the full data card are on GitHub.

Real respondents

Ask real Japanese people

Put your question to real Japanese respondents. Target by age, gender, region, occupation, education and income; get individual answers plus segment aggregates. Every answer also calibrates the open dataset.

$0.30 / answer (questions × respondents), from 3,000 answers — e.g. 10 questions × 300 people = $900.

No charge to ask — we confirm scope & price first, and send a secure link only if you proceed.
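The pricing arithmetic above can be sketched as a small estimator. It assumes that "from 3,000 answers" means a minimum order size, which is an interpretation rather than a stated rule, and the function name is hypothetical. Prices are kept in integer cents to avoid floating-point rounding.

```python
PRICE_CENTS_PER_ANSWER = 30  # $0.30 per answer, from the pricing line above
MIN_ANSWERS = 3_000          # assumption: "from 3,000 answers" read as a minimum order


def estimate_cost_usd(questions, respondents):
    """Estimated study cost in USD: questions x respondents x price per answer."""
    answers = questions * respondents
    if answers < MIN_ANSWERS:
        raise ValueError(f"{answers} answers is below the assumed minimum of {MIN_ANSWERS}")
    return answers * PRICE_CENTS_PER_ANSWER / 100


print(estimate_cost_usd(10, 300))  # the page's own example: 10 questions x 300 people
```

The confirmed scope and price come from the team before a study runs, so treat this only as a rough planning figure.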

Contribute · Discuss

Help calibrate

Have real survey data, a panel, or research to contribute? Or a question about the method? Bring it to the community — contributions that sharpen calibration are credited.

How requests are handled today: real-respondent studies and data contributions are run per request — open a GitHub issue or email us with your question, target, and rough sample size, and we'll confirm feasibility and reply. Calibration is ongoing work; the consumer layer is directionally grounded and not yet validated against real category-level purchase data, and we publish results as they land.
Start here

Start with the free dataset.

3,000 grounded synthetic Japanese consumers, open under CC BY 4.0. Pre-test on synthetic — then ask real people when the decision rides on it.

Facing a real decision? Ask real Japanese people →

Free · 50 rows · No signup
See the data in 5 seconds.
A 50-persona sample — all 36 columns. Opens in any spreadsheet.
Download sample (CSV) · 50 rows · 36 cols · 0.2 MB · Full 3,000 on Hugging Face