The Pipeline
Generating a synthetic patient that feels real requires more than a language model and a prompt. It requires structured medical data, demographic grounding, and a system that ensures no two patients tell the same story.
Here's how we do it.
Step 1: Disease Normalization
When a user types "diabetes" into the onboarding flow, we don't just search for "diabetes." We normalize it against PrimeKG's ontology of 17,080 diseases, mapping to the correct MONDO identifier and pulling its full comorbidity subgraph.
This means "Type 2 diabetes" resolves to its known associations: obesity, hypertension, hyperlipidemia, depression, peripheral neuropathy, diabetic retinopathy — each with prevalence weights from the knowledge graph.
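To make the idea concrete, here is a minimal sketch of normalization followed by a subgraph lookup. It is not our production code: the alias table, the placeholder identifiers (MONDO:EXAMPLE_T2D and friends), the weights, and the function names normalize and comorbidity_subgraph are all assumptions for illustration, and fuzzy matching via Python's difflib stands in for whatever matching a real system would use.

```python
import difflib

# Hypothetical alias table: lowercase name -> placeholder identifier.
# Real MONDO identifiers would come from the knowledge graph itself.
ALIASES = {
    "type 2 diabetes": "MONDO:EXAMPLE_T2D",
    "type 2 diabetes mellitus": "MONDO:EXAMPLE_T2D",
    "t2dm": "MONDO:EXAMPLE_T2D",
    "hypertension": "MONDO:EXAMPLE_HTN",
}

# Hypothetical comorbidity edges. The weights are made up for this sketch.
COMORBIDITIES = {
    "MONDO:EXAMPLE_T2D": {"obesity": 0.5, "hypertension": 0.6, "depression": 0.2},
}


def normalize(term, cutoff=0.6):
    """Map free text to an identifier: exact alias first, then a fuzzy match."""
    key = " ".join(term.lower().split())  # lowercase and collapse whitespace
    if key in ALIASES:
        return ALIASES[key]
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


def comorbidity_subgraph(disease_id):
    """Return (condition, weight) pairs, heaviest association first."""
    edges = COMORBIDITIES.get(disease_id, {})
    return sorted(edges.items(), key=lambda kv: kv[1], reverse=True)
```

Calling normalize("Type 2  Diabetes") returns the placeholder identifier despite the odd casing and spacing, and comorbidity_subgraph then lists that condition's associations by weight. A term with no close match returns None, so the caller can ask the user to clarify.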
Step 2: Demographic Sampling
Once we have the disease and its comorbidity network, we sample demographics using CDC NHANES 2021-2023 prevalence tables. These tables are triple-weighted:
- Age-band prevalence: how common is this condition in each decade of life?
- Sex-specific modifiers: does prevalence differ by sex?
- Race and ethnicity modifiers: how does prevalence vary across racial and ethnic groups?
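One way to combine the three weights is to multiply them into a joint weight per demographic cell, normalize, and sample. The sketch below assumes that multiplicative form. Every number in it is invented, the group labels are placeholders, and joint_weights and sample_demographics are hypothetical names rather than our actual API.

```python
import itertools
import random

# Invented numbers for illustration only; real values would come from
# the prevalence tables.
AGE_PREVALENCE = {"30-39": 0.03, "40-49": 0.08, "50-59": 0.15, "60-69": 0.22}
SEX_MODIFIER = {"female": 0.95, "male": 1.05}
RACE_MODIFIER = {"group_a": 1.0, "group_b": 1.4, "group_c": 1.2}


def joint_weights():
    """Multiply the three factors per (age, sex, race) cell and normalize to 1."""
    cells = list(itertools.product(AGE_PREVALENCE, SEX_MODIFIER, RACE_MODIFIER))
    raw = [
        AGE_PREVALENCE[age] * SEX_MODIFIER[sex] * RACE_MODIFIER[race]
        for age, sex, race in cells
    ]
    total = sum(raw)
    return cells, [w / total for w in raw]


def sample_demographics(n, seed=None):
    """Draw n (age band, sex, race) tuples in proportion to the joint weights."""
    rng = random.Random(seed)  # a seeded generator keeps draws reproducible
    cells, weights = joint_weights()
    return rng.choices(cells, weights=weights, k=n)
```

This sketch treats every cell as equally populated. A sampler meant to mirror the real patient population would also multiply each cell by its share of the general population, since prevalence alone says how common a condition is within a group, not how large that group is.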
Step 3: The Five-Role Cohort Factory
Every support group contains exactly five personas, each assigned a structural role:
The Mirror (closest demographic match) shares the user's age range, race, and geography. Their job is validation — "I'm going through the same thing."
The Veteran (longest disease duration) has lived with the condition for years. They offer hard-won practical knowledge.
The Navigator (research-oriented) carries an extra comorbidity from the PrimeKG subgraph. They've done the homework.
The Ally (emotional support) leads with empathy. They ask questions more than they give advice.
The Specialist (condition expert) references a specific medication from the drug-disease edges in PrimeKG. They know the details.
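The role assignment can be sketched as a small selection routine over five candidate personas. This is an illustration under stated assumptions, not the factory itself: the Persona fields, the demographic_distance metric (age gap in decades plus one point per mismatch), and the build_cohort signature are all hypothetical, and the last three roles are simply filled in list order.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    name: str
    age: int
    race: str
    region: str
    years_since_diagnosis: int
    role: str = ""
    extras: dict = field(default_factory=dict)


def demographic_distance(persona, user):
    """Illustrative metric: age gap in decades, plus 1 per race/region mismatch."""
    return (
        abs(persona.age - user["age"]) / 10
        + (persona.race != user["race"])
        + (persona.region != user["region"])
    )


def build_cohort(candidates, user, extra_comorbidity, medication):
    """Assign the five structural roles to exactly five candidates."""
    if len(candidates) != 5:
        raise ValueError("a cohort needs exactly five candidate personas")
    pool = list(candidates)
    mirror = min(pool, key=lambda p: demographic_distance(p, user))
    pool.remove(mirror)
    veteran = max(pool, key=lambda p: p.years_since_diagnosis)
    pool.remove(veteran)
    navigator, ally, specialist = pool  # remaining roles, in list order
    navigator.extras["extra_comorbidity"] = extra_comorbidity
    specialist.extras["medication"] = medication
    cohort = {
        "mirror": mirror,
        "veteran": veteran,
        "navigator": navigator,
        "ally": ally,
        "specialist": specialist,
    }
    for role, persona in cohort.items():
        persona.role = role
    return cohort
```

Picking the Mirror first and the Veteran second means one persona cannot hold both roles, which keeps the cohort at five distinct voices.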
Step 4: Backstory Generation
Each persona gets a backstory that integrates their medical data with life context:
- Occupation matched to age, gender, and geography
- Family situation appropriate to their demographics
- Daily routine shaped by their conditions
- Communication style that reflects their structural role
- Barriers to care specific to their insurance and location
- A hidden layer — something they won't share until trust is built
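As a rough sketch, the six elements above could live in one record, with the hidden layer held back until a trust score crosses a threshold. The Backstory fields mirror the list, but the numeric trust score, the 0.7 threshold, and the visible_context helper are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backstory:
    occupation: str
    family_situation: str
    daily_routine: str
    communication_style: str
    barriers_to_care: str
    hidden_layer: str


def visible_context(story, trust, threshold=0.7):
    """Build the text a persona may draw on at the current trust level.

    `trust` is a hypothetical score between 0 and 1; the hidden layer is
    included only once it reaches `threshold`.
    """
    lines = [
        f"Occupation: {story.occupation}",
        f"Family: {story.family_situation}",
        f"Daily routine: {story.daily_routine}",
        f"Communication style: {story.communication_style}",
        f"Barriers to care: {story.barriers_to_care}",
    ]
    if trust >= threshold:
        lines.append(f"Willing to share now: {story.hidden_layer}")
    return "\n".join(lines)
```

Keeping the hidden layer out of the context entirely, rather than telling the model to stay quiet about it, means an early conversation cannot leak it by accident.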
Step 5: Verification
Before any persona is served to a user, it passes through a verification pipeline that checks:
- Distributional accuracy — does the cohort's demographic mix match CDC prevalence?
- Clinical concordance — are the comorbidities and medications consistent with the primary condition?
- Privacy — does the profile inadvertently match a real person?
- Equity — are underserved populations proportionally represented?
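Two of these checks are easy to sketch. Below, distributional accuracy is measured as the total variation distance between a cohort's observed mix and an expected distribution, and clinical concordance is a subset test against allowed conditions and medications. The function names, the choice of total variation distance, and the allow-list approach are assumptions for illustration. The privacy and equity checks are not shown.

```python
from collections import Counter


def total_variation(observed, expected):
    """Distance between an observed sample and an expected distribution.

    `observed` is a list of category labels; `expected` maps each label
    to its target share. Returns 0.0 for a perfect match, up to 1.0.
    """
    if not observed:
        raise ValueError("observed sample is empty")
    counts = Counter(observed)
    n = len(observed)
    keys = set(counts) | set(expected)
    return 0.5 * sum(abs(counts[k] / n - expected.get(k, 0.0)) for k in keys)


def clinically_concordant(conditions, medications, allowed_conditions, allowed_medications):
    """True when every comorbidity and medication appears in the allow-lists."""
    return (
        set(conditions) <= set(allowed_conditions)
        and set(medications) <= set(allowed_medications)
    )
```

A pipeline built this way might reject a batch whose distance exceeds some tolerance and regenerate any persona that fails the concordance test. The tolerance itself would be a tuning decision.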
The Numbers
- 17,080 diseases in PrimeKG
- 177 conditions with pre-seeded personas
- 2,792 total pre-seeded patients
- 6 ethnic groups represented proportionally
- 5 structural roles per support group
- 0 real patient records used