Building Patient Support: How We Created 2,792 Patients From a Knowledge Graph

The Pipeline

Generating a synthetic patient that feels real requires more than a language model and a prompt. It requires structured medical data, demographic grounding, and a system that ensures no two patients tell the same story.

Here's how we do it.

Step 1: Disease Normalization

When a user types "diabetes" into the onboarding flow, we don't just run a string search for "diabetes." We normalize the input against PrimeKG's ontology of 17,080 diseases, map it to the correct MONDO identifier, and pull that disease's full comorbidity subgraph.

This means "Type 2 diabetes" resolves to its known associations: obesity, hypertension, hyperlipidemia, depression, peripheral neuropathy, and diabetic retinopathy, each with a prevalence weight from the knowledge graph.
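As a rough illustration of the lookup described above, here is a minimal Python sketch. It assumes a tiny hand-written synonym table and edge map in place of the real graph; the identifier `EXAMPLE:T2D`, the weights, and the function names `normalize_disease` and `comorbidity_subgraph` are all hypothetical placeholders, not values or APIs from PrimeKG or MONDO.

```python
import re

# Hypothetical stand-in for a small slice of the graph. The identifier and
# the weights are illustrative placeholders, not values from PrimeKG or MONDO.
SYNONYM_TO_ID = {
    "diabetes": "EXAMPLE:T2D",
    "type 2 diabetes": "EXAMPLE:T2D",
    "t2dm": "EXAMPLE:T2D",
}

COMORBIDITY_EDGES = {
    "EXAMPLE:T2D": {
        "obesity": 0.60,
        "hypertension": 0.55,
        "hyperlipidemia": 0.45,
        "peripheral neuropathy": 0.25,
        "depression": 0.20,
        "diabetic retinopathy": 0.15,
    },
}


def normalize_disease(user_text):
    """Return the disease identifier for free text, or None if unknown."""
    # Collapse runs of whitespace, trim, and lowercase before the lookup.
    key = re.sub(r"\s+", " ", user_text).strip().lower()
    return SYNONYM_TO_ID.get(key)


def comorbidity_subgraph(disease_id):
    """Return (comorbidity, weight) pairs, heaviest weight first."""
    edges = COMORBIDITY_EDGES.get(disease_id, {})
    return sorted(edges.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    disease_id = normalize_disease("  Type 2   Diabetes ")
    for name, weight in comorbidity_subgraph(disease_id):
        print(f"{name}: {weight:.2f}")
```

A real normalizer would also need fuzzy matching and disambiguation for inputs that map to several diseases; the exact-match table here only shows the shape of the step.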

Step 2: Demographic Sampling

Once we have the disease and its comorbidity network, we sample demographics using CDC NHANES 2021-2023 prevalence tables. The sampling weights combine three factors:

Age band prevalence: how common is this condition in each decade of life?
Sex-specific modifiers: does prevalence differ by sex?
Race-specific modifiers: how does prevalence vary across racial and ethnic groups?

This ensures our generated patients reflect real-world demographic patterns rather than uniform random draws.
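One way to combine the three factors is to multiply them into a single weight per demographic cell and draw from that distribution. The sketch below assumes exactly that; the table values, group labels, and function names are invented for illustration and are not NHANES figures. It also makes a simplifying assumption that every cell has the same population share; a fuller version would multiply each weight by the cell's share of the population.

```python
import random
from itertools import product

# Illustrative placeholder tables, NOT real NHANES prevalence values.
AGE_PREVALENCE = {"30-39": 0.04, "40-49": 0.10, "50-59": 0.18, "60-69": 0.25}
SEX_MODIFIER = {"female": 0.9, "male": 1.1}
RACE_MODIFIER = {"group_a": 1.0, "group_b": 1.3, "group_c": 0.8}


def build_weights():
    """Return every (age, sex, race) cell with its combined weight."""
    cells = list(product(AGE_PREVALENCE, SEX_MODIFIER, RACE_MODIFIER))
    weights = [
        AGE_PREVALENCE[age] * SEX_MODIFIER[sex] * RACE_MODIFIER[race]
        for age, sex, race in cells
    ]
    return cells, weights


def sample_demographics(rng, n):
    """Draw n demographic cells in proportion to their combined weights."""
    cells, weights = build_weights()
    return rng.choices(cells, weights=weights, k=n)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the draw is repeatable
    for age, sex, race in sample_demographics(rng, 5):
        print(age, sex, race)
```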

Step 3: The Five-Role Cohort Factory

Every support group contains exactly five personas, each assigned a structural role:

The Mirror (closest demographic match) shares the user's age range, race, and geography. Their job is validation — "I'm going through the same thing."

The Veteran (longest disease duration) has lived with the condition for years. They offer hard-won practical knowledge.

The Navigator (research-oriented) carries an extra comorbidity from the PrimeKG subgraph. They've done the homework.

The Ally (emotional support) leads with empathy. They ask questions more than they give advice.

The Specialist (condition expert) references a specific medication from the drug-disease edges in PrimeKG. They know the details.
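The role assignment can be sketched as a small selection routine. The version below is an assumption-heavy illustration: the `Persona` fields, the `match_score` heuristic, and the rule that the last three roles are filled in list order are all hypothetical, since the post does not say how the Navigator, Ally, and Specialist are chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    name: str
    age: int
    race: str
    region: str
    years_with_condition: int
    role: str = ""
    comorbidities: list = field(default_factory=list)
    medication: str = ""


def match_score(persona, user):
    """Count how many of age decade, race, and region match the user."""
    same_decade = persona.age // 10 == user["age"] // 10
    return (int(same_decade)
            + int(persona.race == user["race"])
            + int(persona.region == user["region"]))


def assign_roles(personas, user, extra_comorbidity, medication):
    """Assign the five roles; returns a dict mapping role name to persona."""
    if len(personas) != 5:
        raise ValueError("a support group needs exactly five personas")
    pool = list(personas)
    # Mirror: closest demographic match to the user.
    mirror = max(pool, key=lambda p: match_score(p, user))
    pool.remove(mirror)
    # Veteran: longest disease duration among those left.
    veteran = max(pool, key=lambda p: p.years_with_condition)
    pool.remove(veteran)
    # Assumption: the remaining three roles are filled in list order.
    navigator, ally, specialist = pool
    navigator.comorbidities.append(extra_comorbidity)
    specialist.medication = medication
    group = {"Mirror": mirror, "Veteran": veteran, "Navigator": navigator,
             "Ally": ally, "Specialist": specialist}
    for role, persona in group.items():
        persona.role = role
    return group
```

Picking the Mirror before the Veteran means a persona who is both the best demographic match and the longest-diagnosed ends up as the Mirror; whether the production system breaks that tie the same way is not stated in the post.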

Step 4: Backstory Generation

Each persona gets a backstory that integrates their medical data with life context:

Occupation matched to age, gender, and geography
Family situation appropriate to their demographics
Daily routine shaped by their conditions
Communication style that reflects their personality role
Barriers to care specific to their insurance and location
A hidden layer — something they won't share until trust is built

These backstories are deterministic: the same seed always produces the same output. We QA-audited all 2,792 profiles against 17 consistency rules and found zero violations.
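To make "same seed, same output" concrete, here is a minimal sketch of seeded generation. It assumes the seed is derived from a stable hash of the condition, role, and persona index; that derivation, the field names, and the option lists are hypothetical and stand in for whatever the real generator uses.

```python
import hashlib
import random

# Hypothetical option lists, for illustration only.
OCCUPATIONS = ["teacher", "truck driver", "nurse", "retired", "line cook"]
COMMUNICATION_STYLES = ["direct", "warm", "analytical", "reserved"]


def persona_rng(condition_id, role, index):
    """Build a random generator whose seed depends only on the inputs."""
    key = f"{condition_id}|{role}|{index}".encode("utf-8")
    # Take the first 8 bytes of the SHA-256 digest as an integer seed.
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)


def backstory(condition_id, role, index):
    """Return a small backstory dict that is identical on every call."""
    rng = persona_rng(condition_id, role, index)
    return {
        "occupation": rng.choice(OCCUPATIONS),
        "communication_style": rng.choice(COMMUNICATION_STYLES),
        "household_size": rng.randint(1, 5),
    }


if __name__ == "__main__":
    first = backstory("EXAMPLE:T2D", "Veteran", 3)
    second = backstory("EXAMPLE:T2D", "Veteran", 3)
    print(first == second)  # the two calls produce the same dict
```

The sketch uses `hashlib` rather than Python's built-in `hash()` because string hashes from `hash()` are salted per process by default, so they would not give the same seed across runs.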

Step 5: Verification

Before any persona is served to a user, it passes through a verification pipeline that checks:

Distributional accuracy — does the cohort's demographic mix match CDC prevalence?
Clinical concordance — are the comorbidities and medications consistent with the primary condition?
Privacy — does the profile inadvertently match a real person?
Equity — are underserved populations proportionally represented?
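The distributional check is the easiest of these to illustrate. The sketch below compares a cohort's observed demographic mix with a target mix using total variation distance and a tolerance. Both the metric and the threshold are assumptions chosen for the example; the post does not say which statistic the pipeline uses, and the group labels are placeholders.

```python
from collections import Counter


def total_variation(observed_labels, target_shares):
    """Half the sum of absolute differences between observed and target shares."""
    if not observed_labels:
        raise ValueError("observed_labels must not be empty")
    counts = Counter(observed_labels)
    n = len(observed_labels)
    groups = set(counts) | set(target_shares)
    return 0.5 * sum(
        abs(counts.get(g, 0) / n - target_shares.get(g, 0.0)) for g in groups
    )


def passes_distribution_check(observed_labels, target_shares, tolerance=0.15):
    """True when the cohort's mix is within tolerance of the target mix."""
    return total_variation(observed_labels, target_shares) <= tolerance


if __name__ == "__main__":
    cohort = ["group_a"] * 6 + ["group_b"] * 4
    target = {"group_a": 0.5, "group_b": 0.5}
    print(round(total_variation(cohort, target), 3))
    print(passes_distribution_check(cohort, target))
```

With small cohorts a fixed tolerance is a blunt instrument, so a check like this is more meaningful when run over all personas for a condition than over a single five-person group.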

The Numbers

17,080 diseases in PrimeKG
177 conditions with pre-seeded personas
2,792 total pre-seeded patients
6 ethnic groups represented proportionally
5 structural roles per support group
0 real patient records used

