What Is PrimeKG? The Harvard Knowledge Graph Behind 17,000 Disease Maps

PrimeKG is a Harvard knowledge graph mapping 17,080 diseases to genes, drugs, and symptoms. How it works, why it matters for patient support, and its limits.

PatientSupport Team

Content Team

When you type a disease name into a search engine, you get a list of web pages. When a clinician types a disease name into an electronic health record, they get a billing code. Neither of these tells you what the disease actually connects to — which other conditions travel with it, which genes are implicated, which drugs target its mechanisms, or which symptoms overlap with something else entirely.

A knowledge graph does. And PrimeKG is one of the most comprehensive medical knowledge graphs ever assembled.

What a Knowledge Graph Actually Is

A knowledge graph is a structured representation of relationships between entities. In medicine, the entities are things like diseases, genes, proteins, drugs, symptoms, and biological pathways. The relationships are statements like "Disease A is associated with Gene B," "Drug C targets Protein D," or "Disease E has symptom F."

Unlike a database table, which stores flat rows of data, a knowledge graph stores a network. You can traverse it — start at one disease and follow its connections outward to find related conditions, shared mechanisms, and potential treatment targets. This is what makes knowledge graphs useful for tasks that require understanding context, not just retrieving facts.
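To make "traversing" concrete, here is a minimal sketch in Python. It assumes the graph is held as plain (head, relation, tail) triples; the entity names are the placeholder ones from the paragraph above plus one made-up "encodes" edge, and none of it is taken from PrimeKG itself.

```python
from collections import deque

# Toy triples (head, relation, tail); names are illustrative, not PrimeKG data.
TRIPLES = [
    ("Disease A", "associated_with", "Gene B"),
    ("Drug C", "targets", "Protein D"),
    ("Gene B", "encodes", "Protein D"),
    ("Disease E", "has_symptom", "Symptom F"),
]

def build_adjacency(triples):
    """Index triples as an undirected adjacency map: node -> set of neighbours."""
    adjacency = {}
    for head, _relation, tail in triples:
        adjacency.setdefault(head, set()).add(tail)
        adjacency.setdefault(tail, set()).add(head)
    return adjacency

def within_hops(adjacency, start, max_hops):
    """Breadth-first search returning {node: hop distance} up to max_hops from start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances

reachable = within_hops(build_adjacency(TRIPLES), "Disease A", 3)
```

Starting from "Disease A", the search reaches "Drug C" in three hops by way of the shared gene and protein, while "Disease E" is never reached because nothing connects it to the starting node. A flat table lookup on "Disease A" alone would not surface that drug.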

Google's Knowledge Graph is the most famous consumer example — it powers those information panels that appear when you search for a person, place, or thing. PrimeKG applies the same structural concept to medicine, but with far more rigor and specificity.

PrimeKG: The Numbers

PrimeKG, short for Precision Medicine Knowledge Graph, was developed by researchers at Harvard and published in Scientific Data, a Nature Portfolio journal, in 2023 (Chandak, Huang & Zitnik). The dataset is hosted on the Harvard Dataverse and is freely available for research use.

The scale is significant:

  • 17,080 diseases mapped with standardized ontology identifiers (MONDO)
  • 29,786 genes and proteins linked to disease mechanisms
  • 4,050 drugs connected to their targets, indications, and contraindications
  • Over 4 million relationships connecting these entities across 20 different relationship types
  • Integrated from 20 source databases including DrugBank, DisGeNET, the Human Protein Atlas, Reactome, SIDER, and the Disease Ontology

These are not approximate numbers scraped from the internet. They are curated, structured data pulled from peer-reviewed biomedical databases and reconciled into a single unified graph.
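Counts like these can in principle be recomputed from an edge list. The sketch below assumes a CSV with one row per relationship and columns named `relation`, `x_id`, `x_type`, `y_id`, and `y_type`; those column names and the sample rows are assumptions made for illustration, so check the dataset's own documentation before relying on them.

```python
import csv
import io
from collections import Counter

# A few made-up rows in the assumed column layout; not real PrimeKG records.
SAMPLE_CSV = """relation,x_id,x_type,x_name,y_id,y_type,y_name
disease_protein,D1,disease,example disease,P1,gene/protein,EXAMPLE1
drug_protein,DR1,drug,example drug,P1,gene/protein,EXAMPLE1
indication,DR1,drug,example drug,D1,disease,example disease
"""

def summarize(rows):
    """Count edges per relation type and distinct nodes per node type."""
    relation_counts = Counter()
    nodes = set()
    for row in rows:
        relation_counts[row["relation"]] += 1
        # A node is identified by (type, id) so the same entity is counted once.
        nodes.add((row["x_type"], row["x_id"]))
        nodes.add((row["y_type"], row["y_id"]))
    node_type_counts = Counter(node_type for node_type, _ in nodes)
    return relation_counts, node_type_counts

relation_counts, node_type_counts = summarize(csv.DictReader(io.StringIO(SAMPLE_CSV)))
```

For a real file, replace the `io.StringIO` sample with an opened file handle. Because the reader streams row by row, the same function should cope with a multi-million-row edge list without loading it all into memory.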

Why PrimeKG Matters for Patient Support

Most AI health tools — chatbots, symptom checkers, search engines with AI summaries — generate responses from language models trained on text. The text includes medical knowledge, but also includes blog posts, forum comments, outdated guidelines, and outright misinformation. The model cannot distinguish between a Cochrane review and a wellness blog post. It predicts the most likely next word, not the most medically accurate one.

This is the hallucination problem in healthcare AI. A language model can generate plausible-sounding medical text that is factually wrong, and it does so with confidence. Studies have documented hallucination rates in medical AI ranging from 5% to over 30%, depending on the model, the domain, and how a hallucination is defined and measured.

Knowledge graphs address this differently. Instead of generating text from statistical patterns, a knowledge-grounded system can:

  • Normalize a disease name to its canonical form (so "sugar diabetes" resolves to Type 2 diabetes mellitus, MONDO:0005148)
  • Map comorbidities by traversing real disease-disease associations in the graph (so Type 2 diabetes connects to obesity, hypertension, peripheral neuropathy, and depression through validated edges)
  • Identify shared mechanisms by following gene and protein connections (so two seemingly unrelated conditions can be linked through a common biological pathway)
  • Surface drug interactions by checking the drug-protein and drug-disease edges in the graph

None of this requires the system to "know" these facts in the way a language model knows things. The facts are looked up in a structured, citable dataset. The system retrieves rather than generates, which means the output can be traced back to its source.
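The "shared mechanisms" step and the traceability claim can be sketched together. In this hypothetical example every edge carries a `source` field, and the lookup returns the supporting edges along with the answer; the condition names, gene names, and source labels are all placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    head: str
    relation: str
    tail: str
    source: str  # database the statement was taken from

# Hypothetical edges and source labels, for illustration only.
EDGES = [
    Edge("condition X", "disease_gene", "GENE1", "source-db-1"),
    Edge("condition X", "disease_gene", "GENE2", "source-db-1"),
    Edge("condition Y", "disease_gene", "GENE2", "source-db-2"),
    Edge("condition Y", "disease_gene", "GENE3", "source-db-2"),
]

def genes_for(disease):
    """Map each gene linked to a disease to the edge that asserts the link."""
    return {e.tail: e for e in EDGES
            if e.head == disease and e.relation == "disease_gene"}

def shared_mechanisms(disease_a, disease_b):
    """Return genes linked to both diseases, with the supporting edges as evidence."""
    a, b = genes_for(disease_a), genes_for(disease_b)
    return {gene: (a[gene], b[gene]) for gene in sorted(a.keys() & b.keys())}

shared = shared_mechanisms("condition X", "condition Y")
```

Here `shared` holds one gene, GENE2, together with the two edges that justify it. Returning the evidence next to the answer is what lets a downstream response say where a link came from.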

How PatientSupport.AI Uses PrimeKG

PatientSupport.AI integrates PrimeKG as the foundational data layer for understanding disease relationships. Here is what that means in practice:

Disease Normalization

When a user enters a condition during onboarding — "lupus," "type 1 diabetes," "Parkinson's" — the system normalizes it against PrimeKG's ontology of 17,080 diseases. This ensures that the conversation is anchored to a medically specific entity, not a vague keyword. "Arthritis" becomes rheumatoid arthritis or osteoarthritis depending on context, because the knowledge graph distinguishes between them at the mechanism level.
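A minimal sketch of normalization, assuming a synonym index keyed by lowercase surface form. The index below is hypothetical: MONDO:0005148 is the identifier this article gives for Type 2 diabetes, while the `EX:` identifiers are placeholders, and a production system would likely use richer matching than an exact dictionary lookup.

```python
# Hypothetical synonym index: surface form -> candidate (id, canonical name) pairs.
# The "EX:" identifiers are placeholders, not real ontology IDs.
SYNONYMS = {
    "sugar diabetes": [("MONDO:0005148", "type 2 diabetes mellitus")],
    "type 2 diabetes": [("MONDO:0005148", "type 2 diabetes mellitus")],
    "arthritis": [("EX:0001", "rheumatoid arthritis"), ("EX:0002", "osteoarthritis")],
}

def normalize(user_text):
    """Resolve free text to one disease, a list of candidates, or nothing."""
    key = " ".join(user_text.lower().split())  # lowercase and collapse whitespace
    candidates = SYNONYMS.get(key, [])
    if len(candidates) == 1:
        return {"status": "resolved", "match": candidates[0]}
    if candidates:
        # More than one plausible entity: ask the user or use context to pick.
        return {"status": "ambiguous", "candidates": candidates}
    return {"status": "unknown"}
```

Calling `normalize("  Sugar   Diabetes ")` resolves to a single entity, while `normalize("arthritis")` comes back as ambiguous with two candidates. Surfacing the ambiguity explicitly, instead of silently picking one, keeps the conversation anchored to an entity that was actually confirmed.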

Comorbidity Mapping

Each disease in PrimeKG has validated associations with other diseases. The system uses these associations to construct a comorbidity profile — not by guessing, but by following edges in the graph that were derived from clinical databases. This is how PatientSupport.AI can surface connections between conditions that a patient might not have considered (like the link between celiac disease and autoimmune thyroiditis, or between COPD and osteoporosis).
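As a rough illustration of "following edges", the sketch below builds a comorbidity profile from a list of disease-disease pairs. The pairs are the examples mentioned in this article, hand-typed for the demo and not exported from PrimeKG, and the function name is hypothetical.

```python
# Illustrative disease-disease edges using pairs named in this article; not a PrimeKG export.
DISEASE_EDGES = [
    ("celiac disease", "autoimmune thyroiditis"),
    ("COPD", "osteoporosis"),
    ("type 2 diabetes mellitus", "obesity"),
    ("type 2 diabetes mellitus", "hypertension"),
]

def comorbidity_profile(conditions, edges=DISEASE_EDGES):
    """For each condition, list directly linked diseases (edges treated as undirected)."""
    profile = {condition: set() for condition in conditions}
    for a, b in edges:
        if a in profile:
            profile[a].add(b)
        if b in profile:
            profile[b].add(a)
    return {condition: sorted(linked) for condition, linked in profile.items()}

profile = comorbidity_profile(["osteoporosis", "type 2 diabetes mellitus"])
```

Because the edges are treated as undirected, a user who enters "osteoporosis" still sees the COPD link even though the pair was stored the other way around. A condition with no matching edges simply gets an empty list, so the profile never contains a guessed association.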

Grounding AI Responses

PatientSupport.AI uses a Llama 70B model served through Groq for natural language generation. The model does not operate in a vacuum, though: it is grounded in PrimeKG data. When the system describes a condition, its comorbidities, or its treatment landscape, that description is informed by structured knowledge graph data, not just language model predictions. This reduces, but does not eliminate, the risk of hallucinated medical information.
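One common way to ground a model is to place retrieved facts directly in the prompt and instruct the model to stay within them. The sketch below shows only that prompt-assembly step. It is an assumption about how such grounding could look, not a description of PatientSupport.AI's actual pipeline, and it makes no model call.

```python
def build_grounded_prompt(question, facts):
    """Assemble a prompt that restricts the model to retrieved graph facts.

    `facts` is a list of (head, relation, tail) triples already looked up
    in the knowledge graph. Returns None when nothing was retrieved.
    """
    if not facts:
        return None  # caller should decline instead of letting the model guess
    lines = [
        f"[{i}] {head} --{relation}--> {tail}"
        for i, (head, relation, tail) in enumerate(facts, 1)
    ]
    return (
        "Answer using only the numbered facts below and cite them by number.\n"
        "If the facts do not cover the question, say so.\n\n"
        "Facts:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"
    )

facts = [("COPD", "associated_with", "osteoporosis")]
prompt = build_grounded_prompt("What conditions are linked to COPD?", facts)
```

Numbering the facts gives the model something to cite, which makes unsupported statements easier to spot in review. The model can still ignore the instruction, which is one reason grounding reduces hallucination without eliminating it.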

What PrimeKG Cannot Do

Being honest about limitations matters a great deal in healthcare AI, where errors can affect care decisions. PrimeKG has important constraints:

  • It is a snapshot, not a live feed. The graph reflects the state of biomedical knowledge at the time of its compilation. New drug approvals, newly discovered disease associations, and emerging research are not automatically captured. PrimeKG needs to be periodically updated.
  • It does not capture prevalence or incidence. The graph knows that Type 2 diabetes is associated with peripheral neuropathy, but it does not know what percentage of patients develop it. This is demographic and epidemiological data that lives in different datasets (like CDC NHANES).
  • It does not encode clinical guidelines. PrimeKG captures biological relationships, not treatment recommendations. Knowing that Drug A targets Protein B is not the same as knowing when to prescribe Drug A.
  • It reflects historical bias. Biomedical databases overrepresent conditions that have been heavily studied (cancers, cardiovascular disease) and underrepresent conditions that have been historically neglected (many rare diseases, conditions disproportionately affecting underresearched populations).
  • It is not a diagnostic tool. Nothing in PrimeKG replaces clinical judgment. The graph tells you which diseases are associated — it does not tell you which one a specific patient has.

The Broader Context: Knowledge Graphs in Medical AI

PrimeKG is not the only medical knowledge graph, but it is among the most comprehensive publicly available ones.

The trend across these projects is clear: structured, curated medical knowledge provides a stronger foundation for AI-assisted health tools than language models operating on unstructured text alone. Knowledge graphs will not replace language models, because the two are complementary. But grounding language models in structured data is one of the most promising approaches to reducing hallucination in healthcare AI.

What This Means for You

If you are a patient or caregiver using PatientSupport.AI, PrimeKG is the reason the system can tell you which conditions are related to yours, which biological mechanisms connect them, and which parts of the medical landscape might be relevant to your situation. It is the difference between an AI that guesses based on web text and an AI that looks up structured, peer-reviewed data.

That said, PatientSupport.AI remains an informational tool. It is free to use without an account, and an optional free account lets you save your conversation history. But it does not diagnose conditions, prescribe treatments, or replace conversations with your medical team. It exists to make those conversations more informed.

Disclaimer: This article is for informational purposes only. PrimeKG is a research dataset and is not a diagnostic or clinical decision-support tool. PatientSupport.AI is not a medical provider and does not offer medical advice. The AI may produce inaccurate information despite knowledge graph grounding. Always consult qualified healthcare professionals for medical decisions.

Citation: Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Sci Data 10, 67 (2023). https://doi.org/10.1038/s41597-023-01960-3
