Methodology
A transparent 7-step scientific pipeline, from PubMed to a graded conclusion.
1. Our promise
Full transparency — every tier rating is backed by verifiable calculation logic: the formula is public, the weights are public, the PMIDs are public.
A limited promise — we do not claim "the truth," only "the best currently verifiable evidence." When new evidence appears, ratings are updated.
Independence — no manufacturer sponsorship, no paid placement, no sponsored content.
2. The 7-layer evidence engine
Each claim is fetched and verified in parallel by 12 sub-agents, grouped into 7 evidence layers:
| Layer | Source | What it fetches |
|---|---|---|
| L1 | Examine.com | Existing conclusion summaries + evidence grades |
| L2 | PubMed | Systematic reviews, meta-analyses, RCTs |
| L4a | FDA / EMA / NIH ODS / Harvard / AAD, etc. | Regulatory status, institutional positions |
| L4b | TFDA / Ministry of Health and Welfare | Taiwan regulatory status |
| L5b | Cochrane / large RCTs | Randomized controlled trials |
| L5c | Meta-analysis journals | Systematic synthesis |
| L5d | FDA Safety Communications / DailyMed | Side effects, contraindications |
| L5e | ClinicalTrials.gov | Ongoing trials |
| L10a | Manufacturer websites, advertising | Advertising-claim intensity |
| L10b | Cross-platform marketing | Advertising contamination level |
| L10c | PTT / Dcard / Mobile01 | Taiwan community discussion |
| L11 | Independent Claude Opus assessment | Cross-layer sanity check |
Every citation in every layer must be grounded in a PMID or a traceable URL. Before publication, all PMIDs are checked against NCBI Entrez to confirm they actually exist (currently 0 hallucinated PMIDs).
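The PMID existence check can be pictured roughly as follows. This is a minimal sketch, not the engine's actual code: it builds a request URL for the public NCBI ESummary endpoint and then inspects an already-fetched JSON payload. The helper names (`esummary_url`, `missing_pmids`) are hypothetical, and the payload shape (a `result` object keyed by PMID, with an `error` field on unknown IDs) is an assumption that should be verified against the E-utilities documentation.

```python
import re

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(pmids):
    """Build an ESummary request URL for a batch of PMIDs (JSON output)."""
    return f"{ESUMMARY_URL}?db=pubmed&id={','.join(pmids)}&retmode=json"


def is_wellformed_pmid(pmid):
    # Assumption: a PMID is a positive integer of at most 9 digits.
    return re.fullmatch(r"[1-9]\d{0,8}", pmid) is not None


def missing_pmids(pmids, payload):
    """Return the PMIDs that the (already fetched) payload does not confirm.

    Assumed payload shape: {"result": {"<pmid>": {...}, ...}}, where an
    unknown ID either has no entry or carries an "error" field.
    """
    result = payload.get("result", {})
    missing = []
    for pmid in pmids:
        record = result.get(pmid)
        if not is_wellformed_pmid(pmid) or record is None or "error" in record:
            missing.append(pmid)
    return missing
```

A claim would be held back from publication whenever `missing_pmids` returns a non-empty list for its citations.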
3. The 6-tier rating system
Scoring formula (simplified)
    raw_score = weighted_avg({
        L1_score  × 1.0,  // Examine grade
        L2_score  × 1.0,  // PubMed direction signal
        L4a_score × 1.0,  // authoritative position (NIH ODS gets an extra ×1.5)
        L4b_score × 0.5,  // TFDA position (lower weight)
        L5b_score × 1.5,  // Cochrane / large RCTs
        L5c_score × 1.5,  // Meta-analysis
        L5d_score × 1.0,  // safety (negative adjustment)
    })
    → score_to_tier(raw_score)

Strict tier-floor requirements
A score alone is not enough — a high tier must also clear structural gates before it can be awarded:
- S tier requires: at least 2 Cochrane / large independent systematic reviews that agree
- A tier requires: at least 1 meta-analysis or large RCT + an NIH ODS / Examine A grade
- B tier requires: at least 1 RCT or an Examine B grade
- C tier requires: at least 1 small human trial
- D tier: high-quality evidence points predominantly toward "ineffective" or "harmful"
- U tier: none of the above is met (insufficient data)
If the score is high but a structural gate is not cleared, the rating is forcibly downgraded.
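The interaction between the score and the structural gates can be sketched as below. The layer weights are taken from the simplified formula above; everything else is an assumption made for illustration: the 0 to 1 score scale, the cut-offs in `score_to_tier`, the field names in the evidence dictionary, and the reading of the A-tier gate as "meta-analysis or large RCT" AND "NIH ODS / Examine A grade". The NIH ODS ×1.5 bonus, the way the safety layer acts as a negative adjustment, and the D tier (which depends on the direction of the evidence) are left out.

```python
# Weights from the simplified formula; all other values are hypothetical.
WEIGHTS = {"L1": 1.0, "L2": 1.0, "L4a": 1.0, "L4b": 0.5,
           "L5b": 1.5, "L5c": 1.5, "L5d": 1.0}
GRADED_TIERS = ["S", "A", "B", "C"]  # best to worst


def weighted_avg(scores):
    """Weighted mean over the layers that actually returned a score."""
    weight = sum(WEIGHTS[layer] for layer in scores)
    total = sum(WEIGHTS[layer] * value for layer, value in scores.items())
    return total / weight if weight else 0.0


def score_to_tier(raw_score):
    # Hypothetical cut-offs on an assumed 0-1 scale.
    for tier, cutoff in (("S", 0.85), ("A", 0.70), ("B", 0.50), ("C", 0.30)):
        if raw_score >= cutoff:
            return tier
    return "U"


def gate_cleared(tier, ev):
    """Structural gate for one tier; `ev` holds hypothetical evidence counts."""
    if tier == "S":
        return ev.get("agreeing_systematic_reviews", 0) >= 2
    if tier == "A":
        return ev.get("meta_or_large_rct", 0) >= 1 and ev.get("nih_or_examine_a", False)
    if tier == "B":
        return ev.get("rcts", 0) >= 1 or ev.get("examine_b", False)
    if tier == "C":
        return ev.get("small_human_trials", 0) >= 1
    return False


def final_tier(scores, ev):
    """Start from the score-based tier and step down until a gate is cleared."""
    start = score_to_tier(weighted_avg(scores))
    if start not in GRADED_TIERS:
        return "U"
    for tier in GRADED_TIERS[GRADED_TIERS.index(start):]:
        if gate_cleared(tier, ev):
            return tier
    return "U"
```

Under these assumptions, a claim that scores in the S range but has only one agreeing systematic review and no A grade from NIH ODS or Examine falls through the S and A gates and lands on B, provided it has at least one RCT.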
4. Cross-layer sanity check (L11)
After raw_score is computed, Claude Opus makes an independent assessment without looking at any of the other layers:
- It searches PubMed directly and reads the abstracts
- It produces an independent grade with its own reasoning
- If Opus's conclusion conflicts sharply with the aggregator → it triggers an escalation:
  - counter_evidence honor: if L11 judges it "ineffective" while the aggregator computed B or above → forcibly downgrade to D
  - safety_review honor: if L11 detects an active FDA warning → forcibly flag the safety_review status
This layer is designed to catch cases where "the scoring went wrong but a human reading it would immediately see the error."
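The two escalation rules can be expressed as a small override step that runs after aggregation. This is a sketch under stated assumptions: the function name, its parameters, and the `"ineffective"` verdict string are hypothetical, and the status is left unchanged in the counter-evidence case because the text above only specifies the tier change.

```python
B_OR_ABOVE = {"S", "A", "B"}


def apply_l11_overrides(aggregator_tier, status, l11_verdict, active_fda_warning):
    """Return (tier, status) after applying the two L11 escalation rules."""
    tier = aggregator_tier

    # counter_evidence honor: L11 says "ineffective" but the aggregator gave B or above.
    if l11_verdict == "ineffective" and aggregator_tier in B_OR_ABOVE:
        tier = "D"

    # safety_review honor: an active FDA warning always forces the status flag.
    if active_fda_warning:
        status = "safety_review"

    return tier, status
```

For example, `apply_l11_overrides("A", "published", "ineffective", False)` yields tier D, while a C-tier claim with the same verdict keeps its tier because the rule only fires at B or above.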
5. Taiwan noise index (Consumer Market Risk)
Independent of the evidence tier, every claim also carries a consumer-market-noise level:
| Level | Icon | Meaning |
|---|---|---|
| high | 🚨 | Extremely high marketing intensity in the Taiwan market; PTT/Dcard chatter is chaotic; extra caution needed |
| medium | ⚠️ | Moderate marketing, but credible sources exist for cross-checking |
| low | ✨ | Clean market information; judgment rests mainly on medical evidence |
Why we present them separately: a supplement may be scientifically A tier (effective) yet 🚨 high CMR in the market (over-hyped advertising, wide quality variation among Taiwan brands) — both messages matter, and users need to see them at the same time.
6. Status labels
Each claim also carries a status that expresses "how it is handled from a media and regulatory perspective":
| Status | Meaning | Count |
|---|---|---|
| published | Mainstream consensus supports it; safe to cite | 109 |
| published_with_warning | Evidence supports it but with side effects or population limits | 355 |
| disputed | Mainstream literature reaches contradictory conclusions | 89 |
| counter_evidence | High-quality evidence shows it is ineffective or harmful | 65 |
| needs_more_evidence | Not enough evidence to reach a conclusion | 45 |
| tw_blackbox | Taiwan-market information is extremely opaque | 42 |
| safety_review | FDA / EMA / TFDA has issued an active safety warning | 12 |
7. Update cadence
- PubMed fetch: monthly (checking for newly published research)
- Regulatory-status check: quarterly
- TW community check: monthly (PTT / Dcard / Mobile01)
- L11 sanity re-run: every six months (or triggered when major evidence appears)
- Version tags: each claim page footer is stamped with engine_version + aggregated_at
If a claim has not been re-reviewed by anyone within two years, the page is flagged with a ⚠️ evidence may be outdated notice.
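The two-year staleness rule amounts to a simple timestamp comparison. The sketch below assumes each claim stores a timezone-aware `last_reviewed_at` timestamp (a hypothetical field; the page footer itself only shows engine_version and aggregated_at) and approximates two years as 730 days.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=2 * 365)  # "two years", approximated as 730 days


def is_outdated(last_reviewed_at, now=None):
    """True if the claim has gone more than two years without a re-review."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_reviewed_at > STALE_AFTER
```

A page whose claim returns `True` here would carry the "evidence may be outdated" notice.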
8. Accuracy & calibration
We ran a blind audit on ourselves. We randomly drew 5% of the claims (36 of them, 6 from each of the S/A/B/C/D/U tiers) and handed every one to an independent assessor who re-graded it from scratch — re-checking PubMed / Cochrane / FDA / NIH, actively searching for counter-evidence, using the very same public S–U criteria, never seeing the engine's original tier until after grading was done.
| Metric | Result |
|---|---|
| Exact tier match | 39% (14 / 36) |
| Within ±1 tier | 75% (27 / 36) |
| Engine more optimistic than the independent assessor (over-rating direction) | 3 (2 of which were actually the engine being right) |
| Engine more conservative than the independent assessor | 19 |
On a 6-tier ordinal scale, within ±1 tier is the reasonable agreement band — given the same body of evidence, different reviewers routinely differ by one tier. What really matters is the direction of the error.
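The agreement metrics in the table can be reproduced from a list of (engine tier, assessor tier) pairs. The sketch below assumes the ordinal order S > A > B > C > D > U and uses a hypothetical function name. The reported figures are consistent with this arithmetic: 14 / 36 ≈ 39%, 27 / 36 = 75%, and 14 exact matches + 3 optimistic + 19 conservative = 36.

```python
TIERS = ["S", "A", "B", "C", "D", "U"]  # assumed ordinal order, best first
RANK = {tier: i for i, tier in enumerate(TIERS)}


def agreement(pairs):
    """Summarize agreement for (engine_tier, assessor_tier) pairs."""
    # Negative diff: the engine's tier is higher (more optimistic) than the assessor's.
    diffs = [RANK[engine] - RANK[assessor] for engine, assessor in pairs]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "engine_optimistic": sum(d < 0 for d in diffs),
        "engine_conservative": sum(d > 0 for d in diffs),
    }
```

Feeding it four made-up pairs such as `[("A", "A"), ("B", "A"), ("C", "A"), ("A", "B")]` gives one exact match, three within ±1 tier, one optimistic and two conservative disagreements.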
The most important direction: for health content, the most dangerous error is "over-rating the evidence — calling something effective when it isn't." In that direction the engine almost never errs — out of 36 samples only 1 rating was too lenient (chondroitin × skin aging, since corrected to U via the adjudication described below), and almost all the remaining disagreements were the engine being more cautious than the independent assessor. For YMYL health topics, we would rather under-rate than over-state.
An honest caveat: this was a single independent assessor, not absolute truth — it too may lean optimistic. So we treat every disagreement as a signal that "this claim should be re-examined," not as an automatic instruction to change the tier. Health ratings should not be flipped on a whim by a single opinion.
We have already done the next round: two multi-reviewer adjudication panels (3 independent reviewers per claim + structural tier-floor rules) re-reviewed 44 disputed and "insufficient-data" claims and applied 31 corrections — including 8 downgrades for "high-quality evidence proving it ineffective" (e.g., vitamin E × cardiovascular disease, kava × insomnia), as well as upgrades for definitional therapies (e.g., vitamin C × scurvy U→A, coenzyme Q10 × heart failure U→B). Every corrected claim page includes a "manually adjudicated rating" note and fully preserves the engine's originally computed tier. We re-run the audit periodically and publish updates here.
9. Known limitations
We do not pretend this engine is perfect:
- L2 PubMed recall is limited — about 80–90% of mainstream literature is covered; obscure journals and preprints may be missed
- L10c TW community coverage is biased toward public platforms — private Facebook groups and LINE groups cannot be scraped
- Language bias — we mainly fetch English-language literature; some Japanese and Korean studies may be missed
- Supplements update fast — new ingredients (such as spermidine / urolithin-A) need several months before there is enough PubMed literature to grade
- L11 Opus carries a hallucination risk — PMID back-checking guards against fabrication, but the reasoning text may still be inaccurate
We continuously log these errors and correct the engine.
10. Cite this site
If you are a health-content creator, researcher, or developer, you are welcome to cite this site:
@misc{gptdict_2026,
title = "{gpt-dict.com}: A Scientific Evidence Database for Supplements",
author = "Lenny Chen and gpt-dict-engine contributors",
year = "2026",
url = "https://gpt-dict.com/"
}
Or add this in your article:
<a href="https://gpt-dict.com/claim/{CLAIM_ID}/" rel="external">
Evidence source: gpt-dict.com
</a>
We hope this database becomes the foundational evidence layer for Taiwan's health-content ecosystem.