Methodology

A transparent 7-step scientific pipeline, from PubMed to a graded conclusion.

1. Our promise

Full transparency — every tier rating is backed by verifiable calculation logic: the formula is public, the weights are public, the PMIDs are public.

A limited promise — we do not claim "the truth," only "the best currently verifiable evidence." When new evidence appears, ratings are updated.

Independence — no manufacturer sponsorship, no paid placement, no sponsored content.

2. The 7-layer evidence engine

Each claim is fetched and verified in parallel by 12 sub-agents, grouped into 7 evidence layers:

| Layer | Source | What it fetches |
|-------|--------|-----------------|
| L1 | Examine.com | Existing conclusion summaries + evidence grades |
| L2 | PubMed | Systematic reviews, meta-analyses, RCTs |
| L4a | FDA / EMA / NIH ODS / Harvard / AAD, etc. | Regulatory status, institutional positions |
| L4b | TFDA / Ministry of Health and Welfare | Taiwan regulatory status |
| L5b | Cochrane / large RCTs | Randomized controlled trials |
| L5c | Meta-analysis journals | Systematic synthesis |
| L5d | FDA Safety Communications / DailyMed | Side effects, contraindications |
| L5e | ClinicalTrials.gov | Ongoing trials |
| L10a | Manufacturer websites, advertising | Advertising-claim intensity |
| L10b | Cross-platform marketing | Advertising contamination level |
| L10c | PTT / Dcard / Mobile01 | Taiwan community discussion |
| L11 | Independent Claude Opus assessment | Cross-layer sanity check |

Every citation in every layer must be grounded in a PMID or a traceable URL. Before publication, every PMID is checked against NCBI Entrez to confirm that it actually exists (the current count of hallucinated PMIDs is 0).
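As a rough illustration of the PMID back-check, the sketch below builds an NCBI E-utilities ESummary request and partitions cited PMIDs into found and missing. It is not the site's actual verifier: the helper names are hypothetical, and the assumed response shape (records keyed by PMID under `result`, with an `error` field on unknown IDs) should be confirmed against the E-utilities documentation before relying on it.

```python
from urllib.parse import urlencode

EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(pmids):
    """Build an ESummary request URL for a batch of PMIDs (given as strings)."""
    query = urlencode({"db": "pubmed", "id": ",".join(pmids), "retmode": "json"})
    return f"{EUTILS_ESUMMARY}?{query}"


def split_pmids(cited, payload):
    """Partition cited PMIDs into (found, missing) using a parsed JSON payload.

    Assumption: a valid record appears as payload["result"][pmid] and carries
    no "error" key; anything else is treated as missing.
    """
    result = payload.get("result", {})
    found, missing = [], []
    for pmid in cited:
        record = result.get(pmid)
        if isinstance(record, dict) and "error" not in record:
            found.append(pmid)
        else:
            missing.append(pmid)
    return found, missing
```

A claim would only be publishable when `missing` comes back empty; fetching the URL and decoding the JSON are left out so the sketch stays free of network calls.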

3. The 6-tier rating system

| Tier | Label | Score | Meaning |
|------|-------|-------|---------|
| S | Strong | ≥ 0.825 | Multiple high-quality systematic reviews agree in support |
| A | Moderate | 0.70–0.825 | Backed by large RCTs or meta-analyses |
| B | Preliminary | 0.55–0.70 | RCTs exist but evidence strength is limited |
| C | Weak | 0.40–0.55 | Only small trials or mechanistic studies |
| D | Counter | 0.25–0.40 | High-quality evidence shows it is ineffective or harmful |
| U | Insufficient | < 0.25 | Not enough evidence for any judgment |
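The thresholds above can be read as a simple lookup. The sketch below is one possible `score_to_tier`, assuming each band includes its lower bound and excludes its upper bound (the table does not say which tier a score of exactly 0.70 or 0.55 falls into, so that choice is an assumption).

```python
# Lower bound of each tier, highest first. Assumption: lower bounds are inclusive.
TIER_FLOORS = [("S", 0.825), ("A", 0.70), ("B", 0.55), ("C", 0.40), ("D", 0.25)]


def score_to_tier(raw_score: float) -> str:
    """Map a raw score to a tier letter using the published thresholds."""
    for tier, floor in TIER_FLOORS:
        if raw_score >= floor:
            return tier
    return "U"  # anything below 0.25


print(score_to_tier(0.83), score_to_tier(0.60), score_to_tier(0.10))
```

This is only the score-based step; the structural gates described later can still lower the result.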

Scoring formula (simplified)

raw_score = weighted_avg({
    L1_score      × 1.0,      // Examine grade
    L2_score      × 1.0,      // PubMed direction signal
    L4a_score     × 1.0,      // authoritative position (NIH ODS gets an extra ×1.5)
    L4b_score     × 0.5,      // TFDA position (lower weight)
    L5b_score     × 1.5,      // Cochrane / large RCTs
    L5c_score     × 1.5,      // Meta-analysis
    L5d_score     × 1.0,      // safety (negative adjustment)
})

→ score_to_tier(raw_score)
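To make the weighting concrete, here is a minimal Python sketch of the weighted average using the weights listed above. It rests on several assumptions that the simplified formula does not settle: layer scores are taken to lie between 0 and 1, layers with no score are simply left out, and the extra ×1.5 for NIH ODS and the negative safety adjustment for L5d are not modeled.

```python
# Weights copied from the simplified formula; everything else is illustrative.
LAYER_WEIGHTS = {
    "L1": 1.0,   # Examine grade
    "L2": 1.0,   # PubMed direction signal
    "L4a": 1.0,  # authoritative position
    "L4b": 0.5,  # TFDA position
    "L5b": 1.5,  # Cochrane / large RCTs
    "L5c": 1.5,  # meta-analysis
    "L5d": 1.0,  # safety
}


def weighted_avg(layer_scores):
    """Weighted mean over the layers that reported a score."""
    total = 0.0
    weight_sum = 0.0
    for layer, score in layer_scores.items():
        weight = LAYER_WEIGHTS[layer]  # raises KeyError for an unknown layer
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no layer scores supplied")
    return total / weight_sum


raw_score = weighted_avg({"L1": 0.8, "L2": 0.7, "L5b": 0.9, "L5c": 0.9})
print(round(raw_score, 2))
```

With these hypothetical inputs the numerator is 0.8 + 0.7 + 1.35 + 1.35 = 4.2 and the weights sum to 5.0, so the raw score is about 0.84.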

Strict tier-floor requirements

A score alone is not enough — a high tier must also clear structural gates before it can be awarded:

  • S tier requires: at least 2 Cochrane / large independent systematic reviews that agree
  • A tier requires: at least 1 meta-analysis or large RCT + an NIH ODS / Examine A grade
  • B tier requires: at least 1 RCT or an Examine B grade
  • C tier requires: at least 1 small human trial
  • D tier: high-quality evidence points predominantly toward "ineffective" or "harmful"
  • U tier: none of the above is met (insufficient data)

If the score is high but a structural gate is not cleared, the rating is forcibly downgraded.
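The gating logic can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: the `Evidence` fields are hypothetical, the A gate is read as "(meta-analysis or large RCT) and (NIH ODS support or Examine A grade)", large RCTs are counted as RCTs for the B gate, and a forced downgrade is assumed to step down to the highest tier whose gate is cleared. Tiers D and U are passed through unchanged because their criteria are not score floors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    agreeing_systematic_reviews: int = 0
    meta_analyses: int = 0
    large_rcts: int = 0
    rcts: int = 0                     # RCTs not already counted as large
    small_human_trials: int = 0
    examine_grade: Optional[str] = None
    nih_ods_support: bool = False


def clears_gate(tier: str, ev: Evidence) -> bool:
    """Return True if the evidence satisfies the structural gate for a tier."""
    if tier == "S":
        return ev.agreeing_systematic_reviews >= 2
    if tier == "A":
        strong_study = ev.meta_analyses + ev.large_rcts >= 1
        endorsed = ev.nih_ods_support or ev.examine_grade == "A"
        return strong_study and endorsed
    if tier == "B":
        return ev.rcts + ev.large_rcts >= 1 or ev.examine_grade == "B"
    if tier == "C":
        return ev.small_human_trials >= 1
    raise ValueError(f"no structural gate defined for tier {tier!r}")


def apply_tier_floor(score_tier: str, ev: Evidence) -> str:
    """Downgrade a score-based tier until a structural gate is cleared."""
    if score_tier in ("D", "U"):
        return score_tier
    ladder = ["S", "A", "B", "C"]
    for tier in ladder[ladder.index(score_tier):]:
        if clears_gate(tier, ev):
            return tier
    return "U"  # none of the gates is met


print(apply_tier_floor("S", Evidence(rcts=1)))
```

In the example, a claim whose score lands in S but whose only evidence is a single RCT fails the S and A gates and settles at B.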

4. Cross-layer sanity check (L11)

After raw_score is computed, Claude Opus makes an independent assessment without looking at any of the other layers:

  • It searches PubMed directly and reads the abstracts
  • It produces an independent grade with its own reasoning
  • If Opus's conclusion conflicts sharply with the aggregator → it triggers an escalation:
    • counter_evidence honor: if L11 judges it "ineffective" while the aggregator computed B or above → forcibly downgrade to D
    • safety_review honor: if L11 detects an active FDA warning → forcibly flag the safety_review status

This layer is designed to catch cases where "the scoring went wrong but a human reading it would immediately see the error."
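The two escalation rules can be expressed compactly. The sketch below is a hypothetical rendering: the function name, the `"ineffective"` verdict string, and the way status is passed in and out are all assumptions, and it models only the two overrides listed above.

```python
def apply_l11_escalation(aggregator_tier, l11_verdict, active_fda_warning, status):
    """Apply the L11 overrides and return the resulting (tier, status).

    aggregator_tier: tier computed from raw_score and the structural gates
    l11_verdict: the independent L11 judgment, e.g. "ineffective"
    active_fda_warning: True if L11 found an active FDA warning
    status: the claim's status label before escalation
    """
    tier = aggregator_tier
    # Counter-evidence rule: L11 says "ineffective" while the aggregator is B or above.
    if l11_verdict == "ineffective" and tier in ("S", "A", "B"):
        tier = "D"
    # Safety rule: an active FDA warning forces the safety_review status.
    if active_fda_warning:
        status = "safety_review"
    return tier, status


print(apply_l11_escalation("A", "ineffective", False, "published"))
```

Here an aggregator rating of A combined with an "ineffective" L11 verdict comes back as tier D, while the status label is left alone because no FDA warning was detected.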

5. Taiwan noise index (Consumer Market Risk)

Independent of the evidence tier, every claim also carries a consumer-market-noise level:

| Level | Icon | Meaning |
|-------|------|---------|
| high | 🚨 | Extremely high marketing intensity in the Taiwan market; PTT/Dcard chatter is chaotic; extra caution needed |
| medium | ⚠️ | Moderate marketing, but credible sources exist for cross-checking |
| low | | Clean market information; judgment rests mainly on medical evidence |

Why we present them separately: a supplement may be scientifically A tier (effective) yet 🚨 high CMR in the market (over-hyped advertising, wide quality variation among Taiwan brands) — both messages matter, and users need to see them at the same time.

6. Status labels

Each claim also carries a status that expresses "how it is handled from a media and regulatory perspective":

| Status | Meaning | Count |
|--------|---------|-------|
| published | Mainstream consensus supports it; safe to cite | 109 |
| published_with_warning | Evidence supports it but with side effects or population limits | 355 |
| disputed | Mainstream literature reaches contradictory conclusions | 89 |
| counter_evidence | High-quality evidence shows it is ineffective or harmful | 65 |
| needs_more_evidence | Not enough evidence to reach a conclusion | 45 |
| tw_blackbox | Taiwan-market information is extremely opaque | 42 |
| safety_review | FDA / EMA / TFDA has issued an active safety warning | 12 |

7. Update cadence

  • PubMed fetch: monthly (checking for newly published research)
  • Regulatory-status check: quarterly
  • TW community check: monthly (PTT / Dcard / Mobile01)
  • L11 sanity re-run: every six months (or triggered when major evidence appears)
  • Version tags: each claim page footer is stamped with engine_version + aggregated_at

If a claim has not been re-reviewed by anyone within two years, the page is flagged with a ⚠️ evidence may be outdated notice.
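The two-year staleness rule amounts to a date comparison. A minimal sketch follows, assuming "two years" means 730 days and that the last review date is available per claim (for example from `aggregated_at`; using that field is an assumption, as is the function name).

```python
from datetime import date, timedelta

# Assumption: "two years" is approximated as 2 x 365 days.
STALE_AFTER = timedelta(days=2 * 365)


def is_outdated(last_reviewed: date, today: date) -> bool:
    """Return True if the claim should carry the 'evidence may be outdated' notice."""
    return today - last_reviewed > STALE_AFTER


print(is_outdated(date(2023, 1, 1), date(2026, 1, 1)))
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check deterministic and easy to test.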

8. Accuracy & calibration

We ran a blind audit on ourselves. We randomly drew 5% of the claims (36 in total, 6 from each of the S/A/B/C/D/U tiers) and handed each one to an independent assessor, who re-graded it from scratch: re-checking PubMed, Cochrane, FDA, and NIH sources, actively searching for counter-evidence, and applying the very same public S–U criteria. The assessor did not see the engine's original tier until grading was complete.

| Metric | Result |
|--------|--------|
| Exact tier match | 39% (14 / 36) |
| Within ±1 tier | 75% (27 / 36) |
| Engine more optimistic than the independent assessor (over-rating direction) | 3 (2 of which were actually the engine being right) |
| Engine more conservative than the independent assessor | 19 |

On a 6-tier ordinal scale, within ±1 tier is the reasonable agreement band — given the same body of evidence, different reviewers routinely differ by one tier. What really matters is the direction of the error.
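The audit figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts reported in the table; the function and key names are illustrative.

```python
def audit_summary(total=36, exact=14, within_one=27, optimistic=3, conservative=19):
    """Recompute the headline audit metrics from the reported counts."""
    # Every sample is either an exact match or a disagreement in one direction.
    assert exact + optimistic + conservative == total
    return {
        "exact_pct": round(100 * exact / total),
        "within_one_pct": round(100 * within_one / total),
        "off_by_exactly_one": within_one - exact,
        "off_by_two_or_more": total - within_one,
    }


print(audit_summary())
```

The counts are internally consistent: 14 exact matches plus 3 optimistic and 19 conservative disagreements account for all 36 samples, 13 of the 22 disagreements are off by a single tier, and 9 differ by two tiers or more.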

The most important direction: for health content, the most dangerous error is over-rating the evidence, that is, calling something effective when it isn't. In that direction the engine rarely errs: of the 36 samples, only 1 rating was too lenient (chondroitin × skin aging, since corrected to U via the adjudication described below), and almost all the remaining disagreements were cases where the engine was more cautious than the independent assessor. For YMYL (Your Money or Your Life) health topics, we would rather under-rate than over-state.

An honest caveat: this was a single independent assessor, not absolute truth — it too may lean optimistic. So we treat every disagreement as a signal that "this claim should be re-examined," not as an automatic instruction to change the tier. Health ratings should not be flipped on a whim by a single opinion.

We have already done the next round: two multi-reviewer adjudication panels (3 independent reviewers per claim + structural tier-floor rules) re-reviewed 44 disputed and "insufficient-data" claims and applied 31 corrections — including 8 downgrades for "high-quality evidence proving it ineffective" (e.g., vitamin E × cardiovascular disease, kava × insomnia), as well as upgrades for definitional therapies (e.g., vitamin C × scurvy U→A, coenzyme Q10 × heart failure U→B). Every corrected claim page includes a "manually adjudicated rating" note and fully preserves the engine's originally computed tier. We re-run the audit periodically and publish updates here.

9. Known limitations

We do not pretend this engine is perfect:

  • L2 PubMed recall is limited — about 80–90% of mainstream literature is covered; obscure journals and preprints may be missed
  • L10c TW community coverage is biased toward public platforms — private Facebook groups and LINE groups cannot be scraped
  • Language bias — we mainly fetch English-language literature; some Japanese and Korean studies may be missed
  • Supplements update fast — new ingredients (such as spermidine / urolithin-A) need several months before there is enough PubMed literature to grade
  • L11 Opus carries a hallucination risk — PMID back-checking guards against fabrication, but the reasoning text may still be inaccurate

We continuously log these errors and correct the engine.

10. Cite this site

If you are a health-content creator, researcher, or developer, you are welcome to cite this site:

@misc{gptdict_2026,
  title  = "{gpt-dict.com}: A Scientific Evidence Database for Supplements",
  author = "Lenny Chen and {gpt-dict-engine contributors}",
  year   = "2026",
  url    = "https://gpt-dict.com/"
}

Or add this in your article:

<a href="https://gpt-dict.com/claim/{CLAIM_ID}/" rel="external">
  Evidence source: gpt-dict.com
</a>

We hope this database becomes the foundational evidence layer for Taiwan's health-content ecosystem.