Methodology

A transparent 7-step scientific pipeline, from PubMed to a graded conclusion.

1. Our promise

Full transparency — every tier rating is backed by verifiable calculation logic: the formula is public, the weights are public, the PMIDs are public.

A limited promise — we do not claim "the truth," only "the best currently verifiable evidence." When new evidence appears, ratings are updated.

Independence — no manufacturer sponsorship, no paid placement, no sponsored content.

2. The 7-layer evidence engine

Each claim is fetched and verified in parallel by 12 sub-agents, grouped into 7 evidence layers:

Layer	Source	What it fetches
L1	Examine.com	Existing conclusion summaries + evidence grades
L2	PubMed	Systematic reviews, meta-analyses, RCTs
L4a	FDA / EMA / NIH ODS / Harvard / AAD, etc.	Regulatory status, institutional positions
L4b	TFDA / Ministry of Health and Welfare	Taiwan regulatory status
L5b	Cochrane / large RCTs	Randomized controlled trials
L5c	Meta-analysis journals	Systematic synthesis
L5d	FDA Safety Communications / DailyMed	Side effects, contraindications
L5e	ClinicalTrials.gov	Ongoing trials
L10a	Manufacturer websites, advertising	Advertising-claim intensity
L10b	Cross-platform marketing	Advertising contamination level
L10c	PTT / Dcard / Mobile01	Taiwan community discussion
L11	Independent Claude Opus assessment	Cross-layer sanity check

Every citation in every layer must be grounded in a PMID or a traceable URL. Before publication, all PMIDs are checked against NCBI Entrez to confirm they actually exist (currently 0 hallucinated PMIDs).
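The grounding rule can be sketched as a pre-publication check. This is a minimal illustration, not the engine's actual code: the citation dictionary shape, the function name `ungrounded_citations`, and the injected `pmid_exists` lookup are all hypothetical. In practice `pmid_exists` would query NCBI Entrez; here it is passed in so the rule itself stays testable offline. The 1-to-8-digit PMID pattern is also an assumption.

```python
import re
from typing import Callable, Iterable

PMID_RE = re.compile(r"^\d{1,8}$")      # assumption: PMIDs are 1 to 8 digits
URL_RE = re.compile(r"^https?://\S+$")  # a "traceable URL" is any http(s) link


def ungrounded_citations(
    citations: Iterable[dict],
    pmid_exists: Callable[[str], bool],
) -> list[dict]:
    """Return the citations that fail the grounding rule.

    A citation passes if its PMID is well formed and confirmed by
    `pmid_exists`, or, when it has no PMID at all, if it carries a URL.
    A citation with a PMID that cannot be confirmed always fails.
    """
    failures = []
    for citation in citations:
        pmid = str(citation.get("pmid") or "")
        url = str(citation.get("url") or "")
        if PMID_RE.match(pmid) and pmid_exists(pmid):
            continue
        if not pmid and URL_RE.match(url):
            continue
        failures.append(citation)
    return failures
```

A publish step could refuse to ship any claim for which this function returns a non-empty list.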

3. The 6-tier rating system

Tier	Label	Score	Meaning
S	Strong	≥ 0.825	Multiple high-quality systematic reviews agree in support
A	Moderate	0.70–0.825	Backed by large RCTs or meta-analyses
B	Preliminary	0.55–0.70	RCTs exist but evidence strength is limited
C	Weak	0.40–0.55	Only small trials or mechanistic studies
D	Counter	0.25–0.40	High-quality evidence shows it is ineffective or harmful
U	Insufficient	< 0.25	Not enough evidence for any judgment

Scoring formula (simplified)

raw_score = weighted_avg({
    L1_score      × 1.0,      // Examine grade
    L2_score      × 1.0,      // PubMed direction signal
    L4a_score     × 1.0,      // authoritative position (NIH ODS gets an extra ×1.5)
    L4b_score     × 0.5,      // TFDA position (lower weight)
    L5b_score     × 1.5,      // Cochrane / large RCTs
    L5c_score     × 1.5,      // Meta-analysis
    L5d_score     × 1.0,      // safety (negative adjustment)
})

→ score_to_tier(raw_score)
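For readers who want to experiment with the formula above, here is a minimal Python sketch. The weights and tier thresholds are taken from this page; everything else is an assumption. Layer scores are taken to lie in 0..1, missing layers are skipped, each band includes its lower bound, the NIH ODS bonus is read as multiplying the L4a weight by 1.5, and L5d is treated as an ordinary weighted term, because the exact form of its "negative adjustment" is not specified here.

```python
# Layer weights copied from the simplified formula.
WEIGHTS = {
    "L1": 1.0, "L2": 1.0, "L4a": 1.0, "L4b": 0.5,
    "L5b": 1.5, "L5c": 1.5, "L5d": 1.0,
}

# Lower bound of each tier band; anything below 0.25 is U.
TIER_FLOORS = [("S", 0.825), ("A", 0.70), ("B", 0.55), ("C", 0.40), ("D", 0.25)]


def raw_score(layer_scores: dict[str, float], l4a_is_nih_ods: bool = False) -> float:
    """Weighted average of whichever layer scores are present."""
    total = 0.0
    weight_sum = 0.0
    for layer, score in layer_scores.items():
        weight = WEIGHTS[layer]
        if layer == "L4a" and l4a_is_nih_ods:
            weight *= 1.5  # assumed reading of "NIH ODS gets an extra x1.5"
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no layer scores supplied")
    return total / weight_sum


def score_to_tier(score: float) -> str:
    """Map a raw score to the highest tier whose lower bound it reaches."""
    for tier, floor in TIER_FLOORS:
        if score >= floor:
            return tier
    return "U"


example = {"L1": 0.8, "L2": 0.7, "L4a": 0.9, "L5b": 0.85, "L5c": 0.8, "L5d": 0.6}
print(score_to_tier(raw_score(example, l4a_is_nih_ods=True)))  # A
```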

Strict tier-floor requirements

A score alone is not enough — a high tier must also clear structural gates before it can be awarded:

S tier requires: at least 2 Cochrane / large independent systematic reviews that agree
A tier requires: at least 1 meta-analysis or large RCT + an NIH ODS / Examine A grade
B tier requires: at least 1 RCT or an Examine B grade
C tier requires: at least 1 small human trial
D tier: high-quality evidence points predominantly toward "ineffective" or "harmful"
U tier: none of the above is met (insufficient data)

If the score is high but a structural gate is not cleared, the rating is forcibly downgraded.
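The gating step can be sketched as follows. This is an illustrative reading, not the engine's code: the `Evidence` fields are hypothetical names, "forcibly downgraded" is assumed to mean "moved down to the highest tier whose gate is cleared", and the A gate is read as requiring both a meta-analysis or large RCT and an NIH ODS / Examine A grade.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical evidence summary for one claim."""
    agreeing_systematic_reviews: int = 0  # Cochrane / large independent reviews
    meta_or_large_rct: int = 0
    authority_a_grade: bool = False       # NIH ODS / Examine A grade
    rcts: int = 0
    examine_b_grade: bool = False
    small_human_trials: int = 0
    predominantly_negative: bool = False  # evidence points to ineffective/harmful


GATES = {
    "S": lambda e: e.agreeing_systematic_reviews >= 2,
    "A": lambda e: e.meta_or_large_rct >= 1 and e.authority_a_grade,
    "B": lambda e: e.rcts >= 1 or e.examine_b_grade,
    "C": lambda e: e.small_human_trials >= 1,
}
ORDER = ["S", "A", "B", "C"]


def apply_tier_floor(score_tier: str, ev: Evidence) -> str:
    """Walk down from the score-based tier until a gate is cleared.

    The result is never higher than `score_tier`.
    """
    if score_tier in ORDER:
        for tier in ORDER[ORDER.index(score_tier):]:
            if GATES[tier](ev):
                return tier
    if score_tier != "U" and ev.predominantly_negative:
        return "D"
    return "U"
```

For example, a claim that scores in the S band but has only one agreeing systematic review, one meta-analysis and an Examine A grade would land on A under these assumptions.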

4. Cross-layer sanity check (L11)

After raw_score is computed, Claude Opus makes an independent assessment without looking at any of the other layers:

It searches PubMed directly and reads the abstracts
It produces an independent grade with its own reasoning
If Opus's conclusion conflicts sharply with the aggregator's, an escalation is triggered:
- counter_evidence honor: if L11 judges the claim "ineffective" while the aggregator computed B or above, the tier is forcibly downgraded to D
- safety_review honor: if L11 detects an active FDA warning, the safety_review status is forcibly set

This layer is designed to catch cases where "the scoring went wrong but a human reading it would immediately see the error."
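The two escalation rules amount to a small override function. The sketch below is an assumption-laden illustration: the function name, the argument names, and the `"ineffective"` verdict string are hypothetical, and it applies only the two rules listed in this section.

```python
def apply_l11_escalation(
    aggregator_tier: str,
    status: str,
    l11_verdict: str,
    l11_active_fda_warning: bool,
) -> tuple[str, str]:
    """Return (tier, status) after applying the two L11 override rules."""
    tier = aggregator_tier
    # Rule 1: L11 says "ineffective" but the aggregator computed B or above.
    if l11_verdict == "ineffective" and tier in {"S", "A", "B"}:
        tier = "D"
    # Rule 2: an active FDA warning always forces the safety_review status.
    if l11_active_fda_warning:
        status = "safety_review"
    return tier, status
```

Keeping the overrides in one pure function makes each forced change easy to log next to the originally computed tier.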

5. Taiwan noise index (Consumer Market Risk, CMR)

Independent of the evidence tier, every claim also carries a consumer-market-noise level:

Level	Icon	Meaning
high	🚨	Extremely high marketing intensity in the Taiwan market; PTT/Dcard chatter is chaotic; extra caution needed
medium	⚠️	Moderate marketing, but credible sources exist for cross-checking
low	✨	Clean market information; judgment rests mainly on medical evidence

Why we present them separately: a supplement may be scientifically A tier (effective) yet 🚨 high CMR in the market (over-hyped advertising, wide quality variation among Taiwan brands) — both messages matter, and users need to see them at the same time.
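One way to keep the two signals independent in data is to store them as separate fields and render them side by side. The record below is a hypothetical sketch; the class name, field names, and badge format are illustrative and not the site's actual schema.

```python
from dataclasses import dataclass

CMR_ICONS = {"high": "🚨", "medium": "⚠️", "low": "✨"}


@dataclass(frozen=True)
class ClaimRating:
    tier: str  # evidence tier: S, A, B, C, D or U
    cmr: str   # consumer market risk: "high", "medium" or "low"

    def badge(self) -> str:
        """Render both signals together so neither hides the other."""
        return f"Tier {self.tier} | CMR {CMR_ICONS[self.cmr]} {self.cmr}"


print(ClaimRating(tier="A", cmr="high").badge())
```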

6. Status labels

Each claim also carries a status that expresses "how it is handled from a media and regulatory perspective":

Status	Meaning	Count
published	Mainstream consensus supports it; safe to cite	109
published_with_warning	Evidence supports it but with side effects or population limits	355
disputed	Mainstream literature reaches contradictory conclusions	89
counter_evidence	High-quality evidence shows it is ineffective or harmful	65
needs_more_evidence	Not enough evidence to reach a conclusion	45
tw_blackbox	Taiwan-market information is extremely opaque	42
safety_review	FDA / EMA / TFDA has issued an active safety warning	12

7. Update cadence

PubMed fetch: monthly (checking for newly published research)
Regulatory-status check: quarterly
TW community check: monthly (PTT / Dcard / Mobile01)
L11 sanity re-run: every six months (or triggered when major evidence appears)
Version tags: each claim page footer is stamped with engine_version + aggregated_at

If a claim has not been re-reviewed within two years, its page is flagged with a ⚠️ "evidence may be outdated" notice.
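The two-year staleness rule can be sketched with the standard library alone. The function names are hypothetical, and the sketch assumes "within two years" means the last review falls on or after the same calendar date two years earlier.

```python
from datetime import date


def two_years_before(day: date) -> date:
    """Same calendar date two years earlier, clamping 29 February to 28."""
    try:
        return day.replace(year=day.year - 2)
    except ValueError:  # 29 February has no counterpart two years back
        return day.replace(year=day.year - 2, day=28)


def is_outdated(last_reviewed: date, today: date) -> bool:
    """True when the last review is more than two years old."""
    return last_reviewed < two_years_before(today)
```

A page renderer could call `is_outdated(claim_last_reviewed, date.today())` to decide whether to show the notice.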

8. Accuracy & calibration

We ran a blind audit on ourselves. We randomly drew 5% of the claims (36 in total, 6 from each of the S/A/B/C/D/U tiers) and handed each one to an independent assessor who re-graded it from scratch: re-checking PubMed / Cochrane / FDA / NIH, actively searching for counter-evidence, applying the same public S–U criteria, and never seeing the engine's original tier until grading was done.

Metric	Result
Exact tier match	39% (14 / 36)
Within ±1 tier	75% (27 / 36)
Engine more optimistic than the independent assessor (over-rating direction)	3 (in 2 of these, the engine turned out to be right)
Engine more conservative than the independent assessor	19

On a 6-tier ordinal scale, within ±1 tier is the reasonable agreement band — given the same body of evidence, different reviewers routinely differ by one tier. What really matters is the direction of the error.
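The agreement metrics in the table can be computed from paired gradings as sketched below. The tier order S, A, B, C, D, U (strongest first) is assumed from the score bands, and the function and key names are hypothetical. The last line re-derives the reported percentages from the published counts.

```python
TIER_ORDER = "SABCDU"  # assumed ordinal order, strongest first


def agreement(pairs: list[tuple[str, str]]) -> dict[str, int]:
    """Count agreement between (engine_tier, assessor_tier) pairs."""
    counts = {
        "exact": 0,
        "within_1": 0,
        "engine_optimistic": 0,
        "engine_conservative": 0,
    }
    for engine, assessor in pairs:
        gap = TIER_ORDER.index(engine) - TIER_ORDER.index(assessor)
        if gap == 0:
            counts["exact"] += 1
        elif gap < 0:
            counts["engine_optimistic"] += 1    # engine tier is higher
        else:
            counts["engine_conservative"] += 1  # engine tier is lower
        if abs(gap) <= 1:
            counts["within_1"] += 1
    return counts


# The published counts reproduce the reported percentages.
print(round(100 * 14 / 36), round(100 * 27 / 36))  # 39 75
```

Note that the three directional counts in the table (14 exact, 3 optimistic, 19 conservative) add up to the 36 sampled claims.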

The most important direction: for health content, the most dangerous error is over-rating the evidence, that is, calling something effective when it isn't. In that direction the engine almost never errs: out of 36 samples, only 1 rating was too lenient (chondroitin × skin aging, since corrected to U via the adjudication described below), and almost all the remaining disagreements were the engine being more cautious than the independent assessor. For YMYL (Your Money or Your Life) health topics, we would rather under-rate than over-state.

An honest caveat: this was a single independent assessor, not absolute truth — it too may lean optimistic. So we treat every disagreement as a signal that "this claim should be re-examined," not as an automatic instruction to change the tier. Health ratings should not be flipped on a whim by a single opinion.

We have already done the next round: two multi-reviewer adjudication panels (3 independent reviewers per claim + structural tier-floor rules) re-reviewed 44 disputed and "insufficient-data" claims and applied 31 corrections — including 8 downgrades for "high-quality evidence proving it ineffective" (e.g., vitamin E × cardiovascular disease, kava × insomnia), as well as upgrades for definitional therapies (e.g., vitamin C × scurvy U→A, coenzyme Q10 × heart failure U→B). Every corrected claim page includes a "manually adjudicated rating" note and fully preserves the engine's originally computed tier. We re-run the audit periodically and publish updates here.

9. Known limitations

We do not pretend this engine is perfect:

L2 PubMed recall is limited — about 80–90% of mainstream literature is covered; obscure journals and preprints may be missed
L10c TW community coverage is biased toward public platforms — private Facebook groups and LINE groups cannot be scraped
Language bias — we mainly fetch English-language literature; some Japanese and Korean studies may be missed
Supplements update fast — new ingredients (such as spermidine / urolithin-A) need several months before there is enough PubMed literature to grade
L11 Opus carries a hallucination risk — PMID back-checking guards against fabrication, but the reasoning text may still be inaccurate

We continuously log these errors and correct the engine.

10. Cite this site

If you are a health-content creator, researcher, or developer, you are welcome to cite this site:

@misc{gptdict_2026,
  title  = "{gpt-dict.com}: A Scientific Evidence Database for Supplements",
  author = "Lenny Chen and gpt-dict-engine contributors",
  year   = "2026",
  url    = "https://gpt-dict.com/"
}

Or add this in your article:

<a href="https://gpt-dict.com/claim/{CLAIM_ID}/" rel="external">
  Evidence source: gpt-dict.com
</a>

We hope this database becomes the foundational evidence layer for Taiwan's health-content ecosystem.