Does Schema Markup Predict AI Citation? A Cross-Platform Empirical Study of Structured Data and Generative Engine Optimization
Kurt Fischman
PAPER · v1.0 · 2026-02-22 · human
Abstract
This study examines whether JSON-LD schema markup independently predicts the probability that a web page will be cited in AI-generated responses. We collected 730 AI citations from ChatGPT (GPT-4o with web browsing) and Gemini (1.5 Pro with search grounding) across 75 commercial queries spanning five categories: SaaS and Technology, Health and Medical, Finance and Insurance, Professional Services, and How-To and DIY. Google top-10 organic results for the same queries were collected via SerpAPI as a control set, yielding 1,006 total unique pages analyzed for schema characteristics and domain authority (Ahrefs DR). Initial pooled analysis produced a significant negative association between schema presence and AI citation (OR = 0.546, p < .001), suggesting that schema actively reduced citation probability. This finding proved to be a methodological artifact: Google's top-10 organic results are systematically enriched for schema-bearing pages, which inflates schema prevalence in the non-cited control population. A within-Google diagnostic revealed that schema prevalence among AI-cited and non-cited Google pages was statistically indistinguishable (43.1% vs. 44.8%), collapsing the apparent effect entirely. Corrected models using Generalized Estimating Equations with query-clustered standard errors produced null results for schema presence (OR = 0.678, p = .296), entity richness score (OR = 1.001, p = .833), and schema-to-query alignment (OR = 1.068, p = .626). The dominant predictor of AI citation was Google organic rank position (OR = 0.762 per position, p < .001). Position-1 pages were cited in 43% of queries in which they appeared, declining to 5% at position 7. The odds ratio implies that each one-position drop in rank reduces citation odds by approximately 24%, and the gradient indicates that AI citation behavior is substantially mediated by the search backend ranking that precedes AI-level content evaluation.
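The "approximately 24%" figure follows directly from the reported per-position odds ratio. The sketch below shows the arithmetic; the function names are illustrative and not part of the study's analysis code. The compounding step assumes the per-position odds ratio is constant across ranks, and it describes odds, not citation probabilities.

```python
# Minimal sketch: interpreting a per-position odds ratio.
# Function names are hypothetical; only OR = 0.762 comes from the abstract.

def percent_change_in_odds(odds_ratio: float) -> float:
    """Percent change in odds associated with a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


def cumulative_odds_multiplier(odds_ratio: float, positions: int) -> float:
    """Factor applied to the odds after dropping `positions` ranks,
    assuming the same odds ratio applies at every step."""
    return odds_ratio ** positions


OR_PER_POSITION = 0.762  # reported odds ratio per Google rank position

print(f"{percent_change_in_odds(OR_PER_POSITION):.1f}%")  # -23.8%
print(f"{cumulative_odds_multiplier(OR_PER_POSITION, 6):.3f}")
```

Because the odds ratio acts multiplicatively on odds, the percentage drop in citation probability between two ranks depends on the baseline probability and will generally differ from 24%.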
One significant exception emerged: pages implementing Product or Review schema with populated concrete attribute fields (pricing, aggregateRating, specifications) were cited at substantially higher rates than pages implementing generic schema types such as Article, Organization, or BreadcrumbList (61.7% vs. 41.6%, p = .012). This attribute-rich advantage was most pronounced among lower-authority domains (DR ≤ 60), consistent with the interpretation that factual payload in structured data partially compensates for weak authority signals.
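To make the attribute-rich versus generic distinction concrete, the sketch below contrasts two JSON-LD objects and applies a simple check. The example values, the `is_attribute_rich` helper, and the choice of fields in `CONCRETE_FIELDS` are assumptions for illustration; they are not the coding scheme used in the study. The property names themselves (`offers`, `aggregateRating`, `additionalProperty`, `reviewRating`) are standard schema.org vocabulary.

```python
# Minimal sketch: distinguishing attribute-rich from generic schema markup.
# All values are made up; the classification rule is a hypothetical stand-in.

product_page = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "offers": {"@type": "Offer", "price": "49.00", "priceCurrency": "USD"},
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.6",
        "reviewCount": "212",
    },
}

generic_page = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Choosing a widget",
}

ATTRIBUTE_RICH_TYPES = {"Product", "Review"}
CONCRETE_FIELDS = ("offers", "aggregateRating", "additionalProperty", "reviewRating")


def is_attribute_rich(node: dict) -> bool:
    """True if the node is a Product or Review with at least one populated concrete field."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]  # JSON-LD allows @type to be a string or a list
    if not ATTRIBUTE_RICH_TYPES.intersection(types):
        return False
    return any(node.get(field) for field in CONCRETE_FIELDS)


print(is_attribute_rich(product_page))  # True
print(is_attribute_rich(generic_page))  # False
```

A Product node that declares only a `name` would be classified as not attribute-rich under this rule, which mirrors the abstract's emphasis on populated fields and not on the schema type alone.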