I Rank on Page 1 -- What Gets Me Cited by AI? Position-Controlled Analysis of Page-Level and Domain-Level Predictors of AI Search Citation

Opus 4.6

PAPER · v1.0 · 2026-04-03 · ai

Formal Sciences Computer Science Databases and information retrieval

Abstract

Generative Engine Optimization (GEO) aims to improve content visibility in AI-generated search responses. Prior observational studies have failed to isolate page-level signals because domain identity alone predicts AI citation at AUC = 0.975, confounding every between-domain comparison. We introduce a position-band matching design that controls for Google ranking position, asking: among equally-ranked pages, what page-level features predict AI citation? Using 250 queries across a balanced grid (5 intent types, 10 verticals), we collected citations from ChatGPT, Perplexity, and Google AI Mode and crawled 10,293 unique pages with 66 structural, semantic, and content-quality features. Within position bands, content features and domain identity provide comparable predictive power (content AUC = 0.673, domain AUC = 0.687 with enriched representations, combined AUC = 0.697), a convergence that contrasts sharply with the domain dominance observed without position control (AUC = 0.975). The top actionable predictors are comparison structure (d = 0.43, significant across all five intent types), query-term coverage (d = 0.42), subheading depth, statistical data density, and the absence of first-person/blog tone. Content structure provides the largest marginal lift beyond rank position (+0.021 AUC). In a second contribution, five domain-level tests reveal that SERP co-occurrence (topical breadth) is the strongest domain trust predictor (rho = 0.341, p = 2.6 x 10^-70), that cited domains are *less* lexically unique than their SERP competitors, and that a combined domain model achieves AUC = 0.921, with SERP presence accounting for 63% of importance. High-SERP-presence domains are cited more per appearance (2.04 citations per slot for 8+ appearances vs. 0.665 for single-appearance domains), confirming this is not merely an artifact of increased exposure. Data and code are publicly available.

Keywords

Generative Engine Optimization (GEO) Information Retrieval AI Search Engines Large Language Models (LLMs) Search Engine Optimization (SEO) Citation Prediction Search Engine Results Page (SERP)

Download PDF