AI Knowledge, Product Databases & Google’s Gemini: How It Really Works
Updated with data and cases: traffic impact stats, EU/UK regulatory actions, and U.S. lawsuits.
Introduction
Artificial intelligence (AI) systems like ChatGPT and Google’s Gemini are changing how people access information.
Instead of clicking through a list of links, users expect direct, conversational answers. That shift raises three big questions:
- Why do AIs like ChatGPT have the knowledge they do?
- Will every product or brand eventually be stored in an AI database?
- When Gemini appears at the top of Google, is it just scraping page-one websites and regurgitating them?
This article answers those questions plainly, then backs it up with fresh data, case examples, and a legal timeline you can cite.
Part 1 — Why does an AI like ChatGPT have this knowledge?
How models learn (not memorize)
Large language models (LLMs) are trained on vast text corpora (licensed data, publicly available web pages, books, documentation, and more).
The training doesn’t store whole documents like a search index;
instead, the model learns statistical patterns in language and concepts.
This lets it generate new sentences that are consistent with what it has learned.
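A deliberately tiny illustration of that idea: the bigram counter below (a toy corpus and a toy sampler, nothing like a production model) keeps only statistics about which word tends to follow which, yet it can still produce new sentences it never saw verbatim.

```python
import random
from collections import defaultdict, Counter

# Toy "training corpus" -- real models train on billions of tokens.
corpus = [
    "the phone has a bright display",
    "the phone has a long battery life",
    "the laptop has a bright display",
]

# "Training": count which word follows which (a bigram model).
# Nothing below stores a sentence verbatim; only co-occurrence
# statistics are kept, which is the (greatly simplified) spirit of
# how LLMs compress patterns into parameters.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def generate(start: str, length: int = 6) -> str:
    """Sample a new sentence from the learned statistics."""
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        words, counts = zip(*options.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the laptop has a long battery life"
```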
Frozen snapshots + live lookups
An LLM is a snapshot of knowledge as of its training cutoff. To cover recent events or niche items, it can retrieve
information from the web in real time. That combo — core knowledge + retrieval — is why answers can feel both broad and current.
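A minimal sketch of that split, with an invented cutoff date and stub web_search / llm_answer helpers standing in for a real search API and a real model call; the point is only the control flow, not any vendor's implementation.

```python
from datetime import date
from typing import Optional

TRAINING_CUTOFF = date(2024, 6, 1)  # illustrative, not a real model's cutoff

def web_search(query: str) -> list[str]:
    """Stand-in for a live retrieval call (search API, product feed, etc.)."""
    return [f"[fresh document about: {query}]"]

def llm_answer(question: str, context: Optional[list[str]] = None) -> str:
    """Stand-in for the model itself; a real LLM call would go here."""
    if context:
        return f"Answer to '{question}' grounded in {len(context)} retrieved document(s)."
    return f"Answer to '{question}' from knowledge frozen at {TRAINING_CUTOFF}."

def needs_fresh_data(question: str) -> bool:
    """Crude heuristic: recency words suggest facts newer than the training cutoff."""
    recency_markers = ("today", "latest", "current price", "2025")
    return any(marker in question.lower() for marker in recency_markers)

def answer(question: str) -> str:
    if needs_fresh_data(question):
        # Core knowledge + retrieval: ground the model in fresh documents.
        return llm_answer(question, context=web_search(question))
    # Otherwise rely on knowledge compressed into the model's parameters.
    return llm_answer(question)

print(answer("Who wrote Hamlet?"))
print(answer("What is the latest Pixel phone?"))
```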
Why it feels like “everything”
- Scale: training spans billions of tokens across many domains.
- Compression: concepts are distilled into parameters rather than stored verbatim.
- Generalization: even when a specific brand isn’t seen, the model can infer from similar examples.
Part 2 — Will every product or brand be stored in an AI database?
Short answer: no — not literally every one.
Products churn constantly. Many local or micro-niche items never get documented in structured public data.
Some product databases are proprietary or behind paywalls. In practice, AI systems will maintain a broad base of mainstream,
well-documented entities and augment it with real-time retrieval for the long tail.
Centralized vs. federated futures
- Centralized knowledge: a large internal model of common entities.
- Federated/dynamic: the AI calls external APIs, product feeds, and websites on demand.
Expect a hybrid: core facts internally + just-in-time lookups. For brands, the practical implication is simple:
make products discoverable and machine-readable (product schema, feeds, authoritative listings).
That maximizes your odds of being included when an AI assembles an answer.
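As one hedged example of what "machine-readable" means in practice, the short Python script below emits a schema.org Product block for an invented product; the name, SKU, price, and URL are placeholders to adapt to your own catalog.

```python
import json

# Hypothetical product data -- replace with your own catalog fields.
product_jsonld = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Acme Trail Running Shoe",
    "sku": "ACME-TR-042",
    "brand": {"@type": "Brand", "name": "Acme"},
    "description": "Lightweight trail running shoe with a recycled-mesh upper.",
    "offers": {
        "@type": "Offer",
        "price": "129.00",
        "priceCurrency": "EUR",
        "availability": "https://schema.org/InStock",
        "url": "https://example.com/products/trail-running-shoe",
    },
}

# Emit the <script> tag you would place in the product page's HTML.
print('<script type="application/ld+json">')
print(json.dumps(product_jsonld, indent=2))
print("</script>")
```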
Part 3 — When Gemini appears on Google, is it just scraping and regurgitating?
Not exactly. What you see as “Gemini” in search (often labeled as AI Overviews) is the output of a pipeline:
- Retrieval: Google crawls and indexes the web as usual.
- Ranking/Filtering: relevant, authoritative sources are selected.
- LLM Synthesis: a model (Gemini) synthesizes a short, natural-language answer.
- Citations: links are often shown to underlying sources.
That’s not copy-paste scraping; it’s model-driven synthesis informed by top sources. However, publishers argue the design
diverts clicks from their sites to Google’s summary layer — which brings us to the data and the lawsuits.
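For intuition, here is a toy version of that pipeline shape, not Google's actual system: the documents, authority scores, and placeholder summary are all invented, but it shows synthesis conditioned on ranked sources, with their URLs carried forward as citations rather than their text pasted in.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    url: str
    text: str
    authority: float  # illustrative ranking signal

def retrieve(query: str, index: list[Doc]) -> list[Doc]:
    """Step 1: pull candidate documents (stands in for the web index)."""
    terms = query.lower().split()
    return [d for d in index if any(t in d.text.lower() for t in terms)]

def rank(candidates: list[Doc], top_k: int = 3) -> list[Doc]:
    """Step 2: keep the most relevant/authoritative sources."""
    return sorted(candidates, key=lambda d: d.authority, reverse=True)[:top_k]

def synthesize(query: str, sources: list[Doc]) -> dict:
    """Steps 3-4: an LLM would write a fresh summary from the sources; a
    placeholder string stands in here, while the citations are the URLs
    carried through from the ranked documents."""
    summary = f"Summary for '{query}' based on {len(sources)} source(s)."
    return {"answer": summary, "citations": [d.url for d in sources]}

index = [
    Doc("https://example.com/review", "in-depth review of the acme shoe", 0.9),
    Doc("https://example.org/forum", "forum thread mentioning the acme shoe", 0.4),
]
print(synthesize("acme shoe review", rank(retrieve("acme shoe review", index))))
```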
Part 4 — What the data says about traffic & clicks
Measured click-through declines when AI Overviews appear
- Ahrefs (large-scale study): across roughly 300,000 keywords, the click-through rate for the top-ranking result dropped by about 34.5% on SERPs with AI Overviews versus similar informational queries without them.
- Pew Research (July 22, 2025): users who saw an AI summary clicked a traditional result in only 8% of visits, versus 15% when no AI summary appeared, roughly half as often.
Publisher-level impacts & reporting
- Penske Media (Rolling Stone, Billboard, Variety) lawsuit: alleges AI Overviews reduce traffic and affiliate revenue; the first major U.S. publisher case (Reuters, Axios).
- Media analysis & press: multiple outlets and studies, including Ars Technica and Fortune, report significant declines in clicks when AI Overviews are present.
Bottom line: multiple independent sources now show that when AI Overviews appear, users are much less likely to click through to websites. That’s the core of the publisher backlash.
Part 5 — Legal cases & regulatory actions you can cite
United States
- Chegg, Inc. v. Google LLC (filed Feb 24, 2025): antitrust lawsuit alleging AI Overviews siphon traffic and harm revenue (docket; press coverage).
- Penske Media Corp. v. Google (filed Sept 13, 2025): the first major U.S. publisher suit targeting AI Overviews (Reuters, The Verge).
- U.S. v. Google (ad tech, E.D. Va.): on April 17, 2025, Judge Leonie Brinkema ruled that Google illegally monopolized the publisher ad server and ad exchange markets, relevant context for publisher harm claims (DOJ release; opinion PDF).
France / EU
- €500m fine (July 13, 2021): the Autorité de la concurrence fined Google for failing to negotiate in good faith with publishers over neighbouring rights (Reuters).
- €250m fine (Mar 20, 2024): the Autorité de la concurrence found Google breached its commitments, including using publisher content to train Bard/Gemini without adequate notice (authority press release, Reuters, TechCrunch).
- EU antitrust complaint (July 4, 2025): the Independent Publishers Alliance filed against AI Overviews and requested interim measures (Reuters, PYMNTS, TechCrunch).
- Germany (Sept 18–20, 2025): a media and digital industry alliance filed a DSA complaint against AI Overviews for siphoning traffic (Corint Media, MLex coverage).
United Kingdom
- CMA foundation model reviews (2023–2024): competition and consumer-protection principles for AI markets (CMA case page, initial report PDF).
- Strategic Market Status (SMS) process under the DMCCA, focused on Google Search (2025): proposed-decision documents bring AI Overviews/AI Mode within the scope of "general search," enabling ex-ante remedies (CMA overview; stakeholder response citing CMA documents).
- Foxglove complaints (Aug 10, 2025): filings with the European Commission and the UK CMA arguing that AI Overviews are an existential threat to independent news (WAN-IFRA, Foxglove).
Part 6 — What this means for brands, SEO & content strategy
- AI readiness beats “just keywords”: use structured data (Product, FAQ, HowTo) and maintain authoritative, updated pages; see the markup sketch after this list.
- Be the canonical source: original research, documentation, pricing, and specs earn citations in AI answers.
- Own your entity: consistent brand/entity markup across site, profiles, and feeds improves recognition.
- Diversify traffic: build email, social, partnerships, and direct navigation to reduce dependency on any one SERP format.
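Picking up the structured-data bullet above, this is a hedged FAQPage markup sketch with invented questions, answers, and product name; the same pattern extends to Product and HowTo types.

```python
import json

# Hypothetical FAQ content -- swap in questions your customers actually ask.
faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Does the Acme Trail Running Shoe run true to size?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes, most runners order their usual size.",
            },
        },
        {
            "@type": "Question",
            "name": "What is the return window?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Unworn shoes can be returned within 30 days.",
            },
        },
    ],
}

# Emit the <script> tag you would place in the FAQ page's HTML.
print('<script type="application/ld+json">')
print(json.dumps(faq_jsonld, indent=2))
print("</script>")
```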
Direct answers to your three questions
- Why do I (AI) have this knowledge? Because I’m trained on large, diverse text corpora and can blend that with fresh retrieval. I learn patterns and relationships, not verbatim pages.
- Will every product/brand be stored? No. The web changes too fast, and much data is private or undocumented. The winning pattern is hybrid: core knowledge + real-time lookups.
- Is Gemini just scraping the first page? It’s synthesis, not copy-paste. But the synthesis layer sits above organic results and, per multiple studies, draws clicks away from publishers — hence the ongoing legal and regulatory pushback.