Worried your brand won’t show up in AI search?

Worried about your products or brands?
Author TGBarker 23 September 2025

AI Knowledge, Product Databases & Google’s Gemini: How It Really Works

Updated with data and cases: traffic impact stats, EU/UK regulatory actions, and U.S. lawsuits.

Introduction

Artificial intelligence (AI) systems like ChatGPT and Google’s Gemini are changing how people access information.
Instead of clicking through a list of links, users expect direct, conversational answers. That shift raises three big questions:

  1. Why do AIs like ChatGPT have the knowledge they do?
  2. Will every product or brand eventually be stored in an AI database?
  3. When Gemini appears at the top of Google, is it just scraping page-one websites and regurgitating them?

This article answers those questions plainly, then backs them up with fresh data, case examples, and a legal timeline you can cite.

Part 1 — Why does an AI like ChatGPT have this knowledge?

How models learn (not memorize)

Large language models (LLMs) are trained on vast text corpora (licensed data, publicly available web pages, books, documentation, and more).
The training doesn’t store whole documents like a search index;
instead, the model learns statistical patterns in language and concepts.
This lets it generate new sentences that are consistent with what it has learned.
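
To make "patterns, not pages" concrete, here is a toy sketch in Python. It is a word-pair (bigram) counter, which is vastly simpler than a real LLM and is used here only as an illustrative stand-in. The three-sentence corpus and the function names are invented for this example.

```python
import random
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows another across the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_words=8, seed=0):
    """Sample a sentence word by word from the learned follow-up counts."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(max_words - 1):
        followers = counts.get(words[-1])
        if not followers:
            break  # no known continuation for this word
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

# Hypothetical training text, invented for illustration.
corpus = [
    "the phone has a bright screen",
    "the laptop has a fast processor",
    "the phone has a fast charger",
]
model = train_bigram(corpus)
print(generate(model, "the"))
```

The model keeps only counts of which word tends to follow which, not the sentences themselves. It can therefore produce a sentence such as "the laptop has a bright screen" that appears nowhere in the corpus, which is the same idea, at miniature scale, as an LLM generating text consistent with what it has learned.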

Frozen snapshots + live lookups

An LLM is a snapshot of knowledge as of its training cutoff. To cover recent events or niche items, a system built around it can retrieve
information from the web in real time. That combination of core knowledge plus retrieval is why answers can feel both broad and current.
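
The snapshot-plus-lookup idea can be sketched as a simple fallback. Everything below is an assumption made for illustration: the cutoff date, the stored fact, the product "Acme X200", and the dictionary standing in for a live web search. A real assistant decides when to retrieve in far more sophisticated ways.

```python
from datetime import date

# Hypothetical cutoff and facts, chosen only for illustration.
TRAINING_CUTOFF = date(2024, 6, 1)
CORE_KNOWLEDGE = {
    "what is usb-c": "USB-C is a reversible connector standard for data and power.",
}

# Stand-in for a live web search; a real system would call a search tool here.
LIVE_SOURCES = {
    "acme x200 price": (
        "The Acme X200 is listed at 49.00 GBP.",
        "https://example.com/acme-x200",
    ),
}

def answer(query):
    """Answer from the frozen snapshot if possible, otherwise from a live lookup."""
    key = query.lower().strip("?! .")
    if key in CORE_KNOWLEDGE:
        source = f"model snapshot as of {TRAINING_CUTOFF.isoformat()}"
        return {"text": CORE_KNOWLEDGE[key], "source": source}
    if key in LIVE_SOURCES:
        text, url = LIVE_SOURCES[key]
        return {"text": text, "source": url}
    return {"text": "No reliable information found.", "source": None}

print(answer("What is USB-C?"))
print(answer("Acme X200 price"))
```

The general question is answered from the snapshot, while the niche product question falls through to the lookup and comes back with a URL attached.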

Why it feels like “everything”

  • Scale: training spans billions to trillions of tokens across many domains.
  • Compression: concepts are distilled into parameters rather than stored verbatim.
  • Generalization: even when a specific brand isn’t seen, the model can infer from similar examples.

Part 2 — Will every product or brand be stored in an AI database?

Short answer: no — not literally every one.

Products churn constantly. Many local or micro-niche items never get documented in structured public data.
Some product databases are proprietary or behind paywalls. In practice, AI systems will maintain a broad base of mainstream,
well-documented entities and augment it with real-time retrieval for the long tail.

Centralized vs. federated futures

  • Centralized knowledge: a large internal model of common entities.
  • Federated/dynamic: the AI calls external APIs, product feeds, and websites on demand.

Expect a hybrid: core facts internally + just-in-time lookups. For brands, the practical implication is simple:
make products discoverable and machine-readable (product schema, feeds, authoritative listings).
That maximizes your odds of being included when an AI assembles an answer.
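
As one example of "machine-readable", the sketch below builds schema.org Product markup as JSON-LD using Python. The property names (name, brand, sku, offers, price, priceCurrency, availability) come from the schema.org vocabulary. The product, brand, SKU, price, and URL are placeholders, and the helper function is hypothetical.

```python
import json

def product_jsonld(name, brand, sku, price, currency, url):
    """Build a schema.org Product description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "brand": {"@type": "Brand", "name": brand},
        "sku": sku,
        "url": url,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": "https://schema.org/InStock",
        },
    }

# Placeholder values, invented for illustration.
markup = product_jsonld(
    name="Acme X200 Charger",
    brand="Acme",
    sku="ACME-X200",
    price=49,
    currency="GBP",
    url="https://example.com/acme-x200",
)
print(json.dumps(markup, indent=2))
```

The printed JSON is typically embedded in the product page inside a script tag of type application/ld+json, so crawlers can read the name, brand, and price without parsing the visible layout.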

Part 3 — When Gemini appears on Google, is it just scraping and regurgitating?

Not exactly. What you see as “Gemini” in search (often labeled as AI Overviews) is the output of a pipeline:

  1. Retrieval: Google crawls and indexes the web as usual.
  2. Ranking/Filtering: relevant, authoritative sources are selected.
  3. LLM Synthesis: a model (Gemini) synthesizes a short, natural-language answer.
  4. Citations: links are often shown to underlying sources.
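
The four steps above can be sketched as a toy pipeline. This is not Google's implementation. The pages, the authority scores, the word-overlap scoring, and the synthesize placeholder (which stands in for a call to a language model) are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str
    authority: float  # hypothetical 0..1 trust score

# A tiny stand-in for a search index.
INDEX = [
    Page("https://example.com/chargers", "usb c chargers deliver fast charging for phones", 0.9),
    Page("https://example.org/review", "review of fast phone chargers and cables", 0.6),
    Page("https://example.net/recipes", "slow cooker recipes for winter", 0.8),
]

def retrieve(query, index):
    """Step 1: keep pages that share at least one word with the query."""
    terms = set(query.lower().split())
    return [p for p in index if terms & set(p.text.split())]

def rank(pages, query, top_k=2):
    """Step 2: order by word overlap weighted by authority, keep the best few."""
    terms = set(query.lower().split())
    def score(page):
        return len(terms & set(page.text.split())) * page.authority
    return sorted(pages, key=score, reverse=True)[:top_k]

def synthesize(query, pages):
    """Step 3: placeholder for the language model that writes the answer."""
    return f"Summary for '{query}' drawn from {len(pages)} sources."

def overview(query, index):
    """Run the pipeline and attach citations (step 4)."""
    selected = rank(retrieve(query, index), query)
    return {
        "answer": synthesize(query, selected),
        "citations": [p.url for p in selected],
    }

result = overview("fast chargers", INDEX)
print(result["answer"])
print(result["citations"])
```

The answer text is produced in the synthesis step, but which pages get cited is decided earlier, in retrieval and ranking. That is why discoverability and authority still matter for brands even when no one scrolls the result list.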

That’s not copy-paste scraping; it’s model-driven synthesis informed by top sources. However, publishers argue the design
diverts clicks from their sites to Google’s summary layer — which brings us to the data and the lawsuits.

Part 4 — What the data says about traffic & clicks

Measured click-through declines when AI Overviews appear

  • Ahrefs (large-scale study): Across ~300k keywords, top-rank CTR dropped by about 34.5% on SERPs
    with AI Overviews versus similar informational queries without them.
    Source, coverage, summary.
  • Pew Research (July 22, 2025): Users who saw an AI summary clicked a traditional result in only 8% of visits,
    versus 15% when no AI summary appeared — roughly half as often.
    Source.
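
To see what these percentages mean in practice, the arithmetic can be sketched in a few lines of Python. The 8%, 15%, and 34.5% figures are the ones quoted above. The 10,000 impressions and the 25% baseline click-through rate are hypothetical numbers chosen only to make the example concrete.

```python
def relative_drop(before, after):
    """Fractional decline from `before` to `after` (0.25 means a 25% drop)."""
    return (before - after) / before

def clicks_after_drop(impressions, baseline_ctr, drop):
    """Expected clicks once a relative CTR drop is applied."""
    return impressions * baseline_ctr * (1 - drop)

# Pew figures quoted above: 15% click rate without a summary, 8% with one.
pew = relative_drop(0.15, 0.08)
print(f"Pew relative decline: {pew:.1%}")  # prints 46.7%

# Ahrefs figure quoted above (34.5%), applied to a hypothetical page
# with 10,000 impressions and an assumed 25% baseline CTR.
before = 10_000 * 0.25
after = clicks_after_drop(10_000, 0.25, 0.345)
print(f"Clicks before: {before:.0f}, lost: {before - after:.1f}")
```

The Pew gap of seven percentage points is a relative decline of about 47%, which is where "roughly half as often" comes from. Under the assumed baseline, the Ahrefs figure would cost the hypothetical page about 860 of its 2,500 clicks.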

Publisher-level impacts & reporting

  • Penske Media (Rolling Stone, Billboard, Variety) lawsuit: alleges AI Overviews reduce traffic and affiliate revenue;
    first major U.S. publisher case.
    Reuters, Axios.
  • Media analysis & press: multiple outlets and studies report significant declines in clicks when AI Overviews are present.
    Ars Technica, Fortune.

Bottom line: multiple independent sources now show that when AI Overviews appear, users are much less likely to click through to websites. That’s the core of the publisher backlash.

Part 6 — What this means for brands, SEO & content strategy

  • AI readiness beats “just keywords”: use structured data (Product, FAQ, HowTo) and maintain authoritative, updated pages.
  • Be the canonical source: original research, documentation, pricing, and specs earn citations in AI answers.
  • Own your entity: consistent brand/entity markup across site, profiles, and feeds improves recognition.
  • Diversify traffic: build email, social, partnerships, and direct navigation to reduce dependency on any one SERP format.

Direct answers to your three questions

  1. Why do AIs like ChatGPT have the knowledge they do? Because they are trained on large, diverse text corpora and can blend that with fresh retrieval. They learn patterns and relationships, not verbatim pages.
  2. Will every product/brand be stored? No. The web changes too fast, and much data is private or undocumented. The winning pattern is hybrid: core knowledge + real-time lookups.
  3. Is Gemini just scraping the first page? It’s synthesis, not copy-paste. But the synthesis layer sits above organic results and, per multiple studies, draws clicks away from publishers — hence the ongoing legal and regulatory pushback.
Category: AI, Search