Thirty years of bots: why identity wins

In 1990, the web did not really exist. What existed was a network of FTP servers, mailing lists, and Gopher sites, and if you wanted to find something on it you more or less had to already know where it was. Then Alan Emtage, a McGill University student in Montreal, wrote a script called Archie. It indexed FTP directories. It crawled automatically. Nobody blocked it, because the idea of blocking an automated visitor had not occurred to anyone. The network was small, the machines were slow, and automated discovery felt like a straightforwardly good thing.

That simplicity lasted about three years.

By 1993, Matthew Gray at MIT had written the World Wide Web Wanderer, often called simply the Wanderer, one of the first true web crawlers. It was designed to measure how fast the web was growing. It did this by crawling repeatedly, and at its peak it was reportedly responsible for somewhere between ten and fifteen per cent of all web traffic. Not because the Wanderer was large. Because the web was still tiny. Server administrators noticed. They complained. And from that complaint, a question emerged that the industry is still trying to answer: what is the correct relationship between an automated visitor and the site it visits?

The first gentleman's agreement

In 1994, a Dutch software engineer named Martijn Koster proposed a solution. The robots exclusion protocol, now universally known as robots.txt, specified that a crawler should check for a file at the root of any domain before crawling it. That file would list the paths the site owner did not want crawled. The crawler would respect those instructions. Koster had seen the Wanderer problem and wanted to give site owners a voice. The solution he came up with assumed that bot operators were reasonable people who would listen.
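In code, the check Koster described is short. The following is a minimal sketch using Python's standard urllib.robotparser module; the robots.txt content, the ExampleBot name, and the example.com URLs are illustrative assumptions rather than rules from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this from
# https://<domain>/robots.txt before requesting anything else on the domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "ExampleBot", "https://example.com/public/page"))     # True
print(may_fetch(parser, "ExampleBot", "https://example.com/private/report"))  # False
```

The whole check runs inside the crawler. A crawler that never calls it looks no different to the server.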

There was no enforcement mechanism. There was no cryptographic verification. There was no way to distinguish a bot that had read the file from one that had ignored it. Robots.txt was, and remains, a purely advisory document. A note on the front door. You can walk past it.

Most automated visitors did walk past it. The robots.txt specification assumed that people writing crawlers and people running servers had the same basic interests: a functioning web, no unnecessary damage. That assumption held long enough to make the protocol feel like it worked. Then money arrived on the web and the assumption collapsed.

"Robots.txt was a note left on the front door. You could walk past it. Most bots did."

The crawlers that did not walk past it, that genuinely read the file and followed its instructions, were the ones that turned out to matter most. Googlebot, which appeared in 1997 as the crawler for the newly formed Google search engine, became the canonical example. It identified itself clearly in its user-agent string. It crawled at a rate that would not destabilise servers. It honoured robots.txt without exception. These were not heroic decisions. They were practical ones, made by engineers who understood that a crawler without the cooperation of site owners was a crawler that would eventually be blocked everywhere. Trust was the prerequisite for access.

The arms race

The first commercial scraping boom came in the early 2000s. Price comparison sites, data brokers, and aggregators needed content at scale, and they were not inclined to ask politely. They disguised their user-agent strings. They rotated IP addresses. They introduced random delays to mimic human browsing patterns. The crawl-and-block cycle began in earnest.

CAPTCHA was invented in 2000 at Carnegie Mellon, initially as a way to distinguish humans from bots in form submission. The name is an acronym: Completely Automated Public Turing test to tell Computers and Humans Apart. It was effective for about four years. Then the solvers arrived: cheap human labour in outsourced call centres, paid fractions of a cent per solved challenge. reCAPTCHA, acquired by Google in 2009, added audio challenges and distorted text from digitised books, making the solving harder. The human farms adapted. The challenges got harder still. By the early 2010s, CAPTCHA had become a minor tax on human users and a moderate inconvenience for well-resourced bot operators.

JavaScript fingerprinting was the next serious defence layer. A browser visiting a page executes JavaScript. The way it executes, the fonts it reports, the screen dimensions it advertises, the way it handles canvas rendering: all of these produce a fingerprint that is difficult to replicate with a headless browser. Combined with IP reputation scoring and rate limiting, this made life considerably harder for basic scrapers. The scraper response was headless Chrome, which executes real browser JavaScript, and tools like Puppeteer and Playwright that could automate it. And so it went.

Cloudflare was founded in 2009 and began offering a Web Application Firewall as part of its CDN product. Bot defence became a subscription. Previously a site needed its own engineers to build and maintain detection logic. Now you could just pay someone. Cloudflare, Akamai's Bot Manager, PerimeterX (founded 2013), DataDome (founded 2015), Kasada (founded 2017): each new vendor offered more sophisticated behavioural analysis than the last. By 2020, the state of the art was machine learning models scoring cursor entropy, scroll velocity, mouse acceleration curves, and timing patterns across hundreds of signals simultaneously. A bot that moved the mouse in a perfectly straight line at constant speed was trivially identified. A bot that could reproduce the subtle jitter of a human hand moving to a target was not trivial at all.
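To see why the straight-line bot is trivially identified, consider a toy heuristic. It is a sketch under stated assumptions, not how any vendor's model works: real systems score hundreds of signals, whereas this one looks at two. The function names, thresholds, and sample paths are all hypothetical.

```python
import math
from statistics import pstdev


def path_features(points):
    """Compute (straightness, speed variation) for a cursor path.

    points: list of (t_ms, x, y) samples with strictly increasing timestamps.
    straightness is direct distance / travelled distance (1.0 = perfect line).
    speed variation is the coefficient of variation of per-step speed.
    """
    if len(points) < 2:
        raise ValueError("need at least two samples")
    steps, speeds = [], []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        distance = math.hypot(x1 - x0, y1 - y0)
        steps.append(distance)
        speeds.append(distance / (t1 - t0))
    travelled = sum(steps)
    direct = math.hypot(points[-1][1] - points[0][1], points[-1][2] - points[0][2])
    straightness = direct / travelled if travelled else 1.0
    mean_speed = sum(speeds) / len(speeds)
    speed_cv = pstdev(speeds) / mean_speed if mean_speed else 0.0
    return straightness, speed_cv


def looks_scripted(points, straightness_min=0.999, speed_cv_max=0.01):
    """Flag paths that are both almost perfectly straight and constant-speed."""
    straightness, speed_cv = path_features(points)
    return straightness >= straightness_min and speed_cv <= speed_cv_max


# A scripted move: equal steps along one line at a fixed sampling interval.
scripted = [(i * 10, 10 * i, 5 * i) for i in range(20)]
# A hand-written, human-like move: uneven timing and a wandering path.
human = [(0, 0, 0), (12, 4, 3), (20, 15, 6), (35, 22, 15),
         (41, 40, 18), (60, 51, 30), (66, 70, 33), (90, 76, 41)]

print(looks_scripted(scripted))  # True
print(looks_scripted(human))     # False
```

A bot that adds jitter and uneven timing passes both checks, which is why the real models kept adding signals.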

1990

Archie

Alan Emtage's FTP indexer at McGill. An automated index of the pre-web internet, often described as the first internet search engine. Nobody blocked it because nobody thought to.

1993

The Wanderer

Matthew Gray's web crawler reportedly generates 10–15% of all web traffic. Server admins complain. The robots.txt debate begins.

1994

Robots.txt

Martijn Koster proposes the robots exclusion protocol. Entirely advisory, entirely honour-based. No enforcement mechanism then. None now.

1997

Googlebot

Google's crawler identifies itself clearly, crawls slowly, and respects robots.txt without exception. The model that everyone should have copied.

2000

CAPTCHA

Carnegie Mellon's challenge-response system distinguishes humans from bots. Effective for approximately four years before the human solving farms arrive.

2009

Cloudflare WAF

Bot defence becomes a subscription product. The arms race industrialises on both sides.

2020

Behavioural ML

Detection models score cursor entropy, scroll velocity, and hundreds of behavioural signals simultaneously. The bot arms race reaches its most elaborate phase.

2023

AI agent crawlers

GPTBot, ClaudeBot, Gemini crawlers. The pattern repeats: new bot category, new block. But this time the bots are acting on behalf of paying customers.

The lesson nobody learned

Whilst all of this was happening, Googlebot kept doing what it had always done. It identified itself clearly. It respected robots.txt. It crawled at a rate that would not destabilise servers. It never attempted to look like a human browser. It never rotated its user-agent string to evade detection. It did not need to.

The result was that every major WAF vendor, every CDN, every bot detection platform added Googlebot to their default allowlist. Not because Google has special legal status. Not because of a formal agreement between Google and the CDN industry. But because Googlebot had spent years demonstrating predictable, honest behaviour, and the operators of those platforms trusted it. Trust was earned, incrementally, through consistent behaviour over a long period.

The irony is that scrapers spent twenty years trying to look like Googlebot whilst also trying to hide from the detection systems that Googlebot never triggered. They spoofed the Googlebot user-agent string. They tried to replicate Googlebot's crawl patterns. They did everything to appear to be Googlebot except the one thing that would have actually worked: being trustworthy. Googlebot's advantage was never the user-agent string. It was the track record behind it.

"Scrapers spent two decades trying to look like Googlebot whilst hiding from the same detection systems Googlebot never triggered. The spoofing was the mistake."

There is a simpler way to say this. Googlebot solved the crawler-versus-defence problem in 1997 by refusing to play the adversarial game. It announced itself honestly, behaved predictably, and accumulated trust. That trust became structural. The scraper industry looked at that outcome and concluded the lesson was about user-agent strings rather than identity. It was not. It was never about user-agent strings.

The AI agent wave

In August 2023, OpenAI published documentation for GPTBot, the crawler used to fetch web content for training and retrieval. The documentation included a two-line robots.txt rule to block it: User-agent: GPTBot, then Disallow: /. Within weeks, a notable number of publishers had added exactly that to their sites. Anthropic's ClaudeBot and Google's various Gemini-associated crawlers followed, each attracting the same defensive response.

Cloudflare's 2024 bot traffic reports showed AI crawlers at around a third of all automated requests, up from near-zero in 2022. Site owners, particularly news publishers, began blocking AI crawlers by default. The pattern was the same as price scrapers in 2003 and data brokers in 2010. New bot category. New block. The web had seen this before and knew exactly what to do about it.

Something is different this time, though.

Price scrapers and data brokers did not represent customers. They were extracting value from sites without returning any. A site that blocked every price scraper lost nothing except the cost of serving those requests. But AI agents in 2025 are increasingly acting on behalf of users who want to buy things. Someone asks their AI assistant to find and order a particular item. The agent needs product data and availability, and in some cases it needs to complete a transaction. If the site blocks the agent, the sale does not happen. The site has blocked a customer, just one who arrived through an automated intermediary rather than a browser tab.

Blocking a scraper costs nothing. Blocking an agent that represents purchase intent costs revenue. That distinction has turned the question of trustworthy versus extractive bots from a technical nuisance into a commercial problem, which means it will actually get solved.

The identity infrastructure moment

The industry is slowly arriving at the answer that was available in 1997. Identity, not evasion.

Cloudflare's Verified Bots programme lets operators submit their crawlers for verification. Cloudflare validates the bot's identity through DNS resolution and keeps a list of verified bots that its WAF treats as trusted by default. The mechanism is not cryptographic, but the principle holds: a bot that submits to verification and maintains consistent behaviour gets a designation that bypasses the adversarial detection layer. Googlebot has been on that list for years. So has Bingbot, along with a growing number of AI agents.
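One common form of DNS-based validation is forward-confirmed reverse DNS, which Google documents as a way to check that a visitor claiming to be Googlebot really is. The sketch below is generic and does not describe Cloudflare's internal process. The function names are hypothetical, and the DNS records in the demo are made up, using documentation-only IP ranges, so that it runs offline.

```python
import socket


def reverse_lookup(ip):
    """Return the PTR hostname for an IP address (performs a real DNS query)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Return the set of addresses a hostname resolves to (real DNS query)."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(ip, allowed_suffixes,
                        reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to sit under one of the operator's domains.
    3. Forward-resolve that hostname and require the original IP in the result.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    if not any(hostname.endswith("." + suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False


# Made-up records standing in for DNS.
PTR = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "198.51.100.7": "crawl-fake.googlebot.com",  # PTR set by the IP's owner
    "203.0.113.9": "scraper.example.net",
}
A = {"crawl-192-0-2-10.googlebot.com": {"192.0.2.10"}}


def check(ip):
    return is_verified_crawler(
        ip, ["googlebot.com", "google.com"],
        reverse=lambda addr: PTR[addr],
        forward=lambda host: A.get(host, set()),
    )


print(check("192.0.2.10"))    # True: reverse and forward lookups agree
print(check("198.51.100.7"))  # False: PTR claims googlebot.com, forward disagrees
print(check("203.0.113.9"))   # False: hostname is outside the allowed domains
```

The second case is the important one. Whoever controls an IP address can set its reverse record to any name, so the forward lookup is what ties the name back to the operator.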

Akamai's Bot Manager has a similar "known good bots" list, built from operators who have registered their crawlers and agreed to behavioural standards. A registered bot gets through. An unregistered one faces the full detection stack.

Beyond allowlists, something more technically durable is taking shape. The /.well-known/agent-identity.json pattern, borrowing from RFC 8615, gives agents a canonical place to publish their identity, operator, stated purpose, and a public key. Pair that with request-level Ed25519 signing, using headers like Agent-Signature, Agent-Id, and Agent-Purpose, and a receiving server can verify cryptographically that a request was signed by the holder of the published key. Not "this looks like a known bot." Verified.

/.well-known/agent-identity.json — example structure

{
  "agent_id": "lighthouse-scanner-v2",
  "operator": "Aidō Labs Ltd",
  "purpose": "AI readiness scanning on behalf of site owner",
  "policy_url": "https://aido-lighthouse.com/scanner-policy",
  "public_key": {
    "kty": "OKP",
    "crv": "Ed25519",
    "x": "<base64url-encoded public key>"
  },
  "robots_txt": "respected",
  "rate_limit": "1 req/s per domain"
}
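A receiving server would need to load and sanity-check a manifest like this before trusting the key in it. The following is a minimal sketch of that step. Which fields count as required is an assumption made here, since the pattern is still a proposal, and the function names are hypothetical. The key in the demo is 32 zero bytes, a placeholder and not a real public key.

```python
import base64
import json

# Assumption: these are the fields a verifier cannot do without.
REQUIRED_FIELDS = ("agent_id", "operator", "purpose", "public_key")


def decode_ed25519_key(jwk):
    """Decode the raw public key from a JWK-style OKP / Ed25519 entry."""
    if jwk.get("kty") != "OKP" or jwk.get("crv") != "Ed25519":
        raise ValueError("expected an OKP key on the Ed25519 curve")
    x = jwk.get("x", "")
    padded = x + "=" * (-len(x) % 4)  # base64url values omit '=' padding
    raw = base64.urlsafe_b64decode(padded)
    if len(raw) != 32:
        raise ValueError("an Ed25519 public key is exactly 32 bytes")
    return raw


def parse_manifest(text):
    """Parse manifest JSON; return (manifest dict, raw public key bytes)."""
    manifest = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in manifest]
    if missing:
        raise ValueError("manifest is missing: " + ", ".join(missing))
    return manifest, decode_ed25519_key(manifest["public_key"])


# Placeholder key for the demo only.
PLACEHOLDER_KEY = base64.urlsafe_b64encode(bytes(32)).rstrip(b"=").decode()

example = json.dumps({
    "agent_id": "lighthouse-scanner-v2",
    "operator": "Aidō Labs Ltd",
    "purpose": "AI readiness scanning on behalf of site owner",
    "public_key": {"kty": "OKP", "crv": "Ed25519", "x": PLACEHOLDER_KEY},
})

manifest, key = parse_manifest(example)
print(manifest["agent_id"], len(key))
```

Checking the signature itself needs an Ed25519 implementation, which the Python standard library does not include, so that step is left out of the sketch.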

Visa's Trusted Agent Protocol, known as TAP, takes a related approach from the payments direction. Where agent identity manifests address the question of "who is this bot," TAP addresses the question of "what has the user actually authorised." A verifiable intent token, cryptographically signed at the point of user consent, travels with the agent's requests and allows a merchant's systems to confirm that a real, identified user has authorised a specific action. The agent is not acting autonomously. It is acting on delegated, verifiable authority.

None of these have reached the ubiquity of robots.txt. Some are proposals. Some are in early deployment. But they are all pointing the same way: the answer to the bot problem is not better stealth, and it is not more elaborate detection. It is cryptographic identity. Prove who you are, state your purpose, get cleared.

This is what Googlebot did with plain text in 1997. We are rebuilding it with public-key cryptography because plain text turned out to be too easy to spoof. The principle has not changed at all.

What this means for Aidō Lighthouse

Aidō Lighthouse is a scanner. We visit sites, analyse their machine-readability, and score how well they will work for AI agents. That puts us in the same position every scanner has been in since 1993: we are an automated visitor, and many of the sites we visit have systems that treat automated visitors as threats by default.

We had two options. The first was to build a better fingerprint spoofer: a headless browser that looked more convincingly human, smarter IP rotation, randomised timing. That works for a while, until the detection systems update, and then you are back in the arms race with no exit.

We went the other way. We published an identity manifest at /.well-known/agent-identity.json, set up a scanner policy at /scanner-policy, signed our requests with Ed25519, and submitted to Cloudflare's Verified Bots programme. Every request we make carries an Agent-Id header and a signature. We respect robots.txt without exception. We crawl at one request per second per domain.
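A limit of one request per second per domain is simple to enforce on the client side. The sketch below shows one way to do it and is not the scanner's actual implementation. The PerDomainThrottle class and its parameters are hypothetical, and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from urllib.parse import urlsplit


class PerDomainThrottle:
    """Space out requests so each domain is hit at most once per interval."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # domain -> earliest time of the next request

    def wait(self, url):
        """Block until the URL's domain may be requested; return seconds waited."""
        domain = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(domain, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # The request goes out at max(now, ready_at); the next one must wait
        # a full interval after that.
        self._next_allowed[domain] = max(now, ready_at) + self.min_interval
        return delay
```

A crawler would call throttle.wait(url) immediately before each fetch. Requests to different domains do not delay one another, because the schedule is kept per hostname.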

The practical reason: a verified identity does not trigger detection systems. It does not need updating when Cloudflare ships a new behavioural model. It accumulates trust rather than spending it. The scraper arms race is a treadmill; identity is a foundation.

The other reason: we tell clients they need to build infrastructure that AI agents can actually use. We tell them to implement structured data, expose APIs, treat machine clients as a first-class audience. We cannot credibly say that whilst running our own scanner as though its identity were something to hide. We should model the behaviour we are recommending. So we do.


The rest of the industry is reaching the same conclusion, slowly and somewhat reluctantly, because it requires giving up the appeal of invisibility. There is something seductive about a bot that cannot be detected. It feels powerful. It is, in practice, one detection update away from being blocked everywhere. A bot that is known, identified, and trusted is less exciting and considerably more durable.

The shape of what comes next

The next few years will show whether the AI agent wave repeats the scraper era's mistakes or does something different. The technical infrastructure for identity is being built: manifest files, signed headers, verified bot programmes, intent tokens. Whether it gets adopted depends on whether agent operators and site operators can coordinate before the adversarial pattern becomes the default.

The commercial incentives are better aligned than they were during the scraper era. A site accessible to AI agents has a new distribution channel. An agent with a verified identity gets access an anonymous bot would not. The scraper economy was adversarial because the scrapers wanted something sites did not want to give. The agent economy, done right, is one where agents deliver customers and sites deliver products. Both sides need the transaction to work.

The pattern from the last thirty years is clear enough. The bots that survived and thrived were the ones that announced themselves honestly and behaved consistently. The ones that tried to be invisible had to keep running. Googlebot is still crawling. Most of the sophisticated scraper operations from 2010 are gone, blocked into irrelevance or simply outpaced by the detection systems they never stopped fighting.

AI agents are not scrapers. But they will be treated like scrapers until they demonstrate otherwise. The demonstration requires identity, not cleverness. The web figured this out once. It would be good not to spend another twenty years learning it again.

Is your site ready for AI agents?

Run a free scan to see how machine clients experience your site, and what is blocking them from working effectively.
