Google open sources robots.txt parser in push for web crawler standard


In an effort to push for an official web crawler standard, Google has made its robots.txt parsing and matching library open source, in the hope that web developers will soon agree on a standard for how web crawlers operate online.

The C++ library powers the company's own web crawler, Googlebot, which indexes websites in accordance with the Robots Exclusion Protocol (REP). Through REP, website owners can dictate how the web crawlers that visit and index their sites should behave.
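
For a sense of how the now open source library is consumed, the sketch below follows the shape of the code Google published; the header name, the googlebot::RobotsMatcher class and the OneAgentAllowedByRobots call reflect that repository, but treat the exact signatures as assumptions rather than gospel:

    #include <string>

    #include "robots.h"  // Google's open sourced robots.txt parser

    int main() {
      // A site's robots.txt rules (illustrative contents).
      const std::string robots_txt =
          "User-agent: *\n"
          "Disallow: /private/\n";

      // Ask whether Googlebot may fetch a given URL under those rules.
      googlebot::RobotsMatcher matcher;
      const bool allowed = matcher.OneAgentAllowedByRobots(
          robots_txt, "Googlebot", "https://example.com/private/page.html");

      // Here this exits with 1, since /private/ is disallowed for all agents.
      return allowed ? 0 : 1;
    }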

Using a text file called robots.txt, web crawlers such as Googlebot learn which website resources they may visit and which they may index.
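
A minimal robots.txt might look like the following, where the directory name is purely illustrative:

    # Let Googlebot crawl everything; keep all other crawlers out of /private/
    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /private/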

The rules for REP were written 25 years ago by Martijn Koster, creator of the first search engine. Since then, REP has been widely adopted by web publishers but has never become an official internet standard. Google is looking to change this, and hopes to do so by open sourcing the parser it uses to decode robots.txt files.

REP

In a blog post, Henner Zeller, Lizzi Harvey and Gary Illyes explained how REP's lack of official standing has led to confusion among web developers about how to implement it, saying:

“The REP was never turned into an official Internet standard, which means that developers have interpreted the protocol somewhat differently over the years. And since its inception, the REP hasn't been updated to cover today's corner cases. This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly.”

To help make REP implementations more consistent across the web, Google is now pushing to make REP an Internet Engineering Task Force (IETF) standard, and the search giant has published a draft proposal to support its efforts.

The proposed draft extends robots.txt from HTTP to any URI-based transfer protocol (including FTP and CoAP), requires crawlers to parse at least the first 500 kibibytes of a robots.txt file, and sets a new maximum caching time of 24 hours.
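
To put those numbers in code form, a crawler following the draft might cap how much of a fetched robots.txt file it parses. The small C++ helper below is purely illustrative and not part of Google's library:

    #include <cstddef>
    #include <string>

    // Two provisions from the draft: parse at least 500 kibibytes of a
    // robots.txt file, and treat cached copies as stale after 24 hours.
    constexpr std::size_t kParseLimitBytes = 500 * 1024;
    constexpr long kMaxCacheAgeSeconds = 24L * 60 * 60;

    // Clamp a fetched robots.txt body to the parse limit before handing it
    // to the parser (hypothetical helper, not from the open sourced library).
    std::string ClampToParseLimit(const std::string& body) {
      return body.size() > kParseLimitBytes ? body.substr(0, kParseLimitBytes)
                                            : body;
    }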

“RFC stands for Request for Comments, and we mean it: we uploaded the draft to IETF to get feedback from developers who care about the basic building blocks of the internet. As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right,” Zeller, Harvey and Illyes added.

Via The Register

Anthony Spadafora

