Google bots crawling into new areas of the internet
Search bots learn new tricks to seek out new pages
Google search bots are trialling a new process whereby they automatically enter text into the HTML form boxes of "high-quality sites" in order to reveal the otherwise "hidden web" content on subsequent pages.
Until now, Google's search bots have effectively been stopped at the front page of such sites. If that front page requires users to select, say, their location, so as to redirect them to a page with country-specific content, Google's search bots have, thus far, been unable to access it.
What lies beneath
Not any more though. Writing on the official Google Webmaster Central blog, Jayant Madhavan and Alon Halevy from Google’s crawling and indexing team have revealed how the search bots are now filling in these forms to see what lies behind.
"In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google," the pair explain, before moving on to how the process works.
"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML," they explain.
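The select-menu half of that process is straightforward to picture: for a GET form, each option value in the HTML yields one candidate URL. A minimal sketch using only Python's standard library is below; the parsing and URL-building logic is purely illustrative, not Google's actual implementation, and the example form is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

class GetFormParser(HTMLParser):
    """Collects a form's action URL and the option values of each <select>."""
    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.selects = {}            # select name -> list of option values
        self._current_select = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action", "")
            self.method = attrs.get("method", "get").lower()
        elif tag == "select":
            self._current_select = attrs.get("name")
            if self._current_select:
                self.selects.setdefault(self._current_select, [])
        elif tag == "option" and self._current_select:
            self.selects[self._current_select].append(attrs.get("value", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None

def candidate_urls(base_url, html):
    """Yield one URL per option value of each select menu (GET forms only)."""
    parser = GetFormParser()
    parser.feed(html)
    if parser.method != "get":       # skip POST forms, as the blog post says Google does
        return []
    urls = []
    for name, values in parser.selects.items():
        for value in values:
            query = urlencode({name: value})
            urls.append(urljoin(base_url, parser.action) + "?" + query)
    return urls

html = """
<form action="/search" method="get">
  <select name="country">
    <option value="uk">UK</option>
    <option value="us">US</option>
  </select>
</form>
"""
print(candidate_urls("http://example.com/", html))
# -> ['http://example.com/search?country=uk', 'http://example.com/search?country=us']
```

Each generated URL can then be fetched and indexed like any ordinary link; the text-box case Google describes is harder, since the crawler must first choose plausible words from the site itself.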
Crawling spiders
According to the authors, this is all being done in accordance with "good internet citizenry practices" and "adheres to robots.txt, nofollow and noindex directives". Google also says the new crawl only submits GET forms, and avoids any type of form that might collect personal data.
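The robots.txt part of that claim is something any well-behaved crawler can implement with Python's standard `urllib.robotparser` before fetching a form-generated URL. A short illustration follows; the rules and URLs shown are invented for the example and say nothing about Google's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: forms under /private/
# are off-limits to all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A form-generated URL is only fetched if robots.txt permits it.
print(rp.can_fetch("Googlebot", "http://example.com/search?country=uk"))  # True
print(rp.can_fetch("Googlebot", "http://example.com/private/data"))       # False
```

Site owners can therefore already fence off sections they don't want form-crawled, which bears directly on the complaints described below.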
While Google hopes all this will help to illuminate parts of the internet that are otherwise hidden, not everybody is happy with its explanation.
In fact, the comments section of the official Google blog is awash with queries as to how easy it will be for site owners to stop the search bots from entering parts of their site they don’t want indexed, and whether the new feature will just end up creating endless duplicated content.
Another poster simply asks whether the move is just another way of filling the Google index up with adult content that lies beyond the standard "Yes, I am over 18 years of age, please take me to the pron (sic)" form.
Honestly, some people are so cynical...