Screen scraping: how to stop the internet's invisible data leeches
Automated bots are on the prowl
In general, we believe there is a significant increase in the use of automated bots to gather website content to seed other services, fuel competitive intelligence, and aggregate product details like pricing, features and inventory. Increasingly this information is used to get a leg up over the competition, or to increase website hit rates.
For example, in the travel and tourism industry, price scraping is a real issue, as travel sites are constantly looking to beat the competition by offering the 'best price'. Inventory scraping is also becoming more common: bots are used to buy up volumes of a high-value item for resale, or to push up online pricing and deter potential buyers.
Combine the ready availability of seemingly legal software bundles and services that facilitate screen scraping with the motives we've just described, and you have a powerful combination.
TRP: How long has screen scraping been going on for and is it becoming more or less of a problem for companies?
AS: Screen scraping has been going on for years, but only recently have the victims of this type of behaviour begun to react. Some claim copyright infringement and unfair business practices, while the organizations doing the scraping claim freedom of information.
Many website owners have written usage policies on their sites that prohibit aggressive scraping but have no ability to enforce their policies - the problem doesn't seem to be going away anytime soon.
TRP: How does screen scraping impact negatively on a business's IT systems?
AS: Competitive or abusive screen scraping is just another example of unwanted traffic. Recent studies show that 61% of internet traffic is generated by bots. Bad-bot scrapers consume valuable resources and bandwidth intended to serve genuine website users, which can increase latency for real customers because of the large number of non-human visits to the site. The business impact manifests itself as additional IT investment needed to serve the same number of customers.
TRP: eBay introduced an API years ago to combat screen scraping. Is creating an API to provide access to data a recommended form of defense?
AS: Providing a dedicated API gives "good" scrapers programmatic access to your data, and those users voluntarily observe resource utilization limits. However, it does not stop malicious information harvesting for competitive advantage.
Real defense can be obtained by taking advantage of technology that can identify and block unwanted non-human visitors to your website. This would allow real or 'good' users to access the site for their intended purposes, while blocking the bad crawlers and bots from causing damage.
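The resource utilization limits that an API imposes on its callers can be sketched with a per-key token bucket. This is a minimal illustration only, not eBay's or any vendor's implementation; the class name `TokenBucket`, the `api_key` identifier and the rate and capacity values are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Hypothetical per-key limiter: bursts up to `capacity` requests,
    refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.buckets = {}           # api_key -> (tokens, time of last check)

    def allow(self, api_key):
        """Return True if this key may make a request right now."""
        now = self.clock()
        tokens, last = self.buckets.get(api_key, (self.capacity, now))
        # Refill in proportion to the time elapsed, capped at the burst size.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[api_key] = (tokens - 1, now)
            return True
        self.buckets[api_key] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=5.0, capacity=10)
    # A burst of 12 immediate calls: the first 10 pass, the rest are refused.
    results = [limiter.allow("partner-key") for _ in range(12)]
    print(results.count(True), "allowed,", results.count(False), "refused")
```

A limiter like this only constrains callers who choose to use the API, which is the point made above: a scraper that ignores the API and reads the public pages is untouched by it.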
TRP: How else can an organisation defend itself from screen scraping?
AS: By using techniques such as IP reputation intelligence, geolocation enforcement, spoofed IP source detection, real-time threat-level assessment, request-response behaviour analysis and bi-directional deep packet inspection.
Many organizations today rely on Corero's First Line of Defense technology to block unwanted website traffic, including excessive scraping. Corero helps distinguish human visitors from non-human bots (e.g. running scripts) and blocks the unwanted offenders in real time.
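As a rough illustration of the simplest kind of request behaviour analysis, the sketch below flags a client whose request rate within a sliding time window is higher than a human visitor would plausibly produce. It is not a description of Corero's product; the class name `RequestRateMonitor`, the thresholds and the example addresses are assumptions, and a real system would combine many more signals.

```python
from collections import defaultdict, deque


class RequestRateMonitor:
    """Hypothetical detector: flags a client that makes more than
    `max_requests` requests within any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)   # client_ip -> recent timestamps

    def record(self, client_ip, timestamp):
        """Record one request; return True if the client now looks automated."""
        hits = self.history[client_ip]
        hits.append(timestamp)
        # Discard requests that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    monitor = RequestRateMonitor(max_requests=20, window_seconds=10)
    # Assumed traffic: one client fetching a page every 0.1 seconds.
    flagged = [monitor.record("203.0.113.7", i * 0.1) for i in range(30)]
    print("first flagged on request", flagged.index(True) + 1)
```

Rate alone is easy for a scraper to evade by slowing down or spreading requests across many addresses, which is why the answer above lists it alongside IP reputation, geolocation and packet inspection rather than on its own.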
TRP: Are there any internet rules governing the use (or misuse) of screen scraping?
AS: Screen scraping has been the topic of some pretty high-profile lawsuits, for example Craigslist vs. PadMapper and, in the travel space, Ryanair vs. Budget Travel.
However, most court cases to date have not been fully resolved to the satisfaction of the victims. The courts often refuse to grant injunctions against this activity, most likely because they have no precedent to work with. This is primarily because there are few, if any, internet rules really governing this type of activity.