Property sellers expect that, when they give an agent their listing information, it will be used to market and sell their property. While that means agents need to give the listing information exposure on the Internet, it is the agent’s responsibility to take reasonable steps to ensure that the data is not ‘scraped’ off their website and used for illicit purposes, such as direct-marketing the seller, display in unapproved places on the Internet, and other undesirable uses.
Realtors, this means verifying that your website and software providers have taken steps to ensure that the data is not harvested by malicious software (“bots”). This is mandated by MLS rules for Virtual Office Websites (VOWs) but not yet for other displays. But now that IDX rules are changing to allow sold data, IDX should definitely be re-examined as well.
The scraping issue of yesterday vs. today
The ‘scraping’ issue was the center of attention in our industry a few years ago, when several MLSs went up against a nationwide data ‘scraper’ that foolishly re-posted the stolen data online where it could be found.
Even so, it cost these MLSs over $10 million and a lot of time in court to get this one scraper to stop. But most of the scrapers’ work product never sees the light of day, so we can’t easily find them and go after them. Proactively stopping the bots is the only way we can deal with this problem.
Where are the bots now a problem?
- MLS systems – past the login, but also prospecting and client collaboration features, and the framed IDX solutions some vendors offer
- MLS/Association consumer-facing websites with listings (not to mention the member roster)
- IDX sites
- Virtual Office Websites (VOW)
- Publishers / Portals
Stopping those bots is not easy for a developer or webmaster
Even just a few years ago, it was easier. A bot wouldn’t look (to the web server) like a real web browser. A bot would look at too many listings from one IP address, or look through them faster than any human ever could.
You can still catch a few of the less sophisticated bots by watching for those kinds of things – but most of the scrapers have moved beyond that level of sophistication, and it’s all too easy to block the good bots you want crawling your site, like search engines.
These days, the bots may be written to automate the activities of real web browsers, making it harder to distinguish bot traffic from people traffic. The bots may be deployed on thousands of computers with IP addresses that may belong to, or be re-deployed to, actual legitimate users – so blocking an IP address is no longer effective.
And, instead of looking at thousands of listings from one computer, the bots can now look at just a few listings from many computers – so old-fashioned “rate limiting” and reviewing how many listings were viewed by one computer no longer help us differentiate between bots and real people.
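To make the old-school heuristics concrete, here is a minimal sketch of the kind of checks described above: a crude User-Agent test plus per-IP rate limiting. The function name, thresholds, and User-Agent strings are illustrative assumptions, not from any real product – and, as explained above, modern bots that drive real browsers from many residential IP addresses will sail right past checks like these.

```python
# A minimal sketch of "old-fashioned" bot detection: per-IP rate limiting
# plus a naive User-Agent check. Thresholds and names are illustrative.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120  # illustrative threshold, not a recommendation

_recent = defaultdict(deque)  # ip -> timestamps of recent requests


def looks_like_old_style_bot(ip, user_agent, now=None):
    """Return True if a request trips the classic (and now dated) heuristics."""
    now = time.time() if now is None else now

    # 1. Older bots often didn't send a browser-like User-Agent at all.
    ua = (user_agent or "").lower()
    if not ua or "curl" in ua or "python" in ua or "scrapy" in ua:
        return True

    # 2. Rate limiting: too many requests from one IP inside the window.
    window = _recent[ip]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```

A check like this still catches the laziest scrapers, but it also illustrates the trade-off mentioned above: tighten the threshold and you risk blocking legitimate heavy users and good search-engine crawlers along with the bad bots.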
How to protect your site from these bots
There are a variety of companies that specialize in stopping the scrapers’ bots. At one end, there’s Sentor, which is good and used by Realtor.com – but way too expensive for the smaller, non-enterprise-level companies that make up our industry. Another is Distil Networks, which protects a number of large-scale platforms as well as MLSs, IDX sites, and brokerages, and is highly accurate and effective.
Just this morning, I was reviewing data from an IDX vendor protected by Distil Networks. Over the past two months, more than 7 million page requests had been made by malicious bots (as distinguished from the good search-engine bots) – over a million requests by one bot alone!
And that’s just one IDX vendor that had taken the step of implementing an anti-scraping solution. What is going to happen when that vendor ratchets up its bot-blocking protections? I will tell you – the bad guys will move on to an easier target. Is your site an easy target for web scraping? We’re all in this together, and everybody needs to do their part.