Quote:
Originally Posted by Barry-xlovecam
It's really cat and mouse. UFW or iptables -- firewall them out -- if you have root. However, they will change IPs or AS networking so it is a never ending game.
I have a site that's scraped to hell and back. If you exclude Googlebot and all the scrapers, probably less than 2% of the traffic remains (actual browser loads).
Over the years I've added bits and pieces to log various interesting information. The big red flag, at least on my site: scrapers use proxies, so their IPs can change without notice, but the headers they send usually follow a fixed pattern that looks nothing like a real browser's, so they're easy to block.
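For illustration, here's a minimal sketch of that kind of header fingerprinting. The specific checks and User-Agent substrings are my assumptions, not anything from a particular site: real browsers almost always send Accept-Language and a rich Accept header, while bare HTTP libraries tend to omit them or identify themselves outright.

```python
# Hypothetical header-fingerprint check. The hint list and header rules are
# illustrative assumptions; tune them against your own logs.
SCRAPER_UA_HINTS = ("python-requests", "curl/", "go-http-client", "scrapy")

def looks_like_scraper(headers: dict) -> bool:
    """Return True if the request headers match a non-browser pattern."""
    ua = headers.get("User-Agent", "").lower()
    if any(hint in ua for hint in SCRAPER_UA_HINTS):
        return True
    # Browsers send Accept-Language; many scraping libraries do not.
    if "Accept-Language" not in headers:
        return True
    # Browsers send a full Accept header, not an empty one or a bare */*.
    if headers.get("Accept", "") in ("", "*/*"):
        return True
    return False
```

In practice you'd call this early in request handling and route matches to a 403 (or feed the IP to the firewall).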
Even a simple CAPTCHA, triggered after, say, 10 loads without presenting a cookie, blocks most of them. Some IPs keep hammering the site day after day, even though they're almost perpetually 403'd or firewalled.
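A sketch of that cookie-less load counter, assuming a threshold of 10 and an in-memory store (a real deployment would want expiry and shared storage across workers):

```python
from collections import defaultdict

CAPTCHA_THRESHOLD = 10  # assumed: loads without our cookie before challenging

_cookieless_loads = defaultdict(int)  # ip -> count; in-memory for the sketch

def should_challenge(ip: str, has_session_cookie: bool) -> bool:
    """Count page loads arriving without our session cookie and trigger a
    CAPTCHA once an IP exceeds the threshold."""
    if has_session_cookie:
        # A client that presents the cookie is behaving; reset its counter.
        _cookieless_loads.pop(ip, None)
        return False
    _cookieless_loads[ip] += 1
    return _cookieless_loads[ip] > CAPTCHA_THRESHOLD
```

The cookie is set on the CAPTCHA success page, so a browser only ever sees the challenge once, while a cookie-less scraper hits it every 10 requests.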
I guess there's a market for a service like this, if one doesn't already exist... but integrating it into a customer's existing site would be interesting...