How can you stop THIEVES SCRAPING your SITES?
I have a content-rich site that this is happening to more and more.
You can ban their IP addresses, but they soon pop up with another. Is there any way to lock down the server better, yet still allow the friendly SE bots? It's becoming an increasing problem. :mad: |
We use Strongbox, which says it has anti-scraping technology, but to be honest I've never dealt with it. The logic they use makes sense, though.
http://bettercgi.com/strongbox/features.html#antislurp |
Thanks, Chris.
I forgot to mention the site is written in ASP.NET; I don't know exactly how the different methods work with different server setups. Knowing nothing much about server security, I would just have a list of all known "friendly" search engine bots, and everything else gets fooked out if there are multiple session attempts. Is that how it works? I should contact my dedicated host; I'm sure they'll be able to recommend something, that's their speciality after all. |
Well, one method I use is to delete old content and replace it with completely new content at a different location.
|
Quote:
I'd like to simply shut off programmatic attempts to access my sites from unknown IP addresses, period. |
Is it a paysite or free? And when you say "thieves"... are they duplicating your website or just downloading content?
|
Quote:
In the first instance they had an autoscraper on a daily scrape, pulling the new entries and posting them to a, cough, **new** site, which I soon had closed down via DMCA. Now I've caught it at an earlier stage. |
Just put advertisements in your player; then they are advertising for you.
:thumbsup |
Honestly, there's not much you can do.
If you can see it, you can steal it.... cookies, restrictions, captchas etc. can all be defeated. The best suggestions are: watermark, adverts, etc. One trick that really does stump most spiders, however, is to link to your content via JS or CSS.
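For the JS version, something like this: build the link in script, so a spider that doesn't execute JavaScript never sees the target URL. A minimal sketch; the path fragments and the "player-box" element id are made up for illustration:

```typescript
// Build the real content link at runtime so non-JS spiders never see the URL.
// The path fragments and the "player-box" element id are hypothetical.
const fragments = ["/vid", "eos/cl", "ip-00", "42.mp4"];

const link = document.createElement("a");
link.href = fragments.join("");          // assembles "/videos/clip-0042.mp4"
link.textContent = "Watch the full clip";

document.getElementById("player-box")?.appendChild(link);
```
|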
Quote:
Banning IPs won't help much, because they might just use rotating IPs. Banning known offline-browser user-agents could help, but that is also easy to get around (they send fake user-agent info).

If you have movies, put the links in JavaScript (the agent usually can't read that) and the text outside it (for the SEs). The downside of this is surfers who have disabled JavaScript.

If you have a decent CPU on your server, then trick their "browser" into fake links with long delays (like a CGI link), or fake targets that temporarily ban whoever makes too many attempts. This will lag or cut off their agent temporarily. (Might be a very good idea to use robots.txt on those links, because you do not want to trick Google..)

You can also create something that their agent can't sort out. The more garbage, the better. For instance, if you have invisible links to the "tubegirl" :winkwink:, they will end up with all kinds of shit they have to sort out manually. You can also structure your site in a way that has no logic and is hard for software to restructure.

You can also watermark your stuff with "licensed to...", but talk with your sponsors before doing that. And if they promote the same sponsor, then you should talk about that too :)
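A rough sketch of the delayed fake-link idea using Node/Express (the trap path is invented; remember to Disallow it in robots.txt so the friendly bots never touch it):

```typescript
import express from "express";

const app = express();

// Tarpit route: never linked visibly and disallowed in robots.txt, so only
// rogue agents that ignore both will request it. Stall them for a minute,
// then hand back nothing.
app.get("/archive/full-dump", (_req, res) => {
  setTimeout(() => res.status(404).end(), 60_000);
});

app.listen(8080);
```
|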
You can use trap files ... like a 1x1 pixel file named 6tjgTTvtfgh.jpg or something like that. If it gets downloaded, you know it's an illegal bot. Now write a script that blocks that user based on IP and session ID, or have him download some malicious bullshit ...
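Something along these lines, sketched with Node/Express (the in-memory ban list is just for illustration; in practice you'd push the IP to a firewall rule or a shared store):

```typescript
import express from "express";

const app = express();
const banned = new Set<string>();   // in-memory ban list, for illustration only

// Refuse anything from an IP that has already tripped the trap.
app.use((req, res, next) => {
  if (req.ip && banned.has(req.ip)) {
    res.status(403).end();
    return;
  }
  next();
});

// The trap file itself: a randomly named image no human will ever request.
app.get("/img/6tjgTTvtfgh.jpg", (req, res) => {
  if (req.ip) banned.add(req.ip);
  res.status(403).end();
});

app.listen(8080);
```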
|
I'm facing this problem myself, as I have a site with millions of pages. Currently, if an IP downloads too many pages within a short period of time and it's NOT on the whitelist (e.g. the IPs of Google), it gets firewalled for a period of time. It's a pretty aggressive approach, but it works (for now). Something like the sketch below.
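A stripped-down version of that idea as Node/Express middleware (the whitelisted IP is only an example, so verify the search engines' published ranges yourself; the window and threshold numbers are made up):

```typescript
import express from "express";

const app = express();

const WHITELISTED = new Set(["66.249.66.1"]);  // example crawler IP; verify real ranges
const WINDOW_MS = 10_000;                      // 10-second window (made-up value)
const MAX_REQUESTS = 20;                       // threshold before blocking (made-up value)

const history = new Map<string, number[]>();   // per-IP request timestamps

app.use((req, res, next) => {
  const ip = req.ip ?? "unknown";
  if (WHITELISTED.has(ip)) return next();      // friendly bots skip the limiter

  const now = Date.now();
  const recent = (history.get(ip) ?? []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  history.set(ip, recent);

  if (recent.length > MAX_REQUESTS) {
    res.status(429).send("Too many requests"); // or add a firewall rule here
    return;
  }
  next();
});
```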
Most of the people trying to scrape don't bother trying to hide it, so their three fetches a second get picked up pretty quickly. |
Please post the domains of these bastards. Without the http.
|
i was going to say DMCA, but it looks like you did it.
please, no idiots come in here saying that if your site is online you are consenting for it to be scraped. fuck off. |
Quote:
Have you given consent for Google to scrape you? Yahoo? Bing? etc. Basically you have an apple tree in a public place. You aren't stopping anyone from picking your apples; in fact you like some people picking your apples (Google), even though they never asked to pick your apples, and furthermore they are showing your apples on their site and making money from it. Shitloads of money. Can't be too surprised when some fatty comes by and picks all your apples one day, after watching everyone else pick them and you not stopping them :) |
search engines link to my content. in order for someone to read it they have to click through to my site. there is a difference.
|
what is so hard to understand? you can't think of one content-rich free site? :upsidedow
|
i understand where you are coming from, but there is a difference between a content preview and a full scrape.
|
Quote:
even without that, all google is doing is cutting your page up and displaying it as separate items; the only difference is they get to show way more ads in the process than someone who just scrapes the page and repukes it up. |
Quote:
Throttlebox is an Apache module. The OP says he's using ASP.NET. I wonder if that means he's hosting on a Windows desktop instead of a server OS running Apache. |
Quote:
There's some good advice in here; thanks for taking the time, folks. :thumbsup |
Quote:
It often does not. We don't send people malicious files, of course; we're not criminals. But we do use traps. It's a useful part of a multi-layered approach, but not at all sufficient on its own. |
Quote:
If you've been a webmaster for more than a few days, you know about robots.txt. By choosing not to put up a "no indexing" sign (robots.txt), you've given implied permission for Google to promote you by adding you to their index. I'd bet the people scraping (not indexing) the site don't check for robots.txt. Besides, use an ounce or so of common sense. Obviously webmasters want their porno sites listed in search engines. Duh.
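For anyone who hasn't written one, a robots.txt that invites Googlebot in while telling everything else to stay out looks roughly like this (well-behaved crawlers honor it; scrapers generally never fetch it, which is exactly the point):

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```
|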
the cat thinks most of the people in this thread have a skewed definition of content.
|
One of the ways I've dealt with it is with custom webserver-level applications: never post a direct link to the content. Use a custom hash, decode, and sendfile() the bitch. Otherwise, I've used trivial timestamping and other simple methods to break fuskers. Don't forget to disable support for HTTP TRACE.
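A bare-bones sketch of the hash-and-timestamp idea in Node/TypeScript (the secret and the query parameter names are placeholders): every link carries an expiry timestamp plus an HMAC over the path and expiry, so a fusker can't mint valid URLs and saved links go dead after the TTL.

```typescript
import { createHmac, timingSafeEqual } from "crypto";

const SECRET = "replace-with-a-real-secret";   // placeholder secret

// Issue a time-limited link such as /media/clip.mp4?e=1735689600&s=9f2ab...
function signUrl(path: string, ttlSeconds: number): string {
  const expires = Math.floor(Date.now() / 1000) + ttlSeconds;
  const sig = createHmac("sha256", SECRET)
    .update(`${path}:${expires}`)
    .digest("hex");
  return `${path}?e=${expires}&s=${sig}`;
}

// Verify before sendfile()-ing anything; reject expired or forged links.
function verifyUrl(path: string, e: string, s: string): boolean {
  if (Number(e) < Math.floor(Date.now() / 1000)) return false;  // expired
  const expected = createHmac("sha256", SECRET)
    .update(`${path}:${e}`)
    .digest("hex");
  return expected.length === s.length &&
    timingSafeEqual(Buffer.from(expected), Buffer.from(s));
}
```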
|
Quote:
by that theory everyone has permission, so why would it be implied for google but not implied for others? is it called the googlerobots.txt?
Quote:
maybe that's what he wants: to become so rich and well known that you will beg him to come scrape your site, just like google. |