GoFuckYourself.com - Adult Webmaster Forum

How can you stop THIEVES SCRAPING your SITES? (https://gfy.com/showthread.php?t=924370)

CunningStunt 08-28-2009 01:26 AM

How can you stop THIEVES SCRAPING your SITES?
 
I have a content rich site that this is happening to more and more.

You can ban their IP addresses, but they soon pop up with another.

Is there any way to lock down the server better, yet still allow the friendly SE bots? It's becoming an increasing problem. :mad:

cLin 08-28-2009 01:34 AM

We use Strongbox, which says it has an anti-scraping technology, but to be honest I've never dealt with it. The logic they use makes sense though.

http://bettercgi.com/strongbox/features.html#antislurp

CunningStunt 08-28-2009 02:59 AM

Thanks Chris.

I forgot to mention the site is written in ASP.NET, and I don't know exactly how the different methods work with different server setups.

Knowing nothing much about server security, I would just have a list of all known "friendly" search engine bots, and everything else gets fooked out if there are multiple session attempts. Is that how it works?

I should contact my dedicated host, I'm sure they'll be able to recommend something, that's their speciality after all.

Klen 08-28-2009 03:59 AM

Well, one of the methods I use is to delete old content and replace it with completely new content at a different location.

CunningStunt 08-28-2009 04:07 AM

Quote:

Originally Posted by KlenTelaris (Post 16245232)
Well, one of the methods I use is to delete old content and replace it with completely new content at a different location.

Almost all my content is ranking and linked to. How would doing 301s etc. help? I don't want to be shifting stuff around all the time.

I'd like to simply shut off programmatic attempts to access my sites from unknown IP addresses, period.

Dirty Dane 08-28-2009 04:28 AM

Is it a paysite or free? And when you say "thieves"... are they duplicating your website or just downloading content?

CunningStunt 08-28-2009 04:41 AM

Quote:

Originally Posted by Dirty Dane (Post 16245377)
Is it a paysite or free? And when you say "thieves"... are they duplicating your website or just downloading content?

Free site.

In the first instance they had an autoscraper running daily to pull the new entries and post them to a *cough* **new** site, which I soon had closed down via DMCA. Now I've caught it at an earlier stage.

SeanLEE 08-28-2009 04:46 AM

Just put advertisements in your player, then they are advertising for you.
:thumbsup

quantum-x 08-28-2009 04:49 AM

Honestly, there's not much you can do.
If you can see it, you can steal it... cookies, restrictions, CAPTCHAs etc. can all be defeated.

The best suggestions are: watermark, adverts, etc.

One trick that really does stump most spiders, however, is to link to your content via JS or CSS.

Dirty Dane 08-28-2009 05:27 AM

Quote:

Originally Posted by CunningStunt (Post 16245484)
Free site.

In the first instance they had an autoscraper running daily to pull the new entries and post them to a *cough* **new** site, which I soon had closed down via DMCA. Now I've caught it at an earlier stage.

Ok....

Banning IPs won't help much, because they might just use rotating IPs.

Banning the user-agents of known offline browsers could help, but that is also easy to get around (they just send fake user-agent info).

If you have movies, put them in JavaScript (the agent usually can't read that) and keep the text outside it (for the SEs). The downside of this is surfers who have disabled JavaScript.

If you have a decent CPU on your server, then trick their "browser" into following fake links with long delays (like a CGI link) or fake targets that temporarily kill and ban clients making too many attempts. This will lag or cut off their agent temporarily. (It might be a very good idea to exclude those links in robots.txt, because you do not want to trick Google.)

You can also create something that their agent doesn't know how to sort out. The more garbage, the better. For instance, if you have invisible links to the "tubegirl" :winkwink:, they will end up with all kinds of shit they have to sort out manually.
You can also structure your site in a way that has no logic and is hard for software to reconstruct.

You can also watermark your stuff with "licensed to...", but talk with your sponsors before doing that. And if they promote the same sponsor, then you should talk about that too :)

faxxaff 08-28-2009 05:30 AM

You can use trap files ... like 1x1-pixel image files named 6tjgTTvtfgh.jpg or something like that. If one gets downloaded, you know it's an illegal bot. Now, write a script that will block that user based on IP and session ID, or have him download some malicious bullshit ...
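
A minimal sketch of the trap-file idea above, in Python for illustration (the OP's site is ASP.NET, so this is only the logic, not a drop-in). The `/img/` directory, the one-hour ban, and the `TrapBlocker` name are assumptions; only the blocking half is shown, keyed on IP alone.

```python
import time

# Hypothetical trap URL: linked invisibly from the pages, never shown to surfers.
TRAP_PATHS = {"/img/6tjgTTvtfgh.jpg"}
BAN_SECONDS = 3600  # assumed ban length


class TrapBlocker:
    """Ban any IP that requests a trap file, for a fixed period."""

    def __init__(self, trap_paths=TRAP_PATHS, ban_seconds=BAN_SECONDS):
        self.trap_paths = set(trap_paths)
        self.ban_seconds = ban_seconds
        self.banned_until = {}  # ip -> unix time at which the ban expires

    def allow(self, ip, path, now=None):
        """Return True if the request should be served, False if blocked."""
        now = time.time() if now is None else now
        if self.banned_until.get(ip, 0) > now:
            return False  # still banned from an earlier trap hit
        if path in self.trap_paths:
            self.banned_until[ip] = now + self.ban_seconds
            return False  # no human clicks an invisible 1x1 image
        return True
```

The web application would call `allow(ip, path)` before serving each request. As noted elsewhere in the thread, rotating IPs weaken any IP-keyed ban, so this is one layer rather than a complete defence.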

rowan 08-28-2009 06:11 AM

I'm facing this problem myself as I have a site with millions of pages. Currently if an IP downloads too many pages within a short period of time and it's NOT on the whitelist (e.g. the IPs of Google) it gets firewalled for a period of time. It's a pretty aggressive approach but it works (for now).

Most of the people trying to scrape don't bother trying to hide it, so their 3 fetches a second get picked up pretty quickly.
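
The approach described above could be sketched roughly like this in Python. The thresholds, the `ScrapeThrottle` name, and the whitelist range are all assumptions; `192.0.2.0/24` is a documentation-only range standing in for real search engine addresses, and the "firewalling" is reduced to a True/False answer.

```python
import ipaddress
import time
from collections import defaultdict, deque


class ScrapeThrottle:
    """Sliding-window page counter per IP, with a whitelist and a timed ban."""

    def __init__(self, max_hits=30, window=10.0, ban_seconds=600.0, whitelist=()):
        self.max_hits = max_hits        # pages allowed per window
        self.window = window            # window length in seconds
        self.ban_seconds = ban_seconds  # how long an offender stays blocked
        self.whitelist = [ipaddress.ip_network(n) for n in whitelist]
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.banned_until = {}          # ip -> unix time at which the ban expires

    def is_whitelisted(self, ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.whitelist)

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        if self.is_whitelisted(ip):
            return True  # friendly bots are never counted
        if self.banned_until.get(ip, 0.0) > now:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.banned_until[ip] = now + self.ban_seconds
            recent.clear()
            return False
        return True
```

With `max_hits=30` and `window=10.0`, a client doing 3 fetches a second crosses the limit within the first window, while a surfer clicking through pages does not.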

Dirty Dane 08-28-2009 06:52 AM

Please post the domains of these bastards. Without the http

Agent 488 08-28-2009 07:05 AM

i was going to say dmca but looks like you did it.

please no idiots come in here saying that if your site is online you are consenting for it to be scraped. fuck off.

SmokeyTheBear 08-28-2009 09:20 AM

Quote:

Originally Posted by budsbabes (Post 16246036)
please no idiots come in here saying that if your site is online you are consenting for it to be scraped. fuck off.

lol well i hate to do this but....

Have you given consent for google to scrape you ? yahoo ? bing ? etc.

Basically you have an apple tree in a public place. You aren't stopping anyone from picking your apples, in fact you like some people picking your apples ( google ) even though they never asked to pick your apples and furthermore they are showing your apples on their site and making money from it. shitloads of money.

Can't be too surprised when some fatty comes by and picks all your apples one day, after watching everyone else pick them and you not stopping them :)

DonovanTrent 08-28-2009 09:28 AM

Quote:

Originally Posted by CunningStunt (Post 16244739)
I have a content rich site that this is happening to more and more.

I'm just kind of wondering why you have a "content rich site" that is free. But that's just me, I may be missing something here.

Agent 488 08-28-2009 09:32 AM

search engines link to my content. in order for someone to read it they have to click through to my site. there is a difference.



Quote:

Originally Posted by SmokeyTheBear (Post 16246597)
lol well i hate to do this but....

Have you given consent for google to scrape you ? yahoo ? bing ? etc.

Basically you have an apple tree in a public place. You aren't stopping anyone from picking your apples, in fact you like some people picking your apples ( google ) even though they never asked to pick your apples and furthermore they are showing your apples on their site and making money from it. shitloads of money.

Can't be too surprised when some fatty comes by and picks all your apples one day, after watching everyone else pick them and you not stopping them :)


seeandsee 08-28-2009 09:32 AM

Quote:

Originally Posted by Dirty Dane (Post 16245731)
Ok....

Banning IPs won't help much, because they might just use rotating IPs.

Banning the user-agents of known offline browsers could help, but that is also easy to get around (they just send fake user-agent info).

If you have movies, put them in JavaScript (the agent usually can't read that) and keep the text outside it (for the SEs). The downside of this is surfers who have disabled JavaScript.

If you have a decent CPU on your server, then trick their "browser" into following fake links with long delays (like a CGI link) or fake targets that temporarily kill and ban clients making too many attempts. This will lag or cut off their agent temporarily. (It might be a very good idea to exclude those links in robots.txt, because you do not want to trick Google.)

You can also create something that their agent doesn't know how to sort out. The more garbage, the better. For instance, if you have invisible links to the "tubegirl" :winkwink:, they will end up with all kinds of shit they have to sort out manually.
You can also structure your site in a way that has no logic and is hard for software to reconstruct.

You can also watermark your stuff with "licensed to...", but talk with your sponsors before doing that. And if they promote the same sponsor, then you should talk about that too :)

nice tips

Agent 488 08-28-2009 09:33 AM

what is so hard to understand. you can't think of one content rich free site? :upsidedow

Quote:

Originally Posted by DonovanTrent (Post 16246634)
I'm just kind of wondering why you have a "content rich site" that is free. But that's just me, I may be missing something here.


Agent 488 08-28-2009 09:35 AM

i understand where you are coming from, but there is a difference between a content preview and a full scrape.

Quote:

Originally Posted by SmokeyTheBear (Post 16246597)
lol well i hate to do this but....

Have you given consent for google to scrape you ? yahoo ? bing ? etc.

Basically you have an apple tree in a public place. You aren't stopping anyone from picking your apples, in fact you like some people picking your apples ( google ) even though they never asked to pick your apples and furthermore they are showing your apples on their site and making money from it. shitloads of money.

Can't be too surprised when some fatty comes by and picks all your apples one day, after watching everyone else pick them and you not stopping them :)


SmokeyTheBear 08-28-2009 09:44 AM

Quote:

Originally Posted by budsbabes (Post 16246655)
i understand where you are coming from, but there is a difference between a content preview and a full scrape.

google would like you to think that anyways :winkwink: did you know google offers a service that allows users to browse your site without most of your ads? they scrape the entire page on the fly and only display the text.

even without that, all google is doing is cutting your page up and displaying it as separate items, the only difference is they get to show way more ads in the process than someone who just scrapes the page and repukes it up.

DonovanTrent 08-28-2009 09:58 AM

Quote:

Originally Posted by budsbabes (Post 16246649)
what is so hard to understand. you can't think of one content rich free site? :upsidedow

Depends on the content. I've seen plenty of content-rich sites that should be nowhere near free.

raymor 08-28-2009 10:11 AM

Quote:

Originally Posted by cLin (Post 16244749)
We use Strongbox, which says it has an anti-scraping technology, but to be honest I've never dealt with it. The logic they use makes sense though.

http://bettercgi.com/strongbox/features.html#antislurp

We have Throttlebox, specifically designed for this type of thing.
Throttlebox is an Apache module. The OP says he used ASP.NET.
I wonder if that means he's hosting on a Windows desktop instead
of a server OS running Apache.

CunningStunt 08-28-2009 02:47 PM

Quote:

Originally Posted by DonovanTrent (Post 16246634)
I'm just kind of wondering why you have a "content rich site" that is free. But that's just me, I may be missing something here.

Yes you are :winkwink:

There's some good advice in here, thanks for taking the time folks. :thumbsup

raymor 08-29-2009 04:30 PM

Quote:

Originally Posted by faxxaff (Post 16245737)
You can use trap files ... like 1x1-pixel image files named 6tjgTTvtfgh.jpg or something like that. If one gets downloaded, you know it's an illegal bot. Now, write a script that will block that user based on IP and session ID, or have him download some malicious bullshit ...

That's one of several techniques we use. That technique works sometimes, but
often does not. We don't send people malicious files of course, we're not criminals,
but we do use traps. It's a useful part of a multi-layered approach, but not at all
sufficient on its own.

raymor 08-29-2009 04:38 PM

Quote:

Originally Posted by SmokeyTheBear (Post 16246597)
lol well i hate to do this but....

Have you given consent for google to scrape you ? yahoo ? bing ? etc.

Yes, he's given Google permission to index, not scrape, the site, and thereby promote it.
If you've been a webmaster for more than a few days, you know about robots.txt.
By choosing not to put up a "no indexing" sign (robots.txt), you've given implied
permission for Google to promote you by adding you to their index. I'd bet the people
scraping (not indexing) the site don't check for robots.txt.

Besides, use an ounce or so of common sense. Obviously webmasters want their
porno sites listed in search engines. Duh.
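
To illustrate the robots.txt point, here is a small sketch of the check a well-behaved crawler performs before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `/trap/` path, and the `may_fetch` helper are made up for the example; a scraper that skips this check would walk straight into the disallowed path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is indexable except the trap directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler that honours robots.txt never requests anything under `/trap/`, so a hit there is a reasonable (though not conclusive) sign of a client that ignores the file.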

Angry Jew Cat - Banned for Life 08-29-2009 04:56 PM

the cat thinks most of the people in this thread have a skewed definition of content.

DonovanTrent 08-29-2009 05:27 PM

Quote:

Originally Posted by CunningStunt (Post 16248044)
Yes you are :winkwink:

I guess I was just asking, based on the content being so valuable to you as to be concerned about protecting it. That's all.

GrouchyAdmin 08-29-2009 06:25 PM

One of the ways I've dealt with it is with custom webserver-level applications: never post a direct link to the content. Use a custom hash, decode, and sendfile() the bitch. Otherwise, I've used trivial timestamping and other simple methods to break fuskers. Don't forget to disable support for HTTP TRACE.
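
One way to sketch the "no direct links, plus timestamping" idea is an HMAC-signed, expiring link. This is a swapped-in standard technique, not necessarily the custom hash described above. The `/get` endpoint, the parameter names `f`, `e`, `t`, and the secret are all hypothetical, and the actual file delivery (the sendfile step) is left out.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET = b"change-me"  # hypothetical server-side secret, never sent to clients


def sign(path, expires, secret=SECRET):
    """HMAC over the file path and its expiry time."""
    msg = f"{path}|{expires}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def make_link(path, ttl=300, now=None, secret=SECRET):
    """Build a link that is only valid for `ttl` seconds."""
    now = int(time.time() if now is None else now)
    expires = now + ttl
    query = urlencode({"f": path, "e": expires, "t": sign(path, expires, secret)})
    return "/get?" + query


def verify_link(link, now=None, secret=SECRET):
    """Return the file path if the link is intact and unexpired, else None."""
    params = parse_qs(urlsplit(link).query)
    try:
        path, expires, token = params["f"][0], int(params["e"][0]), params["t"][0]
    except (KeyError, ValueError):
        return None  # missing or malformed parameters
    now = time.time() if now is None else now
    if now > expires:
        return None  # stale link, e.g. one harvested earlier by a fusker
    expected = sign(path, expires, secret)
    if not hmac.compare_digest(expected.encode(), token.encode()):
        return None  # path or expiry was tampered with
    return path
```

The page would embed `make_link(...)` output instead of the real file location, and the handler behind `/get` would serve the file only when `verify_link` returns a path. Guessing sequential filenames then fails because each URL needs a valid token.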

SmokeyTheBear 08-30-2009 01:50 AM

Quote:

Originally Posted by raymor (Post 16251984)
Yes, he's given Google permission to index, not scrape, the site, and thereby promote it.

he did? i didn't see the part where he defined his submission to google, i thought google just scraped his site like they scrape every site..

Quote:

Originally Posted by raymor (Post 16251984)
If you've been a webmaster for more than a few days, you know about robots.txt.

not a few , just a couple days

Quote:

Originally Posted by raymor (Post 16251984)
By choosing not to put up a "no indexing" sign (robots.txt), you've given implied
permission for Google to promote you by adding you to their index.

lol so by not posting a sign saying " do not break my car windows" you are implying that it's okay to smash your car windows ? ok got it..

by that theory everyone has permission , why would it be implied for google but not implied for others ? is it called the googlerobots.txt ?


Quote:

Originally Posted by raymor (Post 16251984)
Besides, use an ounce or so of common sense. Obviously webmasters want their
porno sites listed in search engines. Duh.

now they do .. before google it was just software downloading everything on your server :)

maybe that's what he wants to do: become so rich and well known that you will beg him to come scrape your site, just like google.


All times are GMT -7.