Having google NOT spider you? How? - GoFuckYourself.com

DrewKole · 09-17-2002, 11:10 PM

Basically, I've got a ton of mirrors of 1) TGP galleries and 2) AVS sites, with the same site mirrored to different AVSs...

Does anyone know a definite way of having google not spider or index a page?

I'd like to have all but 1 of my mirrors not spidered, since Google apparently doesn't like mirrors on the same domain and considers them spam, and who the hell knows where Googlebot might find one of my mirrors.

Any hints on how to get away with the mirroring of pages, without using multiple domains for the same fuckin site? =)

BJ · 09-17-2002, 11:15 PM

How should I request that Google not crawl part or all of my site?

The standard for robot exclusion given at http://www.robotstxt.org/wc/norobots.html provides for a file called robots.txt that you can put on your server to exclude Googlebot and other web crawlers. (Googlebot has a user-agent of "Googlebot".)

Googlebot also understands some extensions to the robots.txt standard. Disallow patterns may include * to match any sequence of characters, and patterns may end in $ to indicate the end of a name. For example, to prevent Googlebot from crawling files that end in .gif, you may use the following robots.txt entry:

User-Agent: Googlebot
Disallow: /*.gif$

Please note that Googlebot does not interpret a 401/403 response ("Unauthorized"/"Forbidden") to a robots.txt fetch as a request not to crawl any pages on the site. To prevent Googlebot and other web crawlers from crawling any page on your site, you may use the following robots.txt entry:

User-Agent: *
Disallow: /

Please note also that each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, if you wanted to allow all filetypes to be served via http but only .html pages to be served via https, the robots.txt file for the http protocol (http://yourserver.com/robots.txt) would be:

User-Agent: *
Allow: /

The robots.txt file for the https protocol (https://yourserver.com/robots.txt) would be:

User-Agent: *
Disallow: /
Allow: /*.html$

Another standard which is more convenient for page-by-page use involves adding a <META> tag to an HTML page to tell robots not to index the page or not to follow the links it contains. This standard is described at http://www.robotstxt.org/wc/exclusion.html. You may also want to read what the HTML standard has to say about these tags.

Remember that changing your server's robots.txt file or changing the <META> tags on its pages will not cause an immediate change in the results that Google returns, since your changes must propagate to Google's next index of the web before being reflected in Google search results.
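
For the mirror question specifically, here is a rough way to sanity-check a robots.txt before uploading it. It is only a sketch using Python's standard urllib.robotparser. The directory names /mirror2/ and /mirror3/ and the helper googlebot_allowed are made up for illustration, and it assumes each mirror lives in its own directory. As far as I know the standard-library parser only does plain prefix matching and does not implement the * and $ extensions mentioned above, so the sketch sticks to plain paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /mirror2/ and /mirror3/ are made-up directory
# names standing in for the mirrors that should stay out of Google.
ROBOTS_TXT = """\
User-Agent: Googlebot
Disallow: /mirror2/
Disallow: /mirror3/
"""


def googlebot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets Googlebot fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


if __name__ == "__main__":
    for path in ("/main/index.html", "/mirror2/index.html", "/mirror3/index.html"):
        url = "http://yourserver.com" + path
        print(path, "allowed" if googlebot_allowed(ROBOTS_TXT, url) else "blocked")
```

This only tells you how a standards-following parser reads the file; it says nothing about when Google will actually pick the change up.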

XM · 09-17-2002, 11:31 PM

Or use meta tags directly in the particular page.

Don't index, but follow links:
META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"

Don't index, don't follow links either:
META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"

Don't forget to enclose each tag in angle brackets; this f**king VB code doesn't allow me to post them...

XM
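
If you want to double-check that a mirror page really carries the tag (angle brackets included), something like the following could work. It is a minimal sketch with Python's standard html.parser. The class RobotsMetaParser, the function robots_directives, and the sample page are all hypothetical names and data for illustration, not anything Google provides.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of lowercase robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Made-up sample page for illustration.
SAMPLE_PAGE = """<html><head>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<title>mirror</title>
</head><body>gallery</body></html>"""

if __name__ == "__main__":
    found = robots_directives(SAMPLE_PAGE)
    print("noindex" in found, "nofollow" in found)
```

A page where the check comes back without "noindex" is one that a crawler honoring the tag would still be free to index.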

Pornwolf · 03-21-2003, 04:33 AM

Meta Tags 101. Take notes guys.

Jer · 03-21-2003, 05:18 AM
