How should I request that Google not crawl part or all of my site?
The standard for robot exclusion described at http://www.robotstxt.org/wc/norobots.html provides for a file called robots.txt that you can put on your server to exclude Googlebot and other web crawlers. (Googlebot has a user-agent of "Googlebot".)
Googlebot also understands some extensions to the robots.txt standard. Disallow patterns may include * to match any sequence of characters, and patterns may end in $ to indicate the end of a name. For example, to prevent Googlebot from crawling files that end in .gif, you may use the following robots.txt entry:
User-Agent: Googlebot
Disallow: /*.gif$
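The * wildcard may also be used on its own. For example, to prevent Googlebot from crawling any URL that contains a question mark (such as a dynamically generated page), you might use an entry like:
User-Agent: Googlebot
Disallow: /*?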
Please note that Googlebot does not interpret a 401/403 response ("Unauthorized"/"Forbidden") to a robots.txt fetch as a request not to crawl any pages on the site. To prevent Googlebot and other web crawlers from crawling any page on your site, you may use the following robots.txt entry:
User-Agent: *
Disallow: /
Please note also that each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, if you wanted to allow all file types to be served via http but only .html pages to be served via https, the robots.txt file for the http protocol (http://yourserver.com/robots.txt) would be:
User-Agent: *
Allow: /
The robots.txt file for the https protocol (https://yourserver.com/robots.txt) would be:
User-Agent: *
Disallow: /
Allow: /*.html$
Another standard, which is more convenient for page-by-page use, involves adding a <META> tag to an HTML page to tell robots not to index the page or not to follow the links it contains. This standard is described at http://www.robotstxt.org/wc/exclusion.html. You may also want to read what the HTML standard has to say about these tags. Remember that changing your server's robots.txt file or changing the <META> tags on its pages will not cause an immediate change in the results that Google returns: your changes must propagate to Google's next index of the web before they are reflected in Google search results.
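For example, a page that should be neither indexed nor have its links followed might include the following tag in its <HEAD> section:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">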