Google to make robots.txt an Internet standard after 25 years

  • Bladewire
    StraightBro
    • Aug 2003
    • 56228

    #1


    Google demanding more free work & expense from people to bend to their fucking will

    Google to make robots.txt an Internet standard after 25 years

    The Robots Exclusion Protocol (REP) — better known as robots.txt — allows website owners to exclude web crawlers and other automatic clients from accessing a site. “One of the most basic and critical components of the web,” Google wants to make robots.txt an Internet standard after 25 years.
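As an illustration of the exclusion mechanism described above, here is a minimal Python sketch using the standard-library `urllib.robotparser`. The rules, the `ExampleBot` agent name, the `example.com` URLs, and the `is_allowed` helper are assumptions made up for this example and do not come from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A URL under /private/ is excluded; anything else is allowed.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/report.html"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/blog/post.html"))
```

Note that this only tells a well-behaved crawler what the site owner asked for; honoring the answer is up to the crawler.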

    Despite its prevalence, REP never became an Internet standard, with developers interpreting the “ambiguous de-facto” protocol “somewhat differently over the years.” Additionally, it doesn’t address modern edge cases, with web devs and site owners ultimately still having to worry about implementation today.

    On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?
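The two corner cases mentioned above (a leading BOM and an oversized file) can be sketched in Python roughly as follows. This is only one possible way a crawler might handle them: the `decode_robots` helper and the 500 KiB cap are assumptions for this example, since the quoted text does not state a limit.

```python
# Assumed parsing cap for this sketch; the quoted text does not name a number.
MAX_ROBOTS_BYTES = 500 * 1024


def decode_robots(raw: bytes, limit: int = MAX_ROBOTS_BYTES) -> list[str]:
    """Decode a fetched robots.txt body into lines.

    - Only the first `limit` bytes are considered, so a file that is hundreds
      of megabytes large cannot exhaust memory in the parser.
    - The "utf-8-sig" codec drops a leading UTF-8 BOM if a text editor added one.
    - errors="replace" keeps decoding going if the cut lands mid-character.
    """
    text = raw[:limit].decode("utf-8-sig", errors="replace")
    return text.splitlines()


if __name__ == "__main__":
    body = b"\xef\xbb\xbfUser-agent: *\nDisallow: /private/\n"
    print(decode_robots(body))
```

Without the BOM handling, the first line would start with an invisible character and a strict parser might not recognize it as a `User-agent` line.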

    To address this, Google — along with the original author of the protocol from 1994, webmasters, and other search engines — has now documented how REP is used on the modern web and submitted it to the IETF.

    The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP. These fine grained controls give the publisher the power to decide what they’d like to be crawled on their site and potentially shown to interested users. It doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web.

    The robots.txt standard is currently a draft, with Google requesting comments from developers. The standard will be adjusted as web creators specify “how much information they want to make available to Googlebot, and by extension, eligible to appear in Search.”

    This standardization will result in “extra work” for developers that parse robots.txt files, with Google open sourcing the robots.txt parser used in its production systems.

    This library has been around for 20 years and it contains pieces of code that were written in the 90’s. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.


    Skype: CallTomNow

  • brassmonkey
    Pay It Forward
    • Sep 2005
    • 77396

    #2
    they want to reduce page removal right?? this is something they have to pay for currently. i think they are trimming the fat to focus on tech items. i use robots on everything
    TRUMP 2026 KEKAW!!! - The Laken Riley Act Is Law!
    DACA ENDED - SUPPORT AZ HCR 2060 52R - email: brassballz-at-techie.com


    • trevesty
      Confirmed User
      • Aug 2006
      • 3810

      #3
      Been running websites for over 15 years and making money from it. Tens of thousands of sites at least...

      And every single one of them has had a robots.txt file. I don't see the issue.
      The Fap Guide


      • Bladewire
        StraightBro
        • Aug 2003
        • 56228

        #4
        Originally posted by trevesty
        And every single one of them has had a robots.txt file. I don't see the issue.
        It's not going to be the robots.txt that it's always been.

        It will be mandatory and you'll have to add all sorts of parameters that you don't currently have, and likely aren't aware of, and if any of them are null, or if you don't have the robots.txt file exactly how Google wants it, you will be dinged and your SE placement will suffer.




        • brassmonkey
          Pay It Forward
          • Sep 2005
          • 77396

          #5
          Originally posted by Bladewire
          It's not going to be the robots.txt that it's always been.

          It will be mandatory and you'll have to add all sorts of parameters that you don't currently have, and likely aren't aware of, and if any of them are null, or if you don't have the robots.txt file exactly how Google wants it, you will be dinged and your SE placement will suffer.
          a sitemap is more complex, and i've had no issues with google, bing, or yandex telling me to change a thing.


          • Bladewire
            StraightBro
            • Aug 2003
            • 56228

            #6
            Originally posted by brassmonkey
            a sitemap is more complex, and i've had no issues with google, bing, or yandex telling me to change a thing.
            We agree




            • rowan
              Too lazy to set a custom title
              • Mar 2002
              • 17393

              #7
              Funny how Google is going on about making a de-facto protocol a standard, when they explicitly ignore a fairly important (IMHO) de-facto directive: Crawl-delay.

              Website: I'm asking you nicely to please limit your fetching to once per 60 seconds.

              GoogleBot: No.
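For readers unfamiliar with the directive, here is a rough Python sketch of how a crawler that chooses to honor `Crawl-delay` could read it, using the standard-library `urllib.robotparser`. The sample rules, the `ExampleBot` name, and the `requested_delay` helper are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the site asks every crawler to wait 60 seconds between fetches.
ROBOTS_LINES = [
    "User-agent: *",
    "Crawl-delay: 60",
    "Disallow: /tmp/",
]


def requested_delay(lines, agent, default=0.0):
    """Return the Crawl-delay (in seconds) that applies to `agent`,
    or `default` when the file does not set one."""
    parser = RobotFileParser()
    parser.parse(lines)
    delay = parser.crawl_delay(agent)
    return default if delay is None else float(delay)


if __name__ == "__main__":
    print(requested_delay(ROBOTS_LINES, "ExampleBot"))
```

A polite crawler would sleep for that many seconds between requests to the same host; nothing in the file format forces it to.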


              • thommy
                Confirmed User
                • Jun 2003
                • 5469

                #8
                Originally posted by brassmonkey
                they want to reduce page removal right?? this is something they have to pay for currently. i think they are trimming the fat to focus on tech items. i use robots on everything
                I think this is just one reason; the other is that they don't get fined for what they show.

                actually Google shows many documents and websites that do not have a robots.txt

                now let's imagine a funny example:

                a weapons company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without an explicit request to do so - they would also be in trouble.

                there is no single INTERNET law, and google works worldwide under the laws of 255 different countries.
                I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

                we can see everywhere on the internet that rules and laws are going to an excessive point. users have to agree to cookies (even though this was a common technique for the past 25 years).

                in addition, an internet presence is not necessarily a privilege of companies. consumer protection can also apply here to the site operator.
                Open for handpicked publishers and advertisers:
                www.trafficfabrik.com


                • magneto664
                  God Bless You
                  • Aug 2014
                  • 1470

                  #9
                  Originally posted by thommy
                  a weapons company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without an explicit request to do so - they would also be in trouble.
                  Every day thousands of other bots scan your website: ahrefs, majestic, exploit-hunting bots, advert bots, other shit bots. Most of them come loaded with default directories, file names, and directory paths for the scripts running on your site. If you do not want something to appear on the Internet, you do not upload it to the internet. Simple.

                  Originally posted by thommy
                  I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.
                  If you list in the robots.txt file which files or directories to skip, Google will possibly comply. But for the other bots it will be a gift.
                  magneto664 📧 gmail.com
                  Cams.Zone 💘 Best CDN for Adult Content
                  My Fav: 👍 Chaturbate 👍 Stripchat 👍 AdultFriendFinder


                  • Klen
                    • Aug 2006
                    • 32235

                    #10
                    Originally posted by thommy
                    I think this is just one reason; the other is that they don't get fined for what they show.

                    actually Google shows many documents and websites that do not have a robots.txt

                    now let's imagine a funny example:

                    a weapons company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without an explicit request to do so - they would also be in trouble.

                    there is no single INTERNET law, and google works worldwide under the laws of 255 different countries.
                    I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

                    we can see everywhere on the internet that rules and laws are going to an excessive point. users have to agree to cookies (even though this was a common technique for the past 25 years).

                    in addition, an internet presence is not necessarily a privilege of companies. consumer protection can also apply here to the site operator.
                    The average internet user does not have any knowledge about robots and crawling, so you can't really expect everyone to follow it. A better solution would be, instead of a crawler crawling everything on a website, to have it explicitly stated what should be crawled.
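The opt-in idea in this post can already be approximated with today's syntax by disallowing everything and allowing only specific paths. A minimal Python sketch follows, using the standard-library `urllib.robotparser`; the rules, the `ExampleBot` name, the `example.com` URLs, and the `is_allowed` helper are assumptions for illustration. The `Allow` line is placed first because Python's parser applies the first matching rule in file order, and other parsers may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list style rules: only /public/ may be crawled.
ALLOWLIST_ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_allowed(ALLOWLIST_ROBOTS_TXT, "ExampleBot", "https://example.com/public/page.html"))
    print(is_allowed(ALLOWLIST_ROBOTS_TXT, "ExampleBot", "https://example.com/admin/"))
```

This still depends on the site owner writing the file, which is the gap the post is pointing at: a site with no robots.txt at all is treated as crawlable.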


                    • thommy
                      Confirmed User
                      • Jun 2003
                      • 5469

                      #11
                      Originally posted by magneto664
                      Every day thousands of other bots scan your website: ahrefs, majestic, exploit-hunting bots, advert bots, other shit bots. Most of them come loaded with default directories, file names, and directory paths for the scripts running on your site. If you do not want something to appear on the Internet, you do not upload it to the internet. Simple.
                      but these bots are not google. nobody will try to sue them.

                      I know very well how robots.txt works, but the point is that millions who have an internet presence don't.

                      if google crawls something from their site WITHOUT AN EXPLICIT request to do so, one judge or another may see them as a "victim" and they can sue Google for millions.

                      this is why it would make sense to make robots.txt THE rule for crawling your site, and sites without robots.txt would not be touched.




                      Originally posted by magneto664
                      If you list in the robots.txt file which files or directories to skip, Google will possibly comply. But for the other bots it will be a gift.
                      as i said - if there are no clear rules for that, it will open the door wide for lawsuits. and it would not be the others who have to fight them - it would be the one who has the money to pay.


                      • thommy
                        Confirmed User
                        • Jun 2003
                        • 5469

                        #12
                        Originally posted by KlenTelaris
                        The average internet user does not have any knowledge about robots and crawling, so you can't really expect everyone to follow it. A better solution would be, instead of a crawler crawling everything on a website, to have it explicitly stated what should be crawled.
                        that is exactly what i meant.

                        the laws in the various countries are so different that you can not even decide who is a professional who HAS to know it and who is not.

                        when the internet started nobody ever thought about such things as privacy and permission to crawl a page. it was simply assumed that everyone who posts something on the internet wants others to find it. this has changed a lot in the meantime, and the views on right and wrong around the world are so completely different that everything has to be EXPLICITLY allowed and not just assumed.

