Need (php) help reading a remote URL page?

  • Pornytoad
    Registered User
    • Mar 2002
    • 24

    #1

    Need (php) help reading a remote URL page?

    Hello,

    I am trying to write a script that reads a submitted URL and scans through the page looking for pop-ups, IFRAMEs, etc. I have written a script that works fine using PHP's file() function...
    PHP Code:
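    // Fetch the page once; "@" suppresses the warning if the request fails.
    $data = @file($fullurl);
    if ($data === false) {
        array_push($Error, "Unable to contact URL host or bad URL.");
        $urlerror = TRUE;
    } else {
        // Join the array of lines into a single string.
        $pagestring = implode("", $data);
    }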
    
    However, the problem comes when a "free host" attaches pop-up code to a client's web page. For some reason the file() and fsockopen() methods completely miss the header and footer code that the free host attaches dynamically, returning only the user's page code. I need to be able to read all the code, including the free host's attachment.

    Has anyone encountered this problem with free hosts? Is there a way to get around it?

    Help!?
    -Toad


    [email protected]
    PornyToad TGP Submit
    Quality TGP Webring
  • spanky
    Confirmed User
    • Apr 2002
    • 231

    #2
    It could be that the free host only writes the headers/footers out on HTTP/1.1 requests. I don't know, just a thought. PHP's file(), fopen(), etc. all make HTTP/1.0 requests.

    cheers


    • Pornytoad
      Registered User
      • Mar 2002
      • 24

      #3
      Thanks Spanky

      I'll look into that. I'll try some Perl code that makes 1.1 requests; I have to research the specifics, but I'll figure it out. I will also post how it goes here.

      If anyone else has suggestions that would be great.


      ----------------------- edit -----------------------------

      I checked the offending free host and here is what was retrieved from the header, so you're probably right...

      "Header line: HTTP/1.1 200 OK"
      Last edited by Pornytoad; 04-23-2002, 09:00 PM.
      -Toad



      • Pornytoad
        Registered User
        • Mar 2002
        • 24

        #4
        Nope

        I used this script, which allows the use of HTTP/1.0 or HTTP/1.1, and I still get the same problem (it does not grab the header and footer code attached by the free host).

        http://www.sloppycode.net/sloppycode/PHP/cm17.html (Code I used)

        I'm again at a loss...


        -Toad

        • Nano
          Confirmed User
          • Apr 2002
          • 414

          #5
          I wrote a script for submitted galleries that counts the number of bytes in each page, so I can check whether a gallery changes and avoid fake galleries.

          To do that I use:
          --------------------------------------------------------------------------------
          $myurl = "http://www.gallery.com/";
          $content_array = @file($myurl);
          $content = @implode("", $content_array);
          --------------------------------------------------------------------------------

          This means $content holds the complete page as a single string: HTML, DHTML, and JS.

          Once you have this, you can check whether certain substrings exist with ereg():
          int ereg ( string pattern, string string [, array regs])
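
The substring check described above can be sketched as follows. This is only an illustration, assuming a modern PHP (7 or later), where ereg() has been removed and preg_match() is the usual replacement. The helper name find_unwanted_markup() and the two patterns are hypothetical, not taken from anyone's script in this thread, and they are nowhere near a complete pop-up filter.

```php
<?php
// Hypothetical helper: report which unwanted constructs appear in fetched HTML.
function find_unwanted_markup(string $content): array
{
    // Illustrative patterns only (assumptions); extend to match your own criteria.
    $patterns = [
        'iframe'      => '/<iframe\b/i',          // opening IFRAME tag, any case
        'window.open' => '/window\.open\s*\(/i',  // JavaScript pop-up call
    ];

    $found = [];
    foreach ($patterns as $name => $regex) {
        if (preg_match($regex, $content) === 1) {
            $found[] = $name;
        }
    }
    return $found;
}
```

Passing the $content string built above to find_unwanted_markup() returns an array of the pattern names that matched, or an empty array for a clean page.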

          Hope this helps

          Good luck!


          • Nano
            Confirmed User
            • Apr 2002
            • 414

            #6
            I forgot to mention that in the code I posted above you will find "@" before the function calls.
            That is so that if the gallery doesn't exist, you won't get Warning messages that reveal your script path ;-)

            Hasta luego!


            • spanky
              Confirmed User
              • Apr 2002
              • 231

              #7
              Originally posted by Pornytoad
              Nope

              I used this script which allows the use of http/1.0 or http/1.1 and I still get the same problem (Not grabbing the attached header and footer free-host code)

              http://www.sloppycode.net/sloppycode/PHP/cm17.html (Code I used)

              Im again at a loss...


              -Toad
              What's the free host, and which page is misbehaving? If you post a URL I'd be interested in taking a peek...

              cheers


              • Pornytoad
                Registered User
                • Mar 2002
                • 24

                #8
                Spanky,

                Here is the URL for the page I'm having issues with:

                http://www.angelfire.com/freak/tamiswet/

                If you load the page in your browser and look at the code, you will see open-window script code as well as IFRAME code loading their little banner at the top and along the bottom as a footer.

                Here is what was returned using the above 1.0/1.1 script, as well as my original @file() method...

                "modified urlsubmit page"
                http://www.pornytoad.com/pagelook.php3
                (look at the bottom where I echo out the page source for the above URL)

                -Tim


                P.S. Nano, thanks for the code input. I was also toying with the idea of saving a "checksum" to keep an eye on page content changing, but opted for a regular spider review instead; as long as a page still passes my criteria, I'll let a change slide.
                -Toad



                • spanky
                  Confirmed User
                  • Apr 2002
                  • 231

                  #9
                  You're not going to like this. I've actually been surprised that more free hosts that encourage TGP hosting haven't done this to get around bots that check for bad HTML/JavaScript/etc.; I guess it's just a matter of time. That's why I was interested in your problem.

                  First, it's none of my business, but you really ought to consider not listing galleries on non-adult free hosts.

                  The deal seems to be that the server checks the User-agent field sent by the browser and decides what header/footer to feed it based on that. It *doesn't* send any header/footer (we're talking HTML headers here, not HTTP headers) if there is no User-agent field, and I presume it also doesn't send them when the User-agent field is PHP 4/*, and I bet the same goes for LWP (for the Perl bots).

                  I telnetted into the server you mentioned and issued a GET: no free-host stuff. I issued an HTTP/1.1 GET without the User-agent field: still no free-host stuff. I then issued an HTTP/1.1 GET with a 'User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)' header and I did get the free-host stuff... miles of JavaScript, cookie code, popups, the whole works.

                  The solution would be to code your bot to pretend to be a browser by issuing a User-agent header, say IE's User-agent header instead of the default PHP or perl LWP user agent header.

                  To do that with php you'll probably have to open up a socket connection to the host and write directly to the socket, making sure to include the User-agent field in the GET request.
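
The socket approach described above might look roughly like this. It is a minimal sketch assuming a modern PHP (7 or later) and plain HTTP on port 80. The function names build_get_request() and fetch_raw() are hypothetical and are not part of the class mentioned earlier in the thread.

```php
<?php
// Build a raw HTTP/1.1 GET request that carries a browser-like User-Agent header.
function build_get_request(string $host, string $path, string $userAgent): string
{
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "User-Agent: {$userAgent}\r\n"
         . "Connection: close\r\n"   // ask the server to close so feof() terminates the read loop
         . "\r\n";                   // blank line ends the header section
}

// Hypothetical helper: send the request over a socket and return the raw response
// (status line, headers, and body), or false if the connection fails.
function fetch_raw(string $host, string $path, string $userAgent, int $timeout = 10)
{
    $fp = @fsockopen($host, 80, $errno, $errstr, $timeout);
    if ($fp === false) {
        return false;
    }

    fwrite($fp, build_get_request($host, $path, $userAgent));

    $response = '';
    while (!feof($fp)) {
        $chunk = fgets($fp, 4096);
        if ($chunk === false) {
            break;
        }
        $response .= $chunk;
    }
    fclose($fp);

    return $response;
}
```

A call such as fetch_raw('www.angelfire.com', '/freak/tamiswet/', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)') returns the headers and body together. Note that an HTTP/1.1 server may reply with chunked transfer encoding, which this sketch does not decode.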

                  Actually, I just peeked again at the PHP class that you mentioned trying, and it says that you can adjust the headers that are sent with the GET request. You should be able to set the User-agent header to "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)" to make the server think that your bot is actually a copy of Internet Explorer on Win98.

                  Here's what I issued to the server to get the version of the page I saw in my browser:

                  GET /freak/tamiswet/ HTTP/1.1
                  Host: www.angelfire.com
                  User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)
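
The same request can also be made without writing to a socket by hand. The following is a sketch only, assuming a PHP version with HTTP stream contexts (newer than what was current when this thread was written). The helper names browser_context_options() and fetch_as_browser() are hypothetical; the 'user_agent' and 'timeout' keys are standard HTTP context options.

```php
<?php
// Build stream-context options so the HTTP wrapper sends a browser-like User-Agent.
function browser_context_options(string $userAgent, int $timeout = 10): array
{
    return [
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,  // sent as the User-Agent request header
            'timeout'    => $timeout,    // read timeout in seconds
        ],
    ];
}

// Hypothetical helper: fetch a URL while presenting the given User-Agent.
// Returns the page body as a string, or false on failure.
function fetch_as_browser(string $url, string $userAgent)
{
    $context = stream_context_create(browser_context_options($userAgent));
    return @file_get_contents($url, false, $context);
}
```

For example, fetch_as_browser('http://www.angelfire.com/freak/tamiswet/', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)') would send the User-agent shown in the request above.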

                  I hope that made sense.

                  cheers


                  • Pornytoad
                    Registered User
                    • Mar 2002
                    • 24

                    #10
                    Spanky,


                    Right on! Thanks for all the effort you put into helping me out. The script used in my last attempt uses socket connections, so I should be able to tweak it a bit like you said (I hope). Wooot!

                    If nothing else, at least I know what's up, thanks to you.

                    Thanks again!

                    Cheers back at ya (you wouldn't happen to be in Ireland, would ya? =P)

                    --------------------------------- edited -------------------------------

                    Wow... that package code above had a settable user agent field! I added the variable setting and POOF! The code is now visible. Check the "pagelook.php3" link above to see that it spits out the code now =)

                    --------------------------------------------------------------------------
                    Last edited by Pornytoad; 04-24-2002, 11:19 PM.
                    -Toad



                    • spanky
                      Confirmed User
                      • Apr 2002
                      • 231

                      #11
                      Originally posted by Pornytoad
                      Right On! Thanks for all the effort you put into helping me out
                      No probs. I was already suspicious, so when I telnetted into the server it became obvious quite quickly.

                      And no, not Ireland, Canada; the true north strong and freezing.

                      cheers
