Best way to do this? python/ruby/perl/sed/awk/php/etc?

  • fris
    Too lazy to set a custom title
    • Aug 2002
    • 55679

    #1

    Best way to do this? python/ruby/perl/sed/awk/php/etc?

    I have a list of links.

    example

    Code:
    <h3>search engine links</h3>
    <a href="http://google.com">google</a>
    <a href="http://www.bing.com">bing</a>
    <a href="http://www.yahoo.com">yahoo</a>
    <h3>payment links</h3>
    <a href="http://www.paypal.com">paypal</a>
    <a href="http://www.paxum.com">paxum</a>
    i only want to get the search engine links being

    Code:
    <a href="http://google.com">google</a>
    <a href="http://www.bing.com">bing</a>
    <a href="http://www.yahoo.com">yahoo</a>
    best way to go about this?

    i was using sed, but it's printing the 2nd h3 as well

    Code:
    sed -n '/<h3>/,/<\/h3>/p' test.txt
    any input would be great ;)
    Since 1999: 69 Adult Industry awards for Best Hosting Company and professional excellence.
  • alcstrategy
    Confirmed User
    • May 2012
    • 124

    #2
    have you tried using xpath?

    i don't know exactly what you're doing, but for this example //a[position()<4] will return the three search engine links. i'm sure xpath could handle whatever you wanted to do

    i guess you just want everything after the first h3 tag, and the number of links is probably dynamic, so that was just an example of using xpath
    Last edited by alcstrategy; 08-02-2012, 10:23 AM.

    • kazymjir
      Confirmed User
      • Oct 2011
      • 411

      #3
      Code:
      $ cat test.txt
      <h3>search engine links</h3>
      <a href="http://google.com">google</a>
      <a href="http://www.bing.com">bing</a>
      <a href="http://www.yahoo.com">yahoo</a>
      <h3>payment links</h3>
      <a href="http://www.paypal.com">paypal</a>
      <a href="http://www.paxum.com">paxum</a>
      $ sed -e '/<h3>payment/,/<\/h3>/ d' -e '/<h3>/ d' test.txt
      <a href="http://google.com">google</a>
      <a href="http://www.bing.com">bing</a>
      <a href="http://www.yahoo.com">yahoo</a>
      $
      http://kazymjir.com/

      • Zoxxa
        Confirmed User
        • Feb 2011
        • 1026

        #4
        I would first extract all the "a href" tags with regex, xpath, or this: http://simplehtmldom.sourceforge.net/

        Then detect which urls contain search engine keywords or domains.

        Something like this (Typed out fast, did not test):

        Code:
        $href_array = array('<a href="http://google.com">google</a>', '<a href="http://www.bing.com">bing</a>', 'etc..');

        $search_engines = array('bing.com', 'google.com', 'etc...');

        // Collect every link that mentions one of the listed domains
        $final = array();
        foreach ($href_array as $link) {
            foreach ($search_engines as $site) {
                if (strpos($link, $site) !== FALSE) {
                    // SE link found; stop checking so it is added only once
                    $final[] = $link;
                    break;
                }
            }
        }

        echo '<pre>';
        print_r($final);
        Last edited by Zoxxa; 08-02-2012, 10:58 AM.
        [email protected]
        ICQ: 269486444
        ZoxEmbedTube - Build unlimited "fake" tubes with this easy 100% unencoded CMS!

        • kazymjir
          Confirmed User
          • Oct 2011
          • 411

          #5
          Zoxxa, sorry, but this makes no sense at all.

          If you already know all the search engine links ($search_engines array), why search for them?
          It's like "I *know* that a lightbulb and a toy car are inside this box, but I will check anyway".

          Also, what happens if test.txt contains a link that isn't in $search_engines?

          And why fire up PHP and do DOM/regexp processing when it can be done with a single sed command?
          Last edited by kazymjir; 08-02-2012, 11:10 AM.

          • alcstrategy
            Confirmed User
            • May 2012
            • 124

            #6
            if you wanted to use xpath u could use //a[following-sibling::h3[1]]
            but kazymjir's method is probably what you are looking for
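
A note for anyone trying the XPath idea from Python: the standard library's xml.etree.ElementTree implements only a subset of XPath and has no following-sibling axis, so this minimal sketch walks the sibling elements by hand instead. It assumes the fragment becomes well-formed XML once wrapped in a made-up <root> element, which holds for the sample but often not for real-world HTML. The names first_block_hrefs and SAMPLE are illustrative only, not from the thread.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
"""


def first_block_hrefs(fragment):
    """Return href values of the <a> siblings between the first <h3> and the next one."""
    # Wrap in a hypothetical <root> so the fragment has a single top-level element.
    root = ET.fromstring("<root>" + fragment + "</root>")
    hrefs = []
    headings_seen = 0
    for element in root:  # direct children, in document order
        if element.tag == "h3":
            headings_seen += 1
            if headings_seen > 1:
                break  # the second heading ends the first block
        elif element.tag == "a" and headings_seen == 1:
            hrefs.append(element.get("href"))
    return hrefs


if __name__ == "__main__":
    for href in first_block_hrefs(SAMPLE):
        print(href)
```

A third-party library with full XPath 1.0 support, such as lxml, should be able to evaluate the expression above directly; the hand-written walk is only a stand-in for the standard library.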

            • Zoxxa
              Confirmed User
              • Feb 2011
              • 1026

              #7
              Originally posted by kazymjir
              Zoxxa, sorry, but this makes no sense at all.
              To you.

              Originally posted by kazymjir
              If you already know all the search engine links ($search_engines array), why search for them? It's like "I *know* that a lightbulb and a toy car are inside this box, but I will check anyway".
              Because from his example list he doesn't. He had links with paypal / paxum as well. I suppose he could just select all links between <h3>search engine links</h3> and <h3>payment links</h3> with regex or with xpath, but my code would work with all links he grabs from any section anywhere. If he is only concerned with that block, then regex that part out or xpath.

              Originally posted by kazymjir
              Also, what happens if test.txt contains a link that isn't in $search_engines?
              Like I said, he could use regex to grab the block, or just keep a simple list of search engines he wants to extract.

              Originally posted by kazymjir
              And why fire up PHP and do DOM/regexp processing when it can be done with a single sed command?
              I code with php, not sed, so obviously my help would be provided with php. I didn't see your post so chill the fuck out allstar.
              Last edited by Zoxxa; 08-02-2012, 11:22 AM.

              • Barry-xlovecam
                It's 42
                • Jun 2010
                • 18083

                #8
                Quick and dirty (it should really run under strict), in a foreach loop:

                Code:
                Perl

                # @_ holds the input lines; FILE must be an output handle that is already open
                foreach (@_) { if (/href/i) { chomp; print FILE "$_\n"; } }

                • kazymjir
                  Confirmed User
                  • Oct 2011
                  • 411

                  #9
                  Zoxxa, notice that Fris provided an *example* input. It doesn't have to be search engines; it can be anything.
                  He wants to get the content between given <h3>s. He may not know what content is between them, so in this case your code is useless.
                  Search engines are only an example, as Fris said. The links could be totally random.

                  Originally posted by Zoxxa
                  I didn't see your post so chill the fuck out allstar.
                  Zoxxa, you should chill out. I didn't mean to attack you, I was just expressing my opinion.

                  • shake
                    frc
                    • Jul 2003
                    • 4663

                    #10
                    I'd use the scrapy python framework

                    http://scrapy.org/
                    Crazy fast VPS for $10 a month. Try with $20 free credit

                    • Zoxxa
                      Confirmed User
                      • Feb 2011
                      • 1026

                      #11
                      Originally posted by kazymjir
                      Zoxxa, notice that Fris provided an *example* input. It doesn't have to be search engines; it can be anything.
                      He wants to get the content between given <h3>s. He may not know what content is between them, so in this case your code is useless.
                      Search engines are only an example, as Fris said. The links could be totally random.

                      Zoxxa, you should chill out. I didn't mean to attack you, I was just expressing my opinion.

                      I apologize, I misread his post where it says "i only want to get the search engine links being".

                      I thought he was being literal and actually meant only urls being search engines, I did not read the text between the h3 tag or his last sed example which shows what he wants.

                      • Barry-xlovecam
                        It's 42
                        • Jun 2010
                        • 18083

                        #12
                        http://search.cpan.org/dist/libwww-perl/lwpcook.pod

                        If you need to get the file you can extract at the source ;)

                        or

                        #!/usr/bin/perl

                        use LWP::Simple qw(!head);
                        use HTML::SimpleLinkExtor;

                        # $page must already hold the URL to fetch; ->a lists the links from the <a> tags
                        my @links = HTML::SimpleLinkExtor->new->parse(get $page)->a;
                        Last edited by Barry-xlovecam; 08-02-2012, 11:51 AM.

                        • fris
                          Too lazy to set a custom title
                          • Aug 2002
                          • 55679

                          #13
                          Originally posted by Zoxxa
                          I apologize, I misread his post where it says "i only want to get the search engine links being".

                          I thought he was being literal and actually meant only urls being search engines, I did not read the text between the h3 tag or his last sed example which shows what he wants.
                          i don't want what's in the h3 tags, i just want to get the links after the h3 tag, but only those links, not the next <h3> block of links after those.

                          • u-Bob
                            there's no $$$ in porn
                            • Jul 2005
                            • 33063

                            #14
                            ugly, but it'll work... and less memory intensive than splitting the file:
                            Code:
                            open(my $fh, '<', 'stuff.txt') or die "Cannot open stuff.txt: $!";
                            my $h3 = 0;
                            while (<$fh>)
                            {
                              chomp;
                              # Skip the first <h3> line and stop reading at the second one
                              if (/<h3>/) { last if ++$h3 > 1; }
                              else { print "$_\n"; }
                            }
                            close $fh;

                            • fris
                              Too lazy to set a custom title
                              • Aug 2002
                              • 55679

                              #15
                              forgot to mention, i wanna specify the links based on the h3 tag, so <h3>payment links</h3> would grab the links under that h3 element ;)

                              • livexxx
                                Confirmed User
                                • May 2005
                                • 1201

                                #16
                                cut and paste into a file editor, replace all <a href=" with \t<a href="
                                cut and paste into excel, select column, done
                                http://www.webcamalerts.com for auto tweets for web cam operators

                                • fris
                                  Too lazy to set a custom title
                                  • Aug 2002
                                  • 55679

                                  #17
                                  Originally posted by livexxx
                                  cut and paste into a file editor, replace all <a href=" with \t<a href="
                                  cut and paste into excel, select column, done
                                  ya that's what i would end up doing ;)

                                  • u-Bob
                                    there's no $$$ in porn
                                    • Jul 2005
                                    • 33063

                                    #18
                                    Originally posted by fris
                                    forgot to mention, i wanna specify the links based on the h3 tag, so <h3>payment links</h3> would grab the links under that h3 element ;)
                                    quick mod:

                                    Code:
                                    open(FILE, 'stuff.txt');
                                    $h3 = 0;
                                    $h3str = "payment links";
                                    while(<FILE>)
                                    {
                                      chomp;
                                      if($_ =~ "<h3>$h3str</h3>"){$h3++;}
                                      elsif($_ =~ "<h3>"){$h3 = 0;}
                                      elsif($h3>0){print "$_\n";}
                                    }
                                    close FILE;
                                    this way it will even get the links if they are spread over multiple <h3>payment links</h3> blocks.
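
For comparison, the same line-by-line scan could be sketched in Python roughly like this. It assumes each heading sits alone on its line as a plain <h3>...</h3> with no attributes, as in the sample; the function name lines_under_heading and the SAMPLE_LINES data (with the payment heading repeated on purpose) are made up for illustration.

```python
SAMPLE_LINES = [
    '<h3>search engine links</h3>',
    '<a href="http://google.com">google</a>',
    '<h3>payment links</h3>',
    '<a href="http://www.paypal.com">paypal</a>',
    '<h3>block three</h3>',
    '<a href="http://gfy.com">gfy</a>',
    '<h3>payment links</h3>',
    '<a href="http://www.paxum.com">paxum</a>',
]


def lines_under_heading(lines, heading):
    """Yield the non-empty lines that sit under every <h3>heading</h3> block."""
    target = "<h3>{}</h3>".format(heading).lower()
    collecting = False
    for raw in lines:
        line = raw.strip()
        if line.lower().startswith("<h3"):
            # Any heading switches collection on or off, so repeated blocks work.
            collecting = line.lower() == target
        elif collecting and line:
            yield line


if __name__ == "__main__":
    for link in lines_under_heading(SAMPLE_LINES, "payment links"):
        print(link)
```

Because it is a generator over any iterable of lines, an open file object can be passed in place of the list, so the whole file never has to be held in memory.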

                                    • Barry-xlovecam
                                      It's 42
                                      • Jun 2010
                                      • 18083

                                      #19
                                      Code:
                                        if ($_ =~ m{<h3>\Q$h3str\E</h3>}i)
                                      Might work better

                                      Thank the gods -- a biz oriented thread

                                      • Brujah
                                        Beer Money Baron
                                        • Jan 2001
                                        • 22157

                                        #20
                                        Maybe something along this line:
                                        Code:
                                        echo preg_replace( '|.*</h3>(.*)<h3>.*|s', '$1', $input );

                                        • Brujah
                                          Beer Money Baron
                                          • Jan 2001
                                          • 22157

                                          #21
                                          I guess if multiple h3 blocks continue this will work:
                                          Code:
                                          echo preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

                                          • fris
                                            Too lazy to set a custom title
                                            • Aug 2002
                                            • 55679

                                            #22
                                            Originally posted by Brujah
                                            I guess if multiple h3 blocks continue this will work:
                                            Code:
                                            echo preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );
                                            where is the value for the link block stored to grab the certain block?

                                            • Brujah
                                              Beer Money Baron
                                              • Jan 2001
                                              • 22157

                                              #23
                                              Originally posted by fris
                                              where is the value for the link block stored to grab the certain block?
                                              assign it to a variable, ex. $link_block

                                              Code:
                                              $link_block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );
                                              fwiw, I prefer using xpath too like alcstrategy mentioned, but probably overkill depending on this need.
                                              Last edited by Brujah; 08-04-2012, 02:46 AM.
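
The regex idea can also be pointed at one named heading instead of the first block. Here is a rough Python sketch of that variation, not a translation of the preg_replace call above: it escapes the heading text, captures lazily, and stops at the next <h3> or at the end of the input. The function name block_for_heading and the DATA sample are illustrative assumptions.

```python
import re

DATA = """\
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
<h3>block three</h3>
<a href="http://gfy.com">gfy</a>
"""


def block_for_heading(html, heading):
    """Return the text between <h3>heading</h3> and the next <h3>, or None."""
    pattern = re.compile(
        # re.escape keeps characters in the heading from acting as regex syntax.
        r"<h3>\s*" + re.escape(heading) + r"\s*</h3>(.*?)(?=<h3>|\Z)",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(block_for_heading(DATA, "payment links"))
```

The lookahead leaves the following <h3> unconsumed, and the end-of-input alternative lets the last block in the file match as well.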

                                              • fris
                                                Too lazy to set a custom title
                                                • Aug 2002
                                                • 55679

                                                #24
                                                Originally posted by Brujah
                                                assign it to a variable, ex. $link_block

                                                Code:
                                                $link_block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );
                                                fwiw, I prefer using xpath too like alcstrategy mentioned, but probably overkill depending on this need.
                                                cause the input is coming from file_get_contents

                                                Code:
                                                $data = file_get_contents('links.txt');
                                                $block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );
                                                echo $block;
                                                that displays the 3rd link block every time

                                                • Brujah
                                                  Beer Money Baron
                                                  • Jan 2001
                                                  • 22157

                                                  #25
                                                  It displays the first block for me, but all I had to go on was your sample links.txt code above.

                                                  Code:
                                                  # cat links.txt
                                                  <h3>search engine links</h3>
                                                  <a href="http://google.com">google</a>
                                                  <a href="http://www.bing.com">bing</a>
                                                  <a href="http://www.yahoo.com">yahoo</a>
                                                  <h3>payment links</h3>
                                                  <a href="http://www.paypal.com">paypal</a>
                                                  <a href="http://www.paxum.com">paxum</a>
                                                  <h3>block three</h3>
                                                  <a href="http://gfy.com">gfy</a>
                                                  <a href="http://php.net">php</a>
                                                  
                                                  # php test.php
                                                  
                                                  <a href="http://google.com">google</a>
                                                  <a href="http://www.bing.com">bing</a>
                                                  <a href="http://www.yahoo.com">yahoo</a>
                                                  
                                                  
                                                  # cat test.php
                                                  <?php
                                                  
                                                  $data = file_get_contents('links.txt');
                                                  $block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );
                                                  echo $block;

                                                  • fris
                                                    Too lazy to set a custom title
                                                    • Aug 2002
                                                    • 55679

                                                    #26
                                                    ya see, the file is full of h3 sections. i wanna specify which one and get just those links, so it's not just 1 h3. u-Bob's works for this

                                                    it's chrome bookmarks, so each folder has an h3 heading for the folder name; i just wanna get the links for a given h3 folder name.
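
Since the input is a bookmarks export, an HTML parser may cope better than line matching when headings carry attributes or uppercase tags. Below is a minimal sketch with Python's standard html.parser that groups each link's href under the most recent h3 text. The SAMPLE markup is a made-up approximation of an exported bookmarks file, and BookmarkGrouper and links_by_heading are hypothetical names; nested folders are not handled specially, each link simply lands under the last heading seen.

```python
from collections import defaultdict
from html.parser import HTMLParser

SAMPLE = """\
<DT><H3 ADD_DATE="0">search engine links</H3>
<DL><p>
    <DT><A HREF="http://google.com" ADD_DATE="0">google</A>
    <DT><A HREF="http://www.bing.com" ADD_DATE="0">bing</A>
</DL><p>
<DT><H3 ADD_DATE="0">payment links</H3>
<DL><p>
    <DT><A HREF="http://www.paypal.com" ADD_DATE="0">paypal</A>
</DL><p>
"""


class BookmarkGrouper(HTMLParser):
    """Group <a href> values under the text of the most recent <h3>."""

    def __init__(self):
        super().__init__()
        self.links = defaultdict(list)  # heading text -> list of hrefs
        self._heading = None            # text of the most recent <h3>
        self._in_h3 = False
        self._h3_parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <H3> and HREF match too.
        if tag == "h3":
            self._in_h3 = True
            self._h3_parts = []
        elif tag == "a" and self._heading is not None:
            href = dict(attrs).get("href")
            if href:
                self.links[self._heading].append(href)

    def handle_data(self, data):
        if self._in_h3:
            self._h3_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
            self._heading = "".join(self._h3_parts).strip()


def links_by_heading(html):
    """Parse html and return a dict mapping heading text to its hrefs."""
    parser = BookmarkGrouper()
    parser.feed(html)
    parser.close()
    return dict(parser.links)


if __name__ == "__main__":
    for href in links_by_heading(SAMPLE).get("payment links", []):
        print(href)
```

Looking up a folder is then a plain dictionary access by heading text, and a missing folder can default to an empty list as in the usage lines.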

                                                    • Barry-xlovecam
                                                      It's 42
                                                      • Jun 2010
                                                      • 18083

                                                      #27
                                                      One idea...

                                                      http://www.perlmonks.org/?node_id=507660


                                                      what is the next starting <tag>text</tag>, and is it always the same?

                                                      • Brujah
                                                        Beer Money Baron
                                                        • Jan 2001
                                                        • 22157

                                                        #28
                                                        Ah ok. If you're still interested in a php solution, maybe this?

                                                        Code:
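                                                        if ( empty( $argv[1] ) ) die( 'Usage: php test.php keyword' . PHP_EOL );
                                                        $fp = fopen( 'links.txt', 'r' );
                                                        // Compare against false so a line containing only "0" does not end the loop early
                                                        while ( ( $line = fgets( $fp ) ) !== false )
                                                        {
                                                            // Find the <h3> heading that contains the keyword
                                                            if ( strpos( $line, '<h3>' ) !== false && strpos( $line, $argv[1] ) !== false )
                                                            {
                                                                // Echo every following line until the next <h3> or the end of the file
                                                                while ( ( $line = fgets( $fp ) ) !== false )
                                                                {
                                                                    if ( strpos( $line, '<h3>' ) !== false ) break 2;
                                                                    echo $line;
                                                                }
                                                            }
                                                        }
                                                        fclose( $fp );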
                                                        Output usage:

                                                        Code:
                                                        ~ $ php test.php
                                                        Usage: php test.php keyword
                                                        ~ $ php test.php search
                                                        <a href="http://google.com">google</a>
                                                        <a href="http://www.bing.com">bing</a>
                                                        <a href="http://www.yahoo.com">yahoo</a>
                                                        ~ $ php test.php pay   
                                                        <a href="http://www.paypal.com">paypal</a>
                                                        <a href="http://www.paxum.com">paxum</a>
                                                        ~ $ php test.php bleh
                                                        <a href="http://php.net">php</a>
                                                        <a href="http://nginx.org">nginx</a>
                                                        
                                                        ~ $ php test.php 'search engine links'
                                                        <a href="http://google.com">google</a>
                                                        <a href="http://www.bing.com">bing</a>
                                                        <a href="http://www.yahoo.com">yahoo</a>
                                                        ~ $ php test.php 'payment links'      
                                                        <a href="http://www.paypal.com">paypal</a>
                                                        <a href="http://www.paxum.com">paxum</a>

                                                        • livexxx
                                                          Confirmed User
                                                          • May 2005
                                                          • 1201

                                                          #29
                                                          you could have done it in Excel by now, or registered a website that parses Chrome bookmarks with PHP's preg_match

                                                          is it just Chrome bookmarks? I'll make a damn site just to stop reading this
                                                          http://www.webcamalerts.com for auto tweets for web cam operators

                                                          • Brujah
                                                            Beer Money Baron
                                                            • Jan 2001
                                                            • 22157

                                                            #30
                                                            Originally posted by livexxx
                                                            you could have done it in Excel by now, or registered a website that parses Chrome bookmarks with PHP's preg_match

                                                            is it just Chrome bookmarks? I'll make a damn site just to stop reading this
                                                            I think it's more than just needing a quick solution and not having one yet. It's a way to learn from others' code and to think about different approaches. One of them might turn out to be more elegant, preferable, or faster, or show an approach you might not have chosen. Maybe a much better regex gets posted. That's usually how I view these code threads.

                                                            • fris
                                                              Too lazy to set a custom title
                                                              • Aug 2002
                                                              • 55679

                                                              #31
                                                              great posts guys ;)

                                                              • pornsprite
                                                                Confirmed User
                                                                • Dec 2009
                                                                • 1643

                                                                #32
                                                                This is untested, and you already have a lot of good examples, but I believe this way gives you more options on the command line.

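                                                                #!/usr/bin/perl
                                                                use strict;
                                                                use warnings;

                                                                # Print the lines from the first match of <start> up to the next match of <stop>.
                                                                die "Usage is $0 <start> <stop> <filename>\n" unless @ARGV == 3;

                                                                my ($start, $stop, $file) = @ARGV;

                                                                open(my $fh, '<', $file) or die "Could not open $file: $!\n";
                                                                my $printing = 0;
                                                                while (my $line = <$fh>) {
                                                                    chomp $line;
                                                                    $printing = 1 if $line =~ /$start/;     # start pattern seen: begin printing
                                                                    last if $printing && $line =~ /$stop/;  # stop pattern seen: quit without an error
                                                                    next if $line =~ /<h3>/;                # skip headings (the sample file uses <h3>)
                                                                    print "$line\n" if $printing;
                                                                }
                                                                close($fh);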
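
For anyone who prefers Python, here is a rough sketch of the same start/stop idea as the Perl script above. The function name `extract_section` and the `SAMPLE` data are made up for illustration (the sample is just the list from the first post), and it assumes the headings are `<h3>` lines as in that list. Treat it as one possible approach, not a polished tool.

```python
import re

# Sample input copied from the first post of the thread.
SAMPLE = """<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>""".splitlines()


def extract_section(lines, start, stop):
    """Collect lines from the first match of `start` until a later match of `stop`.

    `start` and `stop` are regular expressions. Heading lines (<h3>) are
    never included in the result.
    """
    collected = []
    printing = False
    for line in lines:
        line = line.rstrip("\n")
        if re.search(start, line):
            printing = True          # start pattern seen: begin collecting
        if printing and re.search(stop, line):
            break                    # stop pattern seen: we are done
        if "<h3>" in line:
            continue                 # skip the headings themselves
        if printing:
            collected.append(line)
    return collected


if __name__ == "__main__":
    for link in extract_section(SAMPLE, "search", "payment"):
        print(link)
```

Calling `extract_section(SAMPLE, "search", "payment")` should give back just the three search engine links. If the stop pattern never matches, everything after the start heading is returned, which is handy for the last section in a file.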
