03-10-2006, 02:50 PM
PolySix
I use Velocityscape's webscraper. It is rather spendy, but it has a lot of bells and whistles. Its automation features are rather nice.

If you want something free and don't mind processing offline, you can use this from the Perl handbook:


Code:
#!/usr/bin/perl
use strict;
use warnings;

use HTML::LinkExtor;

my $FILENAME = 'file.html';
my $base_url;    # optional: set to the page's URL to make relative links absolute

# No callback is passed (undef), so the parser collects the links itself
my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse_file($FILENAME)
    or die "Cannot parse $FILENAME: $!";

my @links = $parser->links;
foreach my $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;    # element type, e.g. 'a' or 'img'

    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        if ($elt_type eq 'a' && $attr_name eq 'href') {
            print "ANCHOR: $attr_value\n";
        }
    }
}
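
If the saved page uses relative links, you will probably want them turned into full URLs. Here is a rough sketch of the same idea that parses an HTML string instead of a file and passes a base URL as the second argument to the constructor. The helper name extract_anchors, the sample HTML, and the example.com address are all made up for illustration; only HTML::LinkExtor itself comes from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::LinkExtor;

# Hypothetical helper (not part of the original script): returns the href
# of every <a> tag in $html, resolved against $base_url.
sub extract_anchors {
    my ($html, $base_url) = @_;

    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($html);
    $parser->eof;    # tell the parser there is no more input

    my @hrefs;
    foreach my $link ($parser->links) {
        my ($tag, %attrs) = @$link;    # tag name, then attribute => value pairs
        next unless $tag eq 'a' && defined $attrs{href};
        push @hrefs, "$attrs{href}";   # stringify the URI object
    }
    return @hrefs;
}

# Made-up sample page and base URL, for illustration only.
my $html = <<'HTML';
<a href="page2.html">Next</a>
<img src="logo.png">
<a href="http://example.org/">Elsewhere</a>
HTML

print "ANCHOR: $_\n"
    for extract_anchors($html, 'http://example.com/docs/index.html');
```

With a base URL supplied, the relative page2.html link should come back in absolute form, while the img tag is skipped because only a/href pairs are kept.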