Quote:
Originally Posted by Killswitch
First you use php to grab the source code of the page, then use the regex to browse that code to strip out the email addresses.
To elaborate on what he said, what you'd normally do is set up a script that takes a URL as input, downloads whatever is at that URL (usually the homepage), pulls out all links with one regexp and saves them, then pulls out all email addresses with another regexp and saves those too. It then uses the links it found to pick new pages to repeat the process on: only those on the same domain if you just want the email addresses on that site, or all of them if you want to keep finding new addresses on new sites forever.
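A rough sketch of the extraction step only, in Python rather than PHP. It works on a string of HTML you have already downloaded and does no fetching itself. The function name `extract`, the `same_domain_only` flag and both regexps are my own illustration, not anything from the posts above, and the patterns are deliberately simple, so they will miss or mis-match some real-world markup and addresses.

```python
import re
from urllib.parse import urljoin, urlparse

# Deliberately simple patterns (assumptions for illustration only).
# Real HTML and real email addresses are messier than this.
LINK_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')


def extract(html, base_url, same_domain_only=True):
    """Return (links, emails) found in html, each as a sorted list."""
    base_host = urlparse(base_url).netloc
    links = set()
    for href in LINK_RE.findall(html):
        # Resolve relative links such as "/about" against the page URL.
        url = urljoin(base_url, href)
        parsed = urlparse(url)
        # Skip mailto:, javascript: and other non-web schemes.
        if parsed.scheme not in ("http", "https"):
            continue
        # Optionally keep only links on the same host as the page.
        if same_domain_only and parsed.netloc != base_host:
            continue
        links.add(url)
    emails = set(EMAIL_RE.findall(html))
    return sorted(links), sorted(emails)
```

The returned links are what you would queue up as the next pages to process. Using sets means duplicates on the same page are only recorded once.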
Personally, I'd use a language other than PHP for this, but really, it can be done in pretty much any programming language.
Set a bot like that loose on a big directory, and you'll eventually build up a list of millions of email addresses. Of course, others do the same thing as well, so the email addresses won't exactly be fresh.
Keep in mind that site owners might run email harvester traps: pages that generate a list of random invalid email addresses along with dynamic links back to themselves. If your harvester bot isn't protected against them, it will keep collecting new invalid addresses from them forever.