Old 03-17-2009, 12:30 AM  
Libertine
sex dwarf
 
Join Date: May 2002
Posts: 17,860
Quote:
Originally Posted by Killswitch View Post
First you use php to grab the source code of the page, then use the regex to browse that code to strip out the email addresses.
To elaborate on what he said, what you'd normally do is set up a script that takes a URL as input, downloads whatever is at that URL (usually the homepage), pulls out all links with one regexp and saves them, then pulls out all email addresses with another regexp and saves those too. It then uses the links it found to pick new pages to repeat the process on - only those on the same domain if you just want the email addresses on that site, or all of them if you want to keep finding new email addresses on new sites forever.
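
For what it's worth, here's a rough sketch of just the extraction step in Python rather than PHP. It works on page source you already have as a string and leaves the download step out. The function name `extract`, both patterns and the sample markup are my own assumptions for illustration, and the regexps are deliberately naive, so they will miss or mangle plenty of real-world HTML and addresses.

```python
import re
from urllib.parse import urljoin, urlparse

# Deliberately simple, assumed patterns; real HTML and real addresses are messier.
LINK_RE = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def extract(html, base_url):
    """Return (same_domain_links, emails) found in one page's source."""
    base_host = urlparse(base_url).netloc
    links = set()
    for href in LINK_RE.findall(html):
        # Resolve relative links against the page they were found on.
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        # Keep only http(s) links that stay on the starting domain.
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            links.add(absolute)
    emails = set(EMAIL_RE.findall(html))
    return links, emails

# Hypothetical page source, standing in for whatever was downloaded.
sample = '''
<a href="/contact.html">Contact</a>
<a href="http://other.example.org/">Elsewhere</a>
<a href="mailto:info@example.com">info@example.com</a>
'''
links, emails = extract(sample, "http://example.com/index.html")
```

Using sets means an address or link that shows up several times on a page is only saved once. Dropping the `netloc` check is what would turn "this site only" into "everything it can reach".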

Personally, I'd go for a language other than PHP for this, but really, it can be done in pretty much any programming language.

Set a bot like that loose on a big directory, and you'll eventually build up a list of millions of email addresses. Of course, others do the same thing as well, so the email addresses won't exactly be fresh.

Keep in mind that site owners might run email harvester traps: pages that generate random invalid email addresses along with dynamic links back to themselves. If your harvester bot isn't protected against them, it will keep collecting new invalid addresses from them forever.
__________________
/(bb|[^b]{2})/