I would first extract all the "a href" tags with a regex, XPath, or this library:
http://simplehtmldom.sourceforge.net/
Then check which of those links contain a search engine keyword or domain.
The check could look something like this (typed out fast, not tested):
Code:
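$href_array = array('<a href="http://google.com">google</a>', '<a href="http://www.bing.com">bing</a>', 'etc..');
$search_engines = array('bing.com', 'google.com', 'etc...');
$final = array(); // initialize so the output still works when nothing matches
foreach ($href_array as $link) {
    foreach ($search_engines as $site) {
        if (strpos($link, $site) !== false) {
            // SE link found; break so a link matching two sites is not added twice
            $final[] = $link;
            break;
        }
    }
}
// Escape the result so the browser shows the tags as text instead of rendering them
echo '<pre>' . htmlspecialchars(print_r($final, true)) . '</pre>';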
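
For the extraction step, here is a rough sketch using PHP's built-in DOM extension (DOMDocument and DOMXPath) instead of the library linked above. It also compares the host of each URL against the list rather than searching the whole tag, so a link that only mentions "google.com" in its path or link text is not picked up. The sample HTML and the host_matches() helper are made up for illustration, so adjust them to your data:

```php
<?php
// Hypothetical sample page; in practice this would be the HTML you fetched.
$html = '<p><a href="http://google.com/search?q=php">google</a> '
      . '<a href="http://www.bing.com">bing</a> '
      . '<a href="http://example.org/about-google.com">other</a></p>';

$search_engines = array('bing.com', 'google.com');

// Hypothetical helper: true when $host is $domain itself or a subdomain of it.
function host_matches($host, $domain) {
    $host = strtolower($host);
    if ($host === $domain) {
        return true;
    }
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}

// Parse the HTML; real-world markup is often invalid, so keep libxml quiet.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$final = array();

// Select every <a> element that has an href attribute.
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host)) {
        continue; // relative or malformed URL, no host to compare
    }
    foreach ($search_engines as $site) {
        if (host_matches($host, $site)) {
            $final[] = $href;
            break; // one match is enough
        }
    }
}

print_r($final);
```

With the sample HTML above, only the google.com and www.bing.com links should end up in $final; the example.org link is skipped because its host is not a search engine domain, even though "google.com" appears in its path.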