Old 08-02-2012, 12:09 PM  
kazymjir
Confirmed User
Join Date: Oct 2011
Location: Munich
Posts: 411
Quote:
Originally Posted by Zoxxa
I would first extract all the "a href" tags with regex, xpath, or this: http://simplehtmldom.sourceforge.net/

Then detect which URLs contain search-engine keywords or domains.

Something like this (typed out fast, did not test):

Code:
$href_array = array('<a href="http://google.com">google</a>', '<a href="http://www.bing.com">bing</a>', 'etc..');

$search_engines = array('bing.com', 'google.com', 'etc...');

$final = array(); // initialize, so print_r() works even when nothing matches
foreach($href_array as $link) {

	foreach($search_engines as $site){
		if(strpos($link, $site) !== FALSE){

			// SE link found
			$final[] = $link;
			break; // avoid duplicates if a link matches more than one domain
		}
	}

}

echo '<pre>';
print_r($final);
Zoxxa, sorry, but this makes no sense at all.

If you already know all the search-engine links (the $search_engines array), why search for them?
It's like saying "I *know* there's a lightbulb and a toy car inside this box, but I'll check anyway".

Also, what happens when test.txt contains a search-engine link that isn't in $search_engines?

And why fire up PHP and do DOM/regexp processing when a single sed command can do it?
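For illustration, a single GNU sed command along these lines could pull the search-engine links out of a file like test.txt (a sketch only; the file name comes from the thread and the domain list is assumed):

```shell
# Hypothetical one-liner: print href values containing google.com or bing.com.
# -n suppresses default output, -E enables extended regexes (GNU sed),
# and the trailing /p prints only lines where the substitution matched.
sed -nE 's/.*<a href="([^"]*(google|bing)\.com[^"]*)".*/\1/p' test.txt
```

Note this only catches one link per line of input; for messier real-world HTML a proper parser is still the safer route.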
__________________
http://kazymjir.com/

Last edited by kazymjir; 08-02-2012 at 12:10 PM.