07-20-2016, 09:08 PM
AdultKing
Raise Your Weapon
 
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by JJE
A while ago I was involved in developing a search interface with a massive index. From memory it was nearly 300m documents, including web pages, social media posts and the like, and it was mainly text.

Given that we had limited hardware (and hardware was less powerful than it is now), we focused most of our attention on the crawling/indexing stage. We pushed as much processing as we could into that stage, in a form that was adaptable and could be 'redone' when we tweaked things. From there we were able to shard the data accordingly, and with a significant focus on parsing the search input we were able to avoid querying shards that weren't relevant to it. There was of course a fail-over that could be triggered: where our algorithm wasn't sure, it could query the whole index, and we offered supplementary results.

Our benchmark for maintaining quality was this: the first 1,000 results from the sharded/highly processed method had to be within 90% similarity of what a full-index query would return, and we achieved that. So essentially we were returning nearly identical results whilst hitting, in some cases, only 1-2% of the index. For reference, a query returning 1m results came back in well under 0.5s, and most searches were basically instant. Not sure if this is of any help to you; you're probably already doing or considering these methods. Good luck.
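
A rough sketch of the shard-routing idea JJE describes above: do the heavy lifting at index time, remember which shards can possibly match each term, fan a query out only to those shards, and fail over to the whole index when the router isn't confident. Everything here (ShardRouter, the modulo placement, the 0.5 confidence threshold) is a hypothetical illustration, not their actual system:

```python
# Hypothetical sketch of query-routed sharding: terms are mapped to the
# shards that can contain them at index time, so a query only fans out
# to relevant shards, with a full-index fail-over when routing is unsure.

from collections import defaultdict

class ShardRouter:
    def __init__(self, num_shards):
        self.num_shards = num_shards
        # term -> set of shard ids holding at least one posting for it,
        # built during crawling/indexing (the "heavy processing" stage)
        self.term_to_shards = defaultdict(set)

    def index_document(self, doc_id, terms):
        shard = doc_id % self.num_shards   # trivial placement for the sketch
        for term in terms:
            self.term_to_shards[term].add(shard)
        return shard

    def route(self, query_terms, confidence_threshold=0.5):
        """Return the shard ids worth querying, or None to signal a
        fail-over to the whole index when routing looks unreliable."""
        known = [t for t in query_terms if t in self.term_to_shards]
        if not query_terms or len(known) / len(query_terms) < confidence_threshold:
            return None  # fail-over: query the full index
        shards = set()
        for term in known:
            shards |= self.term_to_shards[term]
        return shards

router = ShardRouter(num_shards=64)
router.index_document(12345, ["outback", "australia", "search"])
print(router.route(["outback", "search"]))   # small subset of the 64 shards
print(router.route(["zzz-unseen-term"]))     # None -> fail over to full index
```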
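
The 90%-similarity benchmark can be made concrete with a simple top-k overlap measure. This is just one plausible definition of 'similarity' between the sharded results and the full-index results, and topk_overlap is an invented name:

```python
# One plausible way to score the "90% similarity" benchmark: overlap
# between the top-k result lists from the sharded path and the full index.

def topk_overlap(sharded_ids, full_ids, k=1000):
    """Fraction of the full-index top-k that the sharded top-k also returned."""
    sharded_top = set(sharded_ids[:k])
    full_top = set(full_ids[:k])
    if not full_top:
        return 1.0
    return len(sharded_top & full_top) / len(full_top)

# Toy check: sharded results missing 5% of the full-index top 20.
full = list(range(20))
sharded = [d for d in full if d != 7]        # one result dropped
print(topk_overlap(sharded, full, k=20))     # 0.95 -> meets a 90% bar
```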

Thanks for the great post.

I had a bit of a hackathon with a couple of the other people helping me last night. We've managed to get average query time down from 5 seconds to 0.9 seconds.

We're using SERP caching as well as a refined index, and we've now started a full recrawl.
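
For anyone following along, SERP caching just means memoising the rendered results page for a normalised query so that repeat searches skip the index entirely. A minimal sketch of that idea, assuming an LRU with a TTL; the names and the eviction policy are illustrative, not necessarily how the engine described above does it:

```python
# Minimal SERP-cache sketch: normalise the query, use it as a cache key,
# and only hit the index on a miss. The LRU + TTL policy is an assumption,
# not necessarily what the engine described in this thread uses.

import time
from collections import OrderedDict

class SerpCache:
    def __init__(self, max_entries=10_000, ttl_seconds=300):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._cache = OrderedDict()  # normalised query -> (timestamp, results)

    @staticmethod
    def _normalise(query):
        # Collapse case/whitespace and sort terms so trivially different
        # spellings of the same search share one cache entry.
        return " ".join(sorted(query.lower().split()))

    def get_or_compute(self, query, search_fn):
        key = self._normalise(query)
        entry = self._cache.get(key)
        if entry and time.time() - entry[0] < self.ttl:
            self._cache.move_to_end(key)       # refresh LRU position
            return entry[1]                    # cache hit: no index query
        results = search_fn(query)             # cache miss: hit the index
        self._cache[key] = (time.time(), results)
        self._cache.move_to_end(key)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)    # evict least recently used
        return results

cache = SerpCache()
slow_search = lambda q: [f"result for {q}"]    # stand-in for the real index
print(cache.get_or_compute("Outback  search", slow_search))  # miss -> index
print(cache.get_or_compute("search outback", slow_search))   # hit, same key
```

The normalisation step matters as much as the cache itself: folding case, whitespace and term order into one key turns many superficially different queries into the same cache hit.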

I'll be rolling out the new version of the search engine in the next day or two.