Quote:
Originally Posted by JJE
A while ago I was involved in developing a search interface that had a massive index. From memory it was nearly 300m documents, including web pages, social media posts, etc. It was mainly text.
Given that we had limited hardware (and hardware was less powerful than it is now), we focused most of our attention on the crawling/indexing method. We tried to do as much processing as we could at that stage, in a way that was adaptable and could be 'redone' when we tweaked things. From there we were able to shard the data accordingly, and with significant focus on parsing the search input we were able to 'avoid' querying data that wasn't relevant to it. There was of course a fail-over that could be triggered, and we offered supplementary results: where our algorithm wasn't sure, it could query the whole index.
Our benchmark for maintaining quality was this: the sharded/highly processed method had to return its first 1,000 results within 90% similarity of what querying the full index would have returned, and it did. So essentially we were returning nearly identical results whilst, in some cases, only hitting 1-2% of the index. For reference, a query returning 1m results took well under 0.5s, and most searches were basically instant. Not sure if this is of any help to you; you're probably already doing or considering these methods. Good luck.
Thanks for the great post.
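If I've read your sharding approach right, the routing side is roughly something like the sketch below. This is just how I'm picturing it in Python, assuming a term-to-shard map built at index time and a simple confidence threshold for the fail-over; the names and numbers are mine, not yours:

```python
# Rough sketch of routing a parsed query to a subset of shards, with a
# fail-over to the full index when the router isn't confident enough.
# The term-to-shard map and the threshold value are assumptions for
# illustration, not anyone's actual implementation.

class ShardRouter:
    def __init__(self, term_to_shards, confidence_threshold=0.8):
        # term_to_shards: dict mapping an indexed term to the set of shard
        # ids that were assigned documents for that term at index time.
        self.term_to_shards = term_to_shards
        self.confidence_threshold = confidence_threshold

    def route(self, query_terms, all_shards):
        """Return (shard_ids, used_fallback) for a parsed query."""
        if not query_terms:
            return all_shards, True
        known = [t for t in query_terms if t in self.term_to_shards]
        # Confidence = fraction of query terms the router actually recognises.
        confidence = len(known) / len(query_terms)
        if confidence < self.confidence_threshold:
            # Fail-over: not sure enough, so query the whole index instead.
            return all_shards, True
        shard_ids = set()
        for term in known:
            shard_ids.update(self.term_to_shards[term])
        return shard_ids, False


# Toy example: only shards 2 and 5 hold documents for these terms,
# so the other 98 shards never get touched for this query.
router = ShardRouter({"python": {2, 5}, "search": {2}})
shards, fell_back = router.route(["python", "search"], all_shards=set(range(100)))
print(shards, fell_back)  # {2, 5} False
```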
I had a bit of a hackathon with a couple of the other people helping me last night. We've managed to get query time down from an average of 5 seconds to an average of 0.9 seconds.
We're using SERP caching methods as well as a refined index, and we've now started a full recrawl.
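For anyone following along, the SERP caching side is nothing exotic, basically a normalised-query-to-results map with a TTL. Here's a minimal sketch of the shape of it; the class name, TTL value and so on are placeholders rather than our actual code:

```python
import time


class SerpCache:
    """Tiny in-memory SERP cache: normalised query string -> (results, expiry)."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    @staticmethod
    def _key(query):
        # Normalise so "Foo  Bar" and "foo bar" hit the same cache entry.
        return " ".join(query.lower().split())

    def get(self, query):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        results, expires_at = entry
        if time.time() > expires_at:
            # Stale: drop it and force a fresh query against the index.
            del self._store[self._key(query)]
            return None
        return results

    def put(self, query, results):
        self._store[self._key(query)] = (results, time.time() + self.ttl)


# Usage: check the cache first, only hit the index on a miss.
cache = SerpCache(ttl_seconds=600)
if cache.get("example query") is None:
    results = ["doc1", "doc2"]      # stand-in for querying the real index
    cache.put("example query", results)
print(cache.get("example query"))   # ['doc1', 'doc2']
```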
I'll be rolling out the new version of the search engine in the next day or two.
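Also, out of curiosity about your 90% similarity benchmark: I'm assuming it was measured as something like top-k overlap between the sharded results and the full-index results? This is the kind of check I have in mind; the metric choice here is my assumption, not necessarily what you did:

```python
# Sketch of a top-k overlap check: compare the first 1,000 doc ids from the
# sharded query against the first 1,000 from a full-index query.

def topk_overlap(sharded_results, full_results, k=1000):
    """Fraction of the full-index top-k that also appears in the sharded top-k."""
    sharded_top = set(sharded_results[:k])
    full_top = set(full_results[:k])
    if not full_top:
        return 1.0
    return len(sharded_top & full_top) / len(full_top)


# Toy example: 9 of the top 10 full-index results also show up in the
# sharded results, so overlap is 0.9 (the 90% target).
full = [f"doc{i}" for i in range(10)]
sharded = full[:9] + ["doc99"]
print(topk_overlap(sharded, full, k=10))  # 0.9
```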