Node.XXX - Porn Search Engine
A little project I have been working on is https://node.xxx
It's a search engine that indexes only adult websites and aggressively deals with spam sites, preventing them from being indexed. It also indexes only canonical sites, so no white labels get into the search index.

At the moment it's a little slow to respond to queries, but that will improve as new caching servers are deployed. It still has a way to go in development, but it is live and the current index is around 100 million pages. It supports complex queries, which are documented here.

The search engine is designed to scale without limit, and while it's currently crawling HTML, PDF, JSON, XML, RSS, video and images, it only returns text-based results right now. Later this year, once I've perfected image search, that will be rolled out, with video search to follow around March 2017.

Have a look and let me know what you think at https://node.xxx
How do we submit sites for indexing?
Looks good -- when you get decent traffic hit us up about buying some ads :thumbsup
So it only shows sites that are manually submitted to it?
There are a lot of crap sites on the adult web, and the focus of this search engine is to index only sites of a certain quality. It's not perfect yet, but it's getting better all the time.
What's the UA of the crawler so I can whitelist it?
Nice work. I'm working on something similar, but different. My own twists.
I regged and added some sites
If anyone is interested in learning more about how it all works, I have a dedicated Node support channel in my Slack team. Just visit Join the Adult Industry community on Slack! to get an auto invite. I'm happy to answer questions and get into technical detail about the architecture and how I've built it out.
I think it's great that you're always working on something new. Let's hope this one sticks and does well :thumbsup
We did something like that back in '04-'06.

A search engine is a huge engineering project, and an expensive one. AdultKing, even if you get your query times down to half, it's still slow. At some point in our SE project we decided to dump MySQL and wrote our own DB engine, which was far more efficient: a 30 MB database in MySQL weighed only 1.4 MB in our engine, and thanks to our own indexing tech, querying was ridiculously fast no matter how huge the database was. We could also show all results, not just up to 1,000 like every other SE did and does. Good memories and definitely great experience.

Our development took ~1.5 years, myself and another programmer working 12-16 hours a day, no weekends. The project was wrapped up due to lack of financing; we had the engine out of beta and were developing the webmaster area for buying ads, getting ready for marketing, when it stalled. It was written in Delphi/PHP.
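For anyone curious how a hand-rolled engine can be roughly 20x smaller than MySQL, the classic trick is delta-plus-varint encoding of sorted posting lists. Here is a minimal sketch in Python (illustrative only; the poster's engine was Delphi/PHP and its actual format is unknown):

```python
def varint_encode(n):
    """Encode a non-negative int, 7 bits per byte; high bit = 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(doc_ids):
    """Delta-encode a sorted posting list, then varint-encode each gap.
    Small gaps become single bytes, which is where the size win comes from."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += varint_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)

def decompress_postings(blob):
    """Inverse of compress_postings: decode gaps and re-accumulate doc IDs."""
    ids, cur, acc, shift = [], 0, 0, 0
    for b in blob:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            cur += acc
            ids.append(cur)
            acc, shift = 0, 0
    return ids
```

A posting list of ~2,700 doc IDs spaced 37 apart compresses to roughly one byte per ID, versus the 4-8 bytes a generic fixed-width row store would spend.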
It's not going to stick if query times stay as long as they are now. The current average time for results to be returned is 5 seconds; I need to get it down to 1.2 seconds max, otherwise people just won't use it. There's also the challenge of keeping the index as spam-free as possible. I've been working on this project for quite a long time, and even launched a search engine years ago which didn't stick; the problems then were the limits of processing power and storage. Now things are better, with better infrastructure options available.
The architecture is all NoSQL; the crawler and search engine are written in C and borrow some of the concepts, but not the code, of Lucene. The ranking algorithm is adaptive and reprocesses the index twice a day. I have a development group of servers running where I am tuning the search portion, and currently have results within 1.8 seconds max, but I think 1.2 seconds is the sweet spot to make the thing usable.
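For readers unfamiliar with the Lucene concepts being borrowed, the central one is an inverted index: a map from each term to the documents containing it, queried by intersecting posting lists and scoring the survivors. A toy sketch in Python (not Node.XXX's actual C code; all names are invented, and the scoring is a crude stand-in for real tf-idf):

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: term -> {doc_id: term_frequency}."""
    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, query):
        """AND semantics: intersect posting lists, then rank by summed term frequency."""
        terms = query.lower().split()
        if not terms or any(t not in self.postings for t in terms):
            return []
        docs = set(self.postings[terms[0]])
        for t in terms[1:]:
            docs &= set(self.postings[t])
        scored = [(sum(self.postings[t][d] for t in terms), d) for d in docs]
        return [d for _, d in sorted(scored, reverse=True)]
```

A real engine adds document-frequency weighting, field boosts and segment merging on top of this skeleton, but the intersect-then-score shape is the same.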
nice, good luck :thumbsup
We had a dynamically updated index cache for our search results. Alongside the crawler bots we had bots doing indexing (results caching, to be precise), which updated all the relevant indexes for a new page against existing search terms. This way the indexes were always up to date and results displayed very quickly.
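The incremental scheme described here, where indexing bots slot each newly crawled page into every cached result list it matches, can be pictured like this (a Python toy under invented names, not the original Delphi system):

```python
class ResultCache:
    """Cache of query -> ranked results, updated incrementally as pages are indexed."""
    def __init__(self, score_fn):
        self.cache = {}           # query -> list of (score, doc_id), best first
        self.score_fn = score_fn  # score_fn(query, doc) -> float, or None if no match

    def get(self, query):
        return [doc_id for _, doc_id in self.cache.get(query, [])]

    def put(self, query, ranked):
        self.cache[query] = sorted(ranked, key=lambda p: -p[0])

    def on_new_document(self, doc_id, doc):
        """Called by the indexing bots: merge the new page into every
        cached result list it scores against, keeping each list sorted."""
        for query, ranked in self.cache.items():
            score = self.score_fn(query, doc)
            if score is None:
                continue
            ranked.append((score, doc_id))
            ranked.sort(key=lambda p: -p[0])
```

The payoff is that popular queries are always served from an up-to-date cache; the cost is moved to index time, where each new page pays for the merges.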
The main reason I announced it on GFY tonight was in the hope that people could break it :)
Very nice, AK :thumbsup!
Open source has become very powerful; as long as you know how to plug and play libraries together, you can do a lot as just a single developer.
looks good :thumbsup:thumbsup:thumbsup
Looking good. When image and video search get added, I think that's when people will start using it.
Current priority is to speed up search results. :thumbsup
How do you rank the results? Seems arbitrary... resource pages show up before homepages, etc.
Can you please email me to discuss something higher level? haze at grandslammedia.com
What search engine are you guys using? I used Sphinx before, but it uses a lot of CPU, and I only had ~7 million records.
This is a cluster of nodes running individual components. Crawling is separate from indexing; ranking is separate from indexing. Crawls are performed through caching proxies on their own servers. You could do this with Nutch and Elasticsearch, but the overhead would be much greater than this system has.
Just curious, as a search engine site is on my todo list.
I was searching for "quick fuck" but the load was too slow...
A while ago I was involved in developing a search interface that had a massive index. From memory it was nearly 300 million documents, including web pages, social media posts, etc. It was mainly text.

Given that we had limited hardware (and hardware was less powerful than it is now), we focused primarily on the crawling/indexing stage. We did as much processing there as we could, in a way that was adaptable and could be redone after tweaks. From there we were able to shard the data accordingly, and with significant focus on parsing the search input we were able to avoid querying data that wasn't relevant to it. There was of course a fail-over that could be triggered, and we offered supplementary results where our algorithm wasn't sure it could cover the whole index.

Our benchmark for maintaining quality was this: get the first 1,000 results from the sharded/highly processed method within 90% similarity of querying the full index, and we did. So essentially we were returning nearly identical results whilst hitting, in some cases, only 1-2% of the index. For reference, a query returning 1 million results came back well under 0.5 seconds; most searches were basically instant.

Not sure if this is of any help to you; you're probably already doing or considering these methods. Good luck.
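The "avoid querying irrelevant data" approach described above amounts to keeping a term-to-shard map and fanning a query out only to the shards known to hold matching postings. A minimal sketch under invented names (Python for brevity; the original system's internals aren't public):

```python
class ShardedIndex:
    """Route each query only to shards whose term map says they hold matches."""
    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]  # term -> [doc_ids]
        self.term_map = {}                                 # term -> set of shard ids

    def add(self, doc_id, text):
        shard = doc_id % len(self.shards)  # naive placement by doc ID
        for term in set(text.lower().split()):
            self.shards[shard].setdefault(term, []).append(doc_id)
            self.term_map.setdefault(term, set()).add(shard)

    def search(self, term):
        """Return (sorted hits, number of shards actually queried)."""
        term = term.lower()
        hits, queried = [], 0
        for shard in self.term_map.get(term, ()):  # skip shards with no postings
            queried += 1
            hits.extend(self.shards[shard][term])
        return sorted(hits), queried
```

For rare terms this touches one shard out of many, which matches the post's claim of answering from 1-2% of the index; a production system would route on smarter keys than `doc_id % n`, but the map-then-fan-out shape is the same.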
Thanks for the great post. I had a bit of a hackathon with a couple of the other people helping me last night. We have managed to get query time down to an average of 0.9 seconds, down from an average of 5. We're using SERP caching methods as well as a refined index, so now we've started a full recrawl. I'll be rolling out the new version of the search engine in the next day or two.
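SERP caching, as mentioned here, is essentially a TTL cache keyed on the query string, sitting in front of the slow index lookup. A minimal sketch (hypothetical names; Python for brevity, not the actual C implementation):

```python
import time

class SerpCache:
    """Tiny TTL cache for search-result pages (SERPs)."""
    def __init__(self, ttl_seconds, backend_search):
        self.ttl = ttl_seconds
        self.backend = backend_search  # query -> results; the slow path
        self.store = {}                # query -> (expires_at, results)
        self.hits = self.misses = 0

    def search(self, query):
        now = time.monotonic()
        entry = self.store.get(query)
        if entry and entry[0] > now:   # fresh cached copy: skip the backend
            self.hits += 1
            return entry[1]
        self.misses += 1
        results = self.backend(query)
        self.store[query] = (now + self.ttl, results)
        return results
```

Popular queries repeat heavily, so even a short TTL absorbs most of the load; a stale-but-fast answer is usually acceptable for a search results page.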
As promised a couple of days ago, a new version has been rolled out.

Tonight I activated the new search servers and caching. The index had to be rebuilt to accommodate the changes we made to indexing. Average search time is now down from 5 seconds to 0.9 seconds.

Submitting Sites

A number of webmasters have been submitting sites that just won't be indexed. In most of these cases the crawler has already been over your site before submission and discarded it for the reasons below.

Node.XXX only indexes canonical content. Node.XXX will NOT index embed tube sites, white label dating or white label cam sites. If your site has more than one popup, malware, more than 4 ads above the fold or a poor user experience, it will not be indexed.

Node.XXX is all about quality sites. The search engine is designed to filter out sites that are bad for the surfer. So if you have a lot of ads, more than one popup or a sneaky redirect, then your site just won't be indexed.
What about the other side of the equation? Meaning, how much money do you plan to spend on marketing, or will you just do guerrilla marketing?
Nice project! I tried it.

My comment: you need to filter www/http and subdomains. I got the exact same search results from different subdomains of tube8 (jp., .de and www.). Pretty nice though. Reminds me a bit of Free Porn Search Engine :: pornharmony.com; that site does some awesome matching-content search. The longer you search, the better the results are. Good luck!
Sad to see another SE loaded with tubes.
What should we do with the canonical tubes? Ban them from the index? They exist, people use them. We do proactively ban torrent sites and file lockers from the index, but where do we draw the line? Should we remove the tubes from the index too?
I submitted a few. We'll see what happens. :)
URL submissions to the site are checked algorithmically, not manually, except in a few cases.

However, I do see reports on submission failure rates, and a lot of submitted sites are rejected by the search engine because of too many ads, popups or popunders. If a surfer visits your site, clicks once and then has a popup take over their screen, your site just won't be included in the index. Likewise, if you have more than 4 ads above the fold, your site won't be included either.

The key decisions on whether a site is included on Node.XXX are:

1. Does the site provide a good user experience? Good
2. Does the site have too many ads? Bad
3. Does the site have takeover popups on click? Bad
4. Is the site spammy? Bad
5. Does the site embed large amounts of content from other sources? Bad
6. Is the site a white label? Exclusion
7. Is the site an embedding tube? Exclusion
8. Does the site have too many spammy links pointing to it? Bad
9. Does the site have malware or unwanted redirects? Exclusion
10. Does the site show different content to mobile and desktop users? Exclusion

Even if a site is included in the index, these considerations apply every time it is re-crawled. We also check sites from proxy nodes using various user agents, to be sure they aren't trying to fool Node.XXX.

So to sum up: a good user experience and original content will see a site included in the index. A bad user experience, spam, etc. will see the site excluded.
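The algorithmic check described above can be pictured as a rule engine over crawl findings. A sketch of the decision logic using the rules from this post (the field names and the spam threshold are invented for illustration; the real checks are obviously more involved):

```python
def assess_site(site):
    """Apply the inclusion rules listed in the post.
    `site` is a dict of hypothetical crawl findings.
    Returns (include: bool, reasons: list of rejection reasons)."""
    reasons = []
    # Hard exclusions (rules 6, 7, 9, 10)
    if site.get("is_white_label"):
        reasons.append("white label")
    if site.get("is_embed_tube"):
        reasons.append("embedding tube")
    if site.get("has_malware") or site.get("sneaky_redirects"):
        reasons.append("malware or unwanted redirects")
    if site.get("cloaks_mobile_vs_desktop"):
        reasons.append("different content for mobile vs desktop")
    if reasons:
        return False, reasons
    # Bad signals (rules 2, 3, 4, 8); any one is enough to reject
    if site.get("popups", 0) > 1:
        reasons.append("more than one popup")
    if site.get("ads_above_fold", 0) > 4:
        reasons.append("more than 4 ads above the fold")
    if site.get("spam_score", 0.0) > 0.5:  # threshold is made up
        reasons.append("spammy content or links")
    return (not reasons), reasons
```

Re-applying the same function on every re-crawl, as the post describes, means a site that adds popups after acceptance gets dropped at the next visit.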
Will surfers prefer it to Google?