GoFuckYourself.com - Adult Webmaster Forum


AdultKing 07-19-2016 07:54 AM

Node.XXX - Porn Search Engine
 
A little project I have been working on is https://node.xxx

It's a search engine that only indexes adult websites and aggressively deals with spam sites, preventing them from being indexed.

It will also only index canonical sites, so no white labels get into the search index.

At the moment it's a little slow to respond to queries but that will improve as new caching servers are deployed.

It's still got a way to go in development but it is live and the current index is around 100 million pages. It supports complex queries which are documented here.

The search engine is infinitely scalable and while it's currently crawling html, pdf, json, xml, rss, video and images it's only returning text based results right now.

Later this year once I've perfected image search that will be rolled out, with video search to follow around March 2017.

Have a look and let me know what you think at https://node.xxx

Jigster715 07-19-2016 08:47 AM

How do we submit sites for indexing?

AdultKing 07-19-2016 08:50 AM

Quote:

Originally Posted by Jigster715 (Post 21043810)
How do we submit sites for indexing?

To add a site you need to register for an account and then use the Add Site feature in the user dashboard.

Barry-xlovecam 07-19-2016 08:52 AM

Looks good -- when you get decent traffic hit us up about buying some ads:thumbsup

Jigster715 07-19-2016 08:52 AM

Quote:

Originally Posted by AdultKing (Post 21043816)
To add a site you need to register for an account and then use the Add Site feature in the user dashboard.

Ah, ok. I will do that. So far, it is loading sites that scrape our sites and not the real sites.

AdultKing 07-19-2016 08:53 AM

Quote:

Originally Posted by Jigster715 (Post 21043828)
Ah, ok. I will do that. So far, it is loading sites that scrape our sites and not the real sites.

PM me the domains, that shouldn't be happening.

AdultKing 07-19-2016 08:54 AM

Quote:

Originally Posted by Barry-xlovecam (Post 21043825)
Looks good -- when you get decent traffic hit us up about buying some ads:thumbsup

Right now the main focus is on engineering. The goal is to get a search down to 1.2 seconds or less. At the moment it's a bit slow - but that will improve as things are refined.

Klen 07-19-2016 09:05 AM

So it shows only sites which are manually submitted to it ?

AdultKing 07-19-2016 09:10 AM

Quote:

Originally Posted by KlenTelaris (Post 21043855)
So it shows only sites which are manually submitted to it ?

No. It discovers sites automatically. However crawling the web isn't trivial, so the number of domains currently indexed is relatively small. Lots of sites discovered won't end up in the index. Examples of sites that the search engine will exclude are white label sites, mass embed tube sites (such as sites that just embed videos from the main tubes). Spammy sites are excluded and if a site has too many popups or any kind of sneaky redirects then they won't get indexed either.

There are a lot of crap sites on the adult web and the focus of this search engine is to only index sites of a certain quality. It's not perfect yet, but it's getting better all the time.

Klen 07-19-2016 09:31 AM

Quote:

Originally Posted by AdultKing (Post 21043858)
No. It discovers sites automatically. However crawling the web isn't trivial, so the number of domains currently indexed is relatively small. Lots of sites discovered won't end up in the index. Examples of sites that the search engine will exclude are white label sites, mass embed tube sites (such as sites that just embed videos from the main tubes). Spammy sites are excluded and if a site has too many popups or any kind of sneaky redirects then they won't get indexed either.

There are a lot of crap sites on the adult web and the focus of this search engine is to only index sites of a certain quality. It's not perfect yet, but it's getting better all the time.

How exactly will you determine which one is a real tube and which one is an embed tube?

redwhiteandblue 07-19-2016 09:31 AM

What's the UA of the crawler so I can whitelist it?

AdultKing 07-19-2016 09:33 AM

Quote:

Originally Posted by redwhiteandblue (Post 21043909)
What's the UA of the crawler so I can whitelist it?

NodeBot 1.0/G
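
For anyone whitelisting on that string, here is a minimal sketch of a User-Agent check in Python. Only the `NodeBot 1.0/G` token comes from this thread; the function name `is_nodebot` and the sample header values are made up for illustration. A User-Agent header is easy to spoof, so treat this as a first filter rather than proof of identity.

```python
NODEBOT_TOKEN = "NodeBot 1.0/G"  # UA string quoted in the thread


def is_nodebot(user_agent):
    """Return True if the User-Agent header contains the NodeBot token.

    The comparison is case-insensitive and tolerates a missing header (None).
    """
    return NODEBOT_TOKEN.lower() in (user_agent or "").lower()


# Hypothetical header values, for illustration only.
print(is_nodebot("NodeBot 1.0/G"))
print(is_nodebot("Mozilla/5.0 (X11; Linux x86_64)"))
```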

teg0 07-19-2016 10:06 AM

Nice work. I'm working on something similar, but different. My own twists.

Nicky 07-19-2016 10:08 AM

I regged and added some sites

AdultKing 07-19-2016 10:13 AM

Quote:

Originally Posted by teg0 (Post 21043957)
Nice work. I'm working on something similar, but different. My own twists.

It's an expensive exercise rolling out a search engine.

If anyone is interested in learning more about how it all works, I have a dedicated Node support channel in my Slack team. Just visit the "Join the Adult Industry community on Slack!" page to get an auto invite.

I'm happy to answer questions and get into technical detail about how it all works and how I've built out the architecture.

Bladewire 07-19-2016 10:20 AM

I think it's great that you're always working on something new let's hope this one sticks and does well :thumbsup

Serge Litehead 07-19-2016 10:33 AM

we did something like that back in 04-06.

search engine is a huge engineering and expensive project.

AdultKing, even if you get your query time down to half, it is still slow. At some point while working on our SE project we decided to dump MySQL and wrote our own db engine, which was way more efficient than MySQL. For instance a 30 MB db in MySQL only weighed 1.4 MB in our engine. Querying was ridiculously fast too, no matter how huge the database was, thanks to our own indexing tech, and we could show all results, not just up to 1000 like every other SE did and does.

Good memories and definitely great experience. Our development took ~1.5 years between myself and another programmer working 12-16hrs a day no weekends.

The project was wrapped up due to lack of financing. We had the engine ready and out of beta, and were developing the webmaster area for buying ads and getting ready for marketing when it stalled.
It was written in Delphi/PHP.

AdultKing 07-19-2016 10:36 AM

Quote:

Originally Posted by Bladewire (Post 21044005)
I think it's great that you're always working on something new let's hope this one sticks and does well :thumbsup

Time will tell.

It's not going to stick if query times take as long as they do now.

Current average time for results to be returned is 5 seconds. I need to get it down to 1.2 seconds max. Otherwise people just won't use it.

There's also the challenge of ensuring that the index remains as spam free as possible.

I've been working on this project for quite a long time and even launched a search engine years ago which didn't stick - the problems with that were the limitations of processing power and storage - now things are better with better infrastructure options available.

AdultKing 07-19-2016 10:42 AM

Quote:

Originally Posted by holograph (Post 21044041)
we did something like that back in 04-06.

search engine is a huge engineering and expensive project.

AdultKing, even if you get your query time down to half, it is still slow. At some point while working on our SE project we decided to dump MySQL and wrote our own db engine, which was way more efficient than MySQL.

It's terribly expensive. Node is a cluster of 16 nodes at the moment and I'm adding another 16 this week.

The architecture is all NoSQL, the crawler and search engine are written in C and borrow some of the concepts, but not the code, of Lucene. The ranking algorithm is adaptive and reprocesses the index twice a day.

I have a development group of servers running where I am tuning the search portion, and I currently have results within 1.8 seconds max, but I think 1.2 seconds is the sweet spot to make the thing usable.

bns666 07-19-2016 10:47 AM

nice, good luck :thumbsup

Serge Litehead 07-19-2016 10:53 AM

we had a dynamically updated index cache for our search results.
alongside the crawler bots we had bots doing indexing (results caching, to be precise), which updated all relevant indexes for existing search terms whenever a new page came in. this way the indexes were always up to date and results displayed very quickly.
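
A rough sketch of that idea in Python, for readers who want something concrete: cached result lists are keyed by search term and refreshed at index time, so a query is just a lookup. The class name `ResultCache`, the single-word terms, and the term-frequency score are all assumptions made for illustration; the thread does not describe the actual engine at this level.

```python
class ResultCache:
    """Result lists per known search term, refreshed whenever a page is indexed."""

    def __init__(self):
        # term -> list of (score, url), highest score first
        self.results = {}

    def register_term(self, term):
        """Start tracking a term that users have searched for."""
        self.results.setdefault(term.lower(), [])

    def index_page(self, url, text):
        """Update every tracked term's cached results with this page."""
        words = text.lower().split()
        for term, hits in self.results.items():
            # Drop any stale entry for this URL before re-scoring it.
            hits[:] = [h for h in hits if h[1] != url]
            score = words.count(term)  # assumed score: raw term frequency
            if score:
                hits.append((score, url))
                hits.sort(key=lambda h: (-h[0], h[1]))

    def search(self, term):
        """Query time is a dictionary lookup; no index scan happens here."""
        return [url for _, url in self.results.get(term.lower(), [])]


cache = ResultCache()
cache.register_term("widgets")
cache.index_page("a.example", "widgets widgets for sale")
cache.index_page("b.example", "cheap widgets")
print(cache.search("widgets"))
```

The trade-off is more work per crawled page in exchange for near-constant query time, which matches the "always up to date, displayed very quickly" behaviour described above.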

AdultKing 07-19-2016 10:59 AM

Quote:

Originally Posted by holograph (Post 21044113)
we had dynamically updated index cache for our search results
alongside the crawler bots we had bots doing indexing (results caching, to be precise), which updated all relevant indexes for existing search terms whenever a new page came in. this way the indexes were always up to date and results displayed very quickly.

I've got search caching built in but it's off at the moment while I work out some infrastructure issues.

The main reason I announced it on GFY tonight was in the hope that people could break it :)

Adnium_Ivana 07-19-2016 11:24 AM

Quote:

Originally Posted by AdultKing (Post 21043684)
A little project I have been working on is https://node.xxx

It's a search engine that only indexes adult websites and aggressively deals with spam sites, preventing them from being indexed.

It will also only index canonical sites, so no white labels get into the search index.

At the moment it's a little slow to respond to queries but that will improve as new caching servers are deployed.

It's still got a way to go in development but it is live and the current index is around 100 million pages. It supports complex queries which are documented here.

The search engine is infinitely scalable and while it's currently crawling html, pdf, json, xml, rss, video and images it's only returning text based results right now.

Later this year once I've perfected image search that will be rolled out, with video search to follow around March 2017.

Have a look and let me know what you think at https://node.xxx

Not an adult site but just tried searching for some and a) the speed in which the search came up is pretty impressive and b) I even found our ad network (Adnium & GSM) indexed on xbiz.com. Pretty impressive stuff you got going on here :thumbsup

teg0 07-19-2016 11:55 AM

Quote:

Originally Posted by AdultKing (Post 21043975)
It's an expensive exercise rolling out a search engine.

If anyone is interested in learning more about how it all works, I have a dedicated Node support channel in my Slack team. Just visit the "Join the Adult Industry community on Slack!" page to get an auto invite.

I'm happy to answer questions and get into technical detail about how it all works and how I've built out the architecture.

cool, joined

CaptainHowdy 07-19-2016 11:57 AM

Very nice, AK :thumbsup!

johnnyloadproductions 07-19-2016 07:40 PM

Open source has become very powerful, as long as you know how to plug and play libraries together you can do a lot as just a single developer.

sandman! 07-19-2016 08:02 PM

looks good :thumbsup:thumbsup:thumbsup

TheMaster 07-20-2016 07:53 AM

looking good, when image and video gets added, I think that's when people start using it

AdultKing 07-20-2016 08:00 AM

Quote:

Originally Posted by TheMaster (Post 21046081)
looking good, when image and video gets added, I think that's when people start using it

Yep. But baby steps first.

Current priority is to speed up search results. :thumbsup

rabbit 07-20-2016 08:57 AM

how do you rank the results? seems arbitrary... resource pages show up before homepage, etc.

AdultKing 07-20-2016 09:02 AM

Quote:

Originally Posted by rabbit (Post 21046282)
how do you rank the results? seems arbitrary... resource pages show up before homepage, etc.

Obviously I won't be providing the precise method of ranking results however the reason you see what you're seeing is that weighting on brand home pages is turned off at the moment. When I turn it on you'll see root domains of brands appear at the top of results.
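
Since the real ranking method is not disclosed, the following is only an illustration of what a switchable home-page weighting could look like. The multiplier `HOME_PAGE_BOOST`, the base scores, and the rule "a URL with an empty or `/` path is a home page" are all assumptions, not Node.XXX's algorithm.

```python
from urllib.parse import urlparse

HOME_PAGE_BOOST = 2.0  # hypothetical multiplier, not a value from the thread


def is_home_page(url):
    """Treat a URL whose path is empty or '/' as a site's home page (assumption)."""
    return urlparse(url).path in ("", "/")


def rank(results, boost_home_pages=True):
    """Order (url, base_score) pairs best first, optionally boosting home pages."""
    def final_score(item):
        url, score = item
        if boost_home_pages and is_home_page(url):
            score *= HOME_PAGE_BOOST
        return score

    return [url for url, _ in sorted(results, key=final_score, reverse=True)]


results = [
    ("https://brand.example/resources/links.html", 1.5),
    ("https://brand.example/", 1.0),
]
print(rank(results, boost_home_pages=False))  # resource page ranks first
print(rank(results, boost_home_pages=True))   # home page ranks first
```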

Hazlewood 07-20-2016 09:35 AM

Quote:

Originally Posted by AdultKing (Post 21043684)
A little project I have been working on is https://node.xxx

It's a search engine that only indexes adult websites and aggressively deals with spam sites, preventing them from being indexed.

It will also only index canonical sites, so no white labels get into the search index.

At the moment it's a little slow to respond to queries but that will improve as new caching servers are deployed.

It's still got a way to go in development but it is live and the current index is around 100 million pages. It supports complex queries which are documented here.

The search engine is infinitely scalable and while it's currently crawling html, pdf, json, xml, rss, video and images it's only returning text based results right now.

Later this year once I've perfected image search that will be rolled out, with video search to follow around March 2017.

Have a look and let me know what you think at https://node.xxx


can you please email me to discuss something higher level. haze at grandslammedia.com

freecartoonporn 07-20-2016 09:45 AM

what search engine you guys using ?

i used sphinx before but it uses cpu a lot. and i had only ~7 mil records

AdultKing 07-20-2016 09:52 AM

Quote:

Originally Posted by Hazlewood (Post 21046420)
can you please email me to discuss something higher level. haze at grandslammedia.com

There's a channel for live discussion of Node.XXX at Adult Industry Slack Team

Hazlewood 07-20-2016 09:53 AM

Quote:

Originally Posted by AdultKing (Post 21046468)
There's a channel for live discussion of Node.XXX at Adult Industry Slack Team

I don't want to join with my email. I want to discuss business with you in a private setting. Give me your details then. Thank you

AdultKing 07-20-2016 09:55 AM

Quote:

Originally Posted by freecartoonporn (Post 21046441)
what search engine you guys using ?

i used sphinx before but it uses cpu a lot. and i had only ~7 mil records

Sphinx won't do the job.

This is clustered nodes of individual components. Crawling is separate from Indexing. Ranking is separate from Indexing. Crawls are performed through caching proxies on their own servers.

You could do this with Nutch and ElasticSearch but the overhead would be much greater than this system has.

freecartoonporn 07-20-2016 11:41 AM

Quote:

Originally Posted by AdultKing (Post 21046486)
Sphinx won't do the job.

This is clustered nodes of individual components. Crawling is separate from Indexing. Ranking is separate from Indexing. Crawls are performed through caching proxies on their own servers.

You could do this with Nutch and ElasticSearch but the overhead would be much greater than this system has.

so you made custom search engine ? i mean not using sphinx / lucene ?

just curious. as search engine site is in my todo list .

Struggle4Bucks 07-20-2016 12:07 PM

I was searching for "quick fuck" but the load was too slow...

JJE 07-20-2016 09:01 PM

A while ago I was involved in developing a search interface that had a massive index. From memory it was nearly 300m documents including web pages, social media posts, etc. Was mainly text.

Given that we had limited hardware (and hardware was less powerful than it is now) we primarily focused our attention on the crawling/indexing method. We tried to do as much processing there as we could that was adaptable and could be 'redone' on tweaks. From there we were able to shard data accordingly and with significant focus on parsing search input we were able to 'avoid' querying data that wasn't relevant to the search input. There was of course a fail-over that could be triggered and we offered supplementary results, where our algorithm wasn't sure it could query the whole index.

Our benchmark for maintaining quality was this: get the first 1,000 results for the sharded/highly processed method within 90% similarity as if the full index was being queried, and we did. So essentially, we were returning nearly identical results whilst only hitting in some cases 1-2% of the index. For reference a 1m sized query result was well under 0.5s. Most searches were basically instant. Not sure if this is of any help to you, you're probably already doing or considering these methods. Good luck.
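
To make the benchmark concrete, here is a small Python sketch of one way to measure it. The post does not say exactly how "similarity" was computed, so plain top-k overlap is an assumption, and the function name `top_k_overlap` and the sample document IDs are illustrative.

```python
def top_k_overlap(sharded, full, k=1000):
    """Fraction of the full-index top-k that also appears in the sharded top-k."""
    full_top = full[:k]
    if not full_top:
        return 1.0  # nothing to compare, so nothing is missing
    sharded_top = set(sharded[:k])
    return sum(1 for doc in full_top if doc in sharded_top) / len(full_top)


# Hypothetical document IDs: the sharded query misses one of ten results.
full_results = list(range(10))
sharded_results = [0, 1, 2, 3, 4, 5, 6, 7, 8, 99]
overlap = top_k_overlap(sharded_results, full_results, k=10)
print(overlap >= 0.9)
```

This measure ignores ordering within the top k; a rank-aware comparison would be stricter.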

AdultKing 07-20-2016 09:08 PM

Quote:

Originally Posted by JJE (Post 21048313)
A while ago I was involved in developing a search interface that had a massive index. From memory it was nearly 300m documents including web pages, social media posts, etc. Was mainly text.

Given that we had limited hardware (and hardware was less powerful than it is now) we primarily focused our attention on the crawling/indexing method. We tried to do as much processing there as we could that was adaptable and could be 'redone' on tweaks. From there we were able to shard data accordingly and with significant focus on parsing search input we were able to 'avoid' querying data that wasn't relevant to the search input. There was of course a fail-over that could be triggered and we offered supplementary results, where our algorithm wasn't sure it could query the whole index.

Our benchmark for maintaining quality was this: get the first 1,000 results for the sharded/highly processed method within 90% similarity as if the full index was being queried, and we did. So essentially, we were returning nearly identical results whilst only hitting in some cases 1-2% of the index. For reference a 1m sized query result was well under 0.5s. Most searches were basically instant. Not sure if this is of any help to you, you're probably already doing or considering these methods. Good luck.


Thanks for the great post.

I had a bit of a hackathon last night with a couple of the other people helping me. We have managed to get the average query time down to 0.9 seconds, from an average of 5.

We're using SERP caching methods as well as a refined index, so now we've started a full recrawl.

I'll be rolling out the new version of the search engine in the next day or two.
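
For readers unfamiliar with SERP caching, a minimal sketch in Python: results for a normalized query are kept for a fixed time, so repeat searches skip the index entirely. The class name `SerpCache`, the 300-second TTL, and the normalization rule are assumptions for illustration, not details of the Node.XXX implementation.

```python
import time


class SerpCache:
    """Cache of search result pages keyed by normalized query, with a TTL."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self.entries = {}       # normalized query -> (expires_at, results)

    @staticmethod
    def normalize(query):
        """Lowercase and collapse whitespace so trivially different queries share an entry."""
        return " ".join(query.lower().split())

    def get_or_compute(self, query, run_search):
        """Return cached results, or call run_search(normalized_query) and cache them."""
        key = self.normalize(query)
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        results = run_search(key)
        self.entries[key] = (now + self.ttl, results)
        return results


def slow_search(query):
    # Stand-in for the real index lookup.
    return [query + " result"]


cache = SerpCache()
print(cache.get_or_compute("Example  Query", slow_search))
print(cache.get_or_compute("example query", slow_search))  # served from cache
```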

AdultKing 07-23-2016 08:37 AM

As promised a couple of days ago a new version has been rolled out.

Tonight I activated the new Search servers and caching.

The index had to be rebuilt to accommodate the changes we made to indexing.

Average search time is now down from 5 seconds to 0.9 seconds.

Submitting Sites

A number of webmasters have been submitting sites that just won't be indexed. In most of these cases the crawler has already been over your site before submission and discarded it for the reasons below.

Node.XXX only indexes canonical content.

Node.XXX will NOT index embed tube sites, white label dating or white label cam sites.

If your site has more than one popup, malware, more than 4 ads above the fold or a poor user experience it will not be indexed.

Node.XXX is all about quality sites. The search engine is designed to filter out sites that are bad for the surfer. So if you have a lot of ads or have more than one popup or a sneaky redirect then your site just won't be indexed.

Klen 07-24-2016 09:51 AM

What about the other side of the equation? Meaning, how much money do you plan to spend on marketing, or will you just do guerrilla marketing?

AdultKing 07-24-2016 10:01 AM

Quote:

Originally Posted by KlenTelaris (Post 21056383)
What about the other side of the equation? Meaning, how much money do you plan to spend on marketing, or will you just do guerrilla marketing?

There's a plan to market it, but the tech needs to be refined first. It's going to be a long, expensive process to get there. Suffice to say I'm not throwing money at this idea for the fun of it.

Smut-Talk 07-25-2016 08:27 AM

Nice project!

I tried it.
My Comment:
you need to filter www/http and subdomains...
got the exact same search results from different subdomains from tube8,
jp. and .de and www.

pretty nice though.

Reminds me a bit of Free Porn Search Engine :: pornharmony.com
that site does some awesome matching content search.
The longer you search, the better the results are.

Goodluck!

pornguy 07-25-2016 08:47 AM

Sad to see another SE loaded with tubes.

AdultKing 07-25-2016 09:56 AM

Quote:

Originally Posted by Smut-Talk (Post 21058381)
Nice project!

I tried it.
My Comment:
you need to filter www/http and subdomains...
got the exact same search results from different subdomains from tube8,
jp. and .de and www.

pretty nice though.

Actually that filtering does happen, but it takes a while for the searchable index to catch up. You'll find that those results will disappear in a few days and you'll only get results relevant to your location, e.g. if you're in Japan you'll see jp.whatever and in the US or globally you'll see domain.com
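
As an illustration of that behaviour, a short Python sketch that collapses duplicate hosts of one site down to a single result based on the visitor's country. The function name `pick_regional_result`, the two-letter country-code convention, and the fallback order are assumptions; the thread does not say how Node.XXX actually does it.

```python
def pick_regional_result(hosts, country_code, base_domain):
    """From duplicate hosts of one site, keep the visitor's regional host if it
    exists, otherwise fall back to the global host.

    hosts must be a non-empty collection of host names.
    """
    regional = f"{country_code.lower()}.{base_domain}"
    if regional in hosts:
        return regional
    for candidate in (base_domain, f"www.{base_domain}"):
        if candidate in hosts:
            return candidate
    return sorted(hosts)[0]  # last resort: a deterministic pick


hosts = {"jp.site.example", "de.site.example", "www.site.example"}
print(pick_regional_result(hosts, "JP", "site.example"))
print(pick_regional_result(hosts, "US", "site.example"))
```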



Quote:

Reminds me a bit of Free Porn Search Engine :: pornharmony.com
that site does some awesome matching content search.
The longer you search, the better the results are.

Goodluck!

Thanks.

AdultKing 07-25-2016 10:03 AM

Quote:

Originally Posted by pornguy (Post 21058390)
Sad to see another SE loaded with tubes.

We don't index embedding tubes, or the spammy crap tubes that don't host their own content.

What should we do with the canonical tubes ? Ban them from the index ? They exist, people use them. We do proactively ban torrent sites and file lockers from the index, but where do we draw the line ? Should we remove the tubes from the index too ?

HowlingWulf 07-25-2016 11:46 AM

I submitted a few. We'll see what happens. :)

AdultKing 07-25-2016 12:04 PM

URL submissions to the site are algorithmically checked and not manually checked except for a few cases.

However I do see reports on submission failure rates and there have been a lot of sites submitted that are rejected by the search engine because of too many ads or too many popups or popunders.

If a surfer visits your site, clicks once and then has a popup take over their screen then your site just won't be included in the index. Likewise if you have more than 4 ads above the fold then your site also won't be included in the index.

The key decisions of whether a site is included on Node.XXX are:

1. Does the site provide a good user experience ? Good
2. Does the site have too many ads ? Bad
3. Does the site have takeover popups on click ? Bad
4. Is the site spammy ? Bad
5. Does the site embed large amounts of content from other sources ? Bad
6. Is the site a white label ? Exclusion
7. Is the site an embedding tube ? Exclusion
8. Does the site have too many spammy links pointing to it ? Bad
9. Does the site have malware or unwanted redirects ? Exclusion
10. Does the site show different content to mobile and desktop users? Exclusion

Even if a site is included in the index, every time it is re-crawled these considerations apply. We also check the site from proxy nodes using various User Agents to be sure that sites aren't trying to fool Node.XXX

So to sum up. Good user experience and original content will see a site included in the index. Bad user experience or spam etc will see the site excluded.
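
The checklist above can be sketched as a simple rule filter in Python. The two thresholds (more than one popup, more than 4 ads above the fold) come from this thread; the `SiteReport` fields, the function names, and the choice to treat every "Bad" signal as an outright rejection are assumptions, since the thread does not say how "Bad" and "Exclusion" are combined.

```python
from dataclasses import dataclass

MAX_POPUPS = 1          # more than one popup is rejected
MAX_ADS_ABOVE_FOLD = 4  # more than 4 ads above the fold is rejected


@dataclass
class SiteReport:
    """Hypothetical crawl summary for one site; every field name is an assumption."""
    popups: int = 0
    ads_above_fold: int = 0
    is_spammy: bool = False
    is_white_label: bool = False
    is_embedding_tube: bool = False
    has_malware_or_redirects: bool = False
    cloaks_by_device: bool = False


def rejection_reasons(site):
    """Return the list of reasons a site would be kept out of the index."""
    checks = [
        (site.popups > MAX_POPUPS, "too many popups"),
        (site.ads_above_fold > MAX_ADS_ABOVE_FOLD, "too many ads above the fold"),
        (site.is_spammy, "spammy"),
        (site.is_white_label, "white label"),
        (site.is_embedding_tube, "embedding tube"),
        (site.has_malware_or_redirects, "malware or unwanted redirects"),
        (site.cloaks_by_device, "different content for mobile and desktop"),
    ]
    return [reason for failed, reason in checks if failed]


def should_index(site):
    """A site is indexed only if no check fails."""
    return not rejection_reasons(site)


print(should_index(SiteReport(ads_above_fold=4)))
print(rejection_reasons(SiteReport(ads_above_fold=5, is_white_label=True)))
```

Because the same checks apply on every re-crawl, the function would be run again each time a site is revisited.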

Paul Markham 07-25-2016 10:51 PM

Will surfers prefer it to Google?


All times are GMT -7.