Initially I had the idea of developing this during Codebits, but I ended up in developing this for fun before the event, invalidating any idea of using it.
Due to the anonymity characteristics of the TOR network and the necessity of using it to access this TLD, very often this type of networks are called deep web.
The idea here was to crawl the .onion network, but instead of crawling and data mining its contents I just wanted to crawl its server’s relationships.
The stack is very simplistic, at the infrastructure level this was built using a main control node. Which runs node.js and a Redis instance.
Additionally there were multiple crawlers running a onion tweaked version of crawler4j, each crawler grabs the *.onion links in the html code and saves them (domains relationships) in Redis using Jedis. At the network level Polipo was used as a proxy and obviously tor client.
Each domain relationship is displayed in a graph which is rendered using sigma.js, all data is delivered to the browser using socket.io.
In two days, it crawled 1.5M urls finding and mapping relationships between 440 domains. Keep in mind that this crawling was done inside the tor network, which sometimes have very high latency times.
Finally here it is.