Hi, I'm in the process of evaluating Ignite for use in an experimental distributed web crawler.
I would like to avoid a master/worker architecture and instead have each node pull URLs to crawl from a distributed queue, which is populated by the crawler instances themselves. Ignite's queue seems to be working fine for this.

I'd also like to ensure that previously seen links are never crawled more than once. The Ignite Set sounds like the right place to start, but I also wondered whether just the Cache would work here?

Related to this: would it be possible to deterministically route URLs entering the system to a node, so that the "already visited" links could be managed by a node-local set (e.g. a ConcurrentHashMap-backed set) instead of a distributed one? Or maybe this deterministic routing of Hash(URL) -> NodeX should happen when URLs are taken off the queue? Of course, if NodeX goes away due to problems, another node would need to take over processing for the same Hash(URL) values.

I also have performance concerns: too much network activity when putting to and taking from the queue, and the potentially many checks needed against the visited-URL set.

Thank you for any thoughts or information on my questions,
VM

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Implementing-a-distributed-crawler-tp6654.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
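To make the queue-consumption idea concrete, here is a minimal sketch of the per-node crawl loop. It uses a local LinkedBlockingQueue as a stand-in for the distributed queue (IgniteQueue extends BlockingQueue, so the same take/put loop shape carries over to a real cluster); the fetchAndExtractLinks step is hypothetical and just fabricates child links.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlLoop {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the distributed queue; IgniteQueue extends BlockingQueue,
        // so a real node would obtain it via the Ignite API instead.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.put("http://example.com/");

        // Each crawler node loops like this: take a URL, fetch it,
        // and enqueue the links it discovers.
        String url = queue.take();
        for (String discovered : fetchAndExtractLinks(url)) {
            queue.put(discovered);
        }
        System.out.println("crawled=" + url + " queued=" + queue.size());
    }

    // Hypothetical fetch step; a real crawler would do an HTTP GET and parse HTML.
    static String[] fetchAndExtractLinks(String url) {
        return new String[] { url + "page1", url + "page2" };
    }
}
```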
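The deterministic Hash(URL) -> NodeX routing with failover can be sketched with rendezvous (highest-random-weight) hashing: every node scores each URL, the highest score owns it, and when a node leaves only the URLs it owned move to survivors. This is a self-contained illustration, not Ignite's actual affinity function; on a real cluster the built-in affinity mapping of a cache key to a node would serve the same purpose, with rebalancing on node failure handled for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UrlRouter {
    // Rendezvous hashing: each URL maps deterministically to one node, and
    // removing a node only reassigns the URLs that node owned.
    static String ownerOf(String url, List<String> nodes) {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String node : nodes) {
            // Mix node id and URL; String.hashCode is specified by the JLS,
            // so the mapping is stable across JVMs.
            long score = (node + "|" + url).hashCode();
            if (best == null || score > bestScore
                    || (score == bestScore && node.compareTo(best) < 0)) {
                best = node;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("nodeA", "nodeB", "nodeC");
        String url = "http://example.com/page";
        String owner = ownerOf(url, all);

        // Remove one node that is NOT the owner: this URL's ownership is
        // unchanged, so only the departed node's URLs need taking over.
        List<String> survivors = new ArrayList<>(all);
        for (String n : all) {
            if (!n.equals(owner)) { survivors.remove(n); break; }
        }
        System.out.println(owner.equals(ownerOf(url, survivors)));
    }
}
```

Routing at enqueue time versus dequeue time then becomes the same computation, just applied at a different point; dequeue-time routing keeps the queue itself node-agnostic.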
