Hi, I'm in the process of evaluating Ignite for use in an experimental distributed web crawler.
I would like to avoid a master/worker architecture and instead have each node pull URLs to crawl from a distributed queue, which is populated by the crawler instances themselves. Ignite's queue seems to be working fine for this.

I'd also like to ensure that previously seen links are never crawled more than once. The Ignite Set sounds like the right place to start, but I also wondered whether just the Cache would work here?

Related to this: would it be possible to deterministically route URLs entering the system to a node, so that the "already visited" links could be managed by a node-local set (e.g. a ConcurrentHashMap-backed set) instead of a distributed one? Or maybe this deterministic routing of Hash(URL) -> NodeX should happen when URLs are taken off the queue? Of course, if NodeX goes away due to problems, another node would need to take over processing for the same Hash(URL) values.

I also have performance concerns: too much network activity when putting to and taking from the queue, and the potentially many checks needed against the visited-URL set.

Thank you for any thoughts or information on my questions,
VM

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Implementing-a-distributed-crawler-tp6654.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
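To make the queue-consumption idea concrete, here is a minimal sketch of the per-node crawl loop. It uses a local LinkedBlockingQueue as a stand-in for the distributed queue (IgniteQueue extends BlockingQueue, so the same take/put loop shape carries over to a real cluster); the fetchAndExtractLinks step is hypothetical and just fabricates child links.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlLoop {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the distributed queue; IgniteQueue extends BlockingQueue,
        // so a real node would obtain it via the Ignite API instead.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.put("http://example.com/");

        // Each crawler node loops like this: take a URL, fetch it,
        // and enqueue the links it discovers.
        String url = queue.take();
        for (String discovered : fetchAndExtractLinks(url)) {
            queue.put(discovered);
        }
        System.out.println("crawled=" + url + " queued=" + queue.size());
    }

    // Hypothetical fetch step; a real crawler would do an HTTP GET and parse HTML.
    static String[] fetchAndExtractLinks(String url) {
        return new String[] { url + "page1", url + "page2" };
    }
}
```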
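The deterministic Hash(URL) -> NodeX routing with failover can be sketched with rendezvous (highest-random-weight) hashing: every node scores each URL, the highest score owns it, and when a node leaves only the URLs it owned move to survivors. This is a self-contained illustration, not Ignite's actual affinity function; on a real cluster the built-in affinity mapping of a cache key to a node would serve the same purpose, with rebalancing on node failure handled for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UrlRouter {
    // Rendezvous hashing: each URL maps deterministically to one node, and
    // removing a node only reassigns the URLs that node owned.
    static String ownerOf(String url, List<String> nodes) {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String node : nodes) {
            // Mix node id and URL; String.hashCode is specified by the JLS,
            // so the mapping is stable across JVMs.
            long score = (node + "|" + url).hashCode();
            if (best == null || score > bestScore
                    || (score == bestScore && node.compareTo(best) < 0)) {
                best = node;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("nodeA", "nodeB", "nodeC");
        String url = "http://example.com/page";
        String owner = ownerOf(url, all);

        // Remove one node that is NOT the owner: this URL's ownership is
        // unchanged, so only the departed node's URLs need taking over.
        List<String> survivors = new ArrayList<>(all);
        for (String n : all) {
            if (!n.equals(owner)) { survivors.remove(n); break; }
        }
        System.out.println(owner.equals(ownerOf(url, survivors)));
    }
}
```

Routing at enqueue time versus dequeue time then becomes the same computation, just applied at a different point; dequeue-time routing keeps the queue itself node-agnostic.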
