Hi, I'm prototyping a web crawler that uses Ignite as the crawl-db. I'd like to ensure the crawler obeys the Crawl-delay set in a site's robots.txt file. The way I have this set up now is by submitting "candidates" to an Ignite cache. A local listener is set up to receive successfully persisted items, and it then submits those items to a queue for a fetcher to pull from.
Goal: support a delay time plus a maximum fetch concurrency, per host, for every item. Put another way: "for each fetch item, ensure that requests made to the associated host are delayed as required, and that no more than n requests are made during each delayed run". This could be modeled as a Map<Host,DelayQueue>, or maybe even by using a ScheduledExecutorService where each task represents a host and is repeated according to the delay time. I'd like to prevent items from being put into the Java work queue if they are not yet ready to be fetched, and I'm slightly worried about the potential number of hosts (with respect to the size of the Java Map<Host,...> data structure).

So my question is: is there something that Ignite can provide to make this all work?

- Matt
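
P.S. To make the Map<Host,DelayQueue> idea concrete, here is a rough sketch using only java.util.concurrent (no Ignite involved). The names HostQueues, FetchItem, submit and pollReady are made up for illustration. It also assumes a single crawl delay and run size for all hosts, whereas a real crawler would take them per host from robots.txt. Slots are assigned at submit time, so a fetcher that polls late could see more than one run become ready back to back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical fetch candidate; becomes visible to poll() once readyAtNanos has passed. */
final class FetchItem implements Delayed {
    final String url;
    final long readyAtNanos;

    FetchItem(String url, long readyAtNanos) {
        this.url = url;
        this.readyAtNanos = readyAtNanos;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
}

/** Hypothetical per-host scheduler: at most maxPerRun items become ready per delay window. */
final class HostQueues {
    private static final class HostState {
        final DelayQueue<FetchItem> queue = new DelayQueue<>();
        long slotStartNanos = System.nanoTime(); // start of the run currently being filled
        int usedInSlot = 0;                      // items already assigned to that run
    }

    private final Map<String, HostState> hosts = new ConcurrentHashMap<>();
    private final long delayNanos;
    private final int maxPerRun;

    HostQueues(long delayMillis, int maxPerRun) {
        this.delayNanos = TimeUnit.MILLISECONDS.toNanos(delayMillis);
        this.maxPerRun = maxPerRun;
    }

    /** Assigns the item to the earliest run for its host that still has room. */
    void submit(String host, String url) {
        HostState s = hosts.computeIfAbsent(host, h -> new HostState());
        synchronized (s) {
            long now = System.nanoTime();
            if (now - s.slotStartNanos >= delayNanos) {
                // The last run is at least one delay old: a fresh run may start now.
                s.slotStartNanos = now;
                s.usedInSlot = 0;
            } else if (s.usedInSlot >= maxPerRun) {
                // The current run is full: push the item to the next run.
                s.slotStartNanos += delayNanos;
                s.usedInSlot = 0;
            }
            s.usedInSlot++;
            s.queue.add(new FetchItem(url, s.slotStartNanos));
        }
    }

    /** Returns up to maxPerRun URLs for the host whose delay has elapsed; never blocks. */
    List<String> pollReady(String host) {
        List<String> ready = new ArrayList<>();
        HostState s = hosts.get(host);
        if (s == null) {
            return ready;
        }
        FetchItem item;
        while (ready.size() < maxPerRun && (item = s.queue.poll()) != null) {
            ready.add(item.url);
        }
        return ready;
    }
}
```

With a 60-second delay and a run size of 2, submitting three URLs for one host makes two of them available to pollReady right away and holds the third back for the next run. The per-host map in this sketch is exactly the part I'm unsure will scale, hence the question.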
