I have a similar need, with an additional requirement that the crawlDBs should be merged at the end. The best solution I could think of, so far, is having independent instances of Nutch.

Remi

On Mar 14, 2015 9:08 PM, "steve labar" <[email protected]> wrote:
> Hi,
>
> I have a use case where I need to define schedules for crawling of certain
> domains with Nutch. I'm having a hard time wrapping my head around how this
> would be set up. It looks to me that the way Nutch is designed, it runs as
> a single instance that can in itself handle a huge number of hosts.
>
> So let's say I have three organizations whose sites I will be crawling.
> Each organization will have its own set of seeds, configurations, and
> start and stop times for active crawling. Conceivably, each of these three
> organizations would have its own crawl jobs that get fired up based on the
> organization's defined schedule. Therefore, it is possible that two or
> more jobs will be running at the same time. Is this something that can be
> set up?
>
> Thank you,
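A minimal sketch of the independent-instances approach, assuming a Nutch 1.x install at `$NUTCH_HOME`; the org names, seed paths, and conf directories here are hypothetical placeholders, and the exact arguments of `bin/crawl` vary by Nutch version, so check `bin/crawl` with no arguments for your release:

```shell
#!/bin/sh
# Sketch: one independent Nutch instance per organization, crawlDBs merged at the end.
# Paths and org names (org-a, org-b) are hypothetical examples.
NUTCH_HOME=/opt/nutch

# Each org gets its own config dir, seed list, and crawl dir, so the two
# jobs share no state and can safely run at the same time.
for org in org-a org-b; do
  NUTCH_CONF_DIR=/etc/nutch/$org \
    "$NUTCH_HOME/bin/crawl" /data/$org/seeds /data/$org/crawl 2 &
done
wait  # block until both concurrent crawl jobs finish

# Merge the per-org crawlDBs into a single one (runs CrawlDbMerger).
"$NUTCH_HOME/bin/nutch" mergedb /data/merged/crawldb \
  /data/org-a/crawl/crawldb /data/org-b/crawl/crawldb
```

Per-organization schedules could then be ordinary cron entries that launch the corresponding instance, with the merge step run after all jobs have completed.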

