I have a similar need, with an additional requirement that the crawlDBs should be merged at the end. The best solution I could think of, so far, is having independent instances of Nutch.

Remi

On Mar 14, 2015 9:08 PM, "steve labar" <[email protected]> wrote:
> Hi,
>
> I have a use case where I need to define schedules for crawling of certain
> domains with Nutch. I'm having a hard time wrapping my head around how this
> would be set up. It looks to me that the way Nutch is designed, it runs as
> a single instance that can in itself handle a huge number of hosts.
>
> So let's say I have three organizations whose sites I will be crawling.
> Each organization will have its own set of seeds, configurations, and
> start and stop times for active crawling. Conceivably, each of these three
> organizations would have its own crawl jobs that get fired up based on the
> organization's defined schedule. Therefore, it is possible that two or
> more jobs will be running at the same time. Is this something that can be
> set up?
>
> Thank you,
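A minimal sketch of the independent-instances approach, assuming a Nutch 1.x install at `$NUTCH_HOME`; the org names, seed paths, and conf directories here are hypothetical placeholders, and the exact arguments of `bin/crawl` vary by Nutch version, so check `bin/crawl` with no arguments for your release:

```shell
#!/bin/sh
# Sketch: one independent Nutch instance per organization, crawlDBs merged at the end.
# Paths and org names (org-a, org-b) are hypothetical examples.
NUTCH_HOME=/opt/nutch

# Each org gets its own config dir, seed list, and crawl dir, so the two
# jobs share no state and can safely run at the same time.
for org in org-a org-b; do
  NUTCH_CONF_DIR=/etc/nutch/$org \
    "$NUTCH_HOME/bin/crawl" /data/$org/seeds /data/$org/crawl 2 &
done
wait  # block until both concurrent crawl jobs finish

# Merge the per-org crawlDBs into a single one (runs CrawlDbMerger).
"$NUTCH_HOME/bin/nutch" mergedb /data/merged/crawldb \
  /data/org-a/crawl/crawldb /data/org-b/crawl/crawldb
```

Per-organization schedules could then be ordinary cron entries that launch the corresponding instance, with the merge step run after all jobs have completed.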

