Hi, thanks for the reply. Can you please elaborate on the crawl speed problem to some extent?
I want to run multiple crawls together in different threads (not processes) in order to fetch pages from multiple domains in parallel, with a separate crawl database for each domain. I am doing this on a single machine with Hadoop in standalone mode. I am aware of the issue that Hadoop will serialize the Nutch jobs. Is there any other major problem that I should handle? A rough sketch of the setup I have in mind is at the bottom of this mail.

regards
Sourabh

On Mon, Nov 22, 2010 at 11:56 AM, Paul Dhaliwal <[email protected]> wrote:
> It is possible to do multiple separate crawls and then merge them
> together.
>
> However, you might run into crawl speed problems if you don't take into
> account how HTTP connections are managed in Nutch.
>
> HTH,
> Paul
>
> On Nov 21, 2010 10:14 PM, "Sourabh Kasliwal" <[email protected]>
> wrote:
> Is it possible to have multiple crawl databases in Nutch? I want a
> separate crawl database for each seed URL. If not, is there any better
> alternative to running many Nutch jobs in multiple threads?
> regards
> Sourabh
>
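Here is the sketch I mentioned above: one thread per domain, each with its own seed directory and crawl directory so the crawl databases stay separate. To sidestep the job serialization inside a single JVM, each thread launches its own "bin/nutch crawl" process (the Nutch 1.x command line); the domain names and paths below are just placeholders.

import java.io.IOException;

public class ParallelCrawls {

    public static void main(String[] args) throws InterruptedException {
        // One entry per domain; seeds/<domain> holds that domain's seed URLs.
        String[] domains = { "domain-a", "domain-b" };   // placeholders
        Thread[] workers = new Thread[domains.length];

        for (int i = 0; i < domains.length; i++) {
            final String domain = domains[i];
            workers[i] = new Thread(() -> crawl(domain), "crawl-" + domain);
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();   // wait for every crawl to finish
        }
    }

    private static void crawl(String domain) {
        try {
            // A separate -dir per domain gives each crawl its own
            // crawldb, linkdb and segments.
            Process p = new ProcessBuilder(
                    "bin/nutch", "crawl", "seeds/" + domain,
                    "-dir", "crawls/" + domain,
                    "-depth", "3", "-topN", "1000")
                .inheritIO()
                .start();
            int exit = p.waitFor();
            System.out.println(domain + " finished, exit code " + exit);
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}

If I understand Paul's point about HTTP connections correctly, the relevant knobs would be properties like fetcher.threads.fetch, fetcher.threads.per.host and fetcher.server.delay in conf/nutch-site.xml, which control how many fetcher threads run, how many connections hit a single host at once, and the delay between requests to the same host.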

