Hi, thanks for the reply. Can you please elaborate on the crawl speed problem to some extent?
I want to run multiple crawls together in different threads (not processes) in order to fetch pages from multiple domains in parallel, with a separate crawl database for each domain. I am doing this on a single machine with Hadoop in standalone mode. I am aware of the issue that Hadoop will serialize the Nutch jobs. Is there any other major problem that I should handle? A rough sketch of the setup I have in mind is at the bottom of this mail.

regards
Sourabh

On Mon, Nov 22, 2010 at 11:56 AM, Paul Dhaliwal <[email protected]> wrote:
> It is possible to do multiple separate crawls and then merge them
> together.
>
> However, you might run into crawl speed problems if you don't take into
> account how HTTP connections are managed in Nutch.
>
> HTH,
> Paul
>
> On Nov 21, 2010 10:14 PM, "Sourabh Kasliwal" <[email protected]>
> wrote:
> Is it possible to have multiple crawl databases in Nutch? I want a
> separate crawl database for each seed URL. If not, is there any better
> alternative to running many Nutch jobs in multiple threads?
> regards
> Sourabh
>
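Here is the sketch I mentioned above: one thread per domain, each with its own seed directory and crawl directory so the crawl databases stay separate. To sidestep the job serialization inside a single JVM, each thread launches its own "bin/nutch crawl" process (the Nutch 1.x command line); the domain names and paths below are just placeholders.

import java.io.IOException;

public class ParallelCrawls {

    public static void main(String[] args) throws InterruptedException {
        // One entry per domain; seeds/<domain> holds that domain's seed URLs.
        String[] domains = { "domain-a", "domain-b" };   // placeholders
        Thread[] workers = new Thread[domains.length];

        for (int i = 0; i < domains.length; i++) {
            final String domain = domains[i];
            workers[i] = new Thread(() -> crawl(domain), "crawl-" + domain);
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();   // wait for every crawl to finish
        }
    }

    private static void crawl(String domain) {
        try {
            // A separate -dir per domain gives each crawl its own
            // crawldb, linkdb and segments.
            Process p = new ProcessBuilder(
                    "bin/nutch", "crawl", "seeds/" + domain,
                    "-dir", "crawls/" + domain,
                    "-depth", "3", "-topN", "1000")
                .inheritIO()
                .start();
            int exit = p.waitFor();
            System.out.println(domain + " finished, exit code " + exit);
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}

If I understand Paul's point about HTTP connections correctly, the relevant knobs would be properties like fetcher.threads.fetch, fetcher.threads.per.host and fetcher.server.delay in conf/nutch-site.xml, which control how many fetcher threads run, how many connections hit a single host at once, and the delay between requests to the same host.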

