Hi,

> Hi Markus,
>
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Monday, 19 December 2011 9:24 PM
> > To: [email protected]
> > Subject: Re: Runaway fetcher threads
> >
> > On Monday 19 December 2011 08:32:53 [email protected] wrote:
> > > Hi,
> > >
> > > I've observed an interesting phenomenon that is not hard to reproduce
> > > and that I think should not be happening:
> > >
> > > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> > > files plus a few smaller files to fetch and run something that uses
> > > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > > download and the threads will be killed. The process then will
> > > proceed to the indexing stage. However, you will see fetcher threads
> > > output in the logs intermixed with the output of the indexer. This
> > > shows that they were not terminated properly (or at all?).
> >
> > Hi, what version are you running? Sounds like an old one. Can you try
> > with a more recent version if that is the case?
>
> I am using 1.4 latest release.

Then how can fetcher logs be `intermixed` with indexer logs? Or is this a
local instance where you run multiple local jobs concurrently? I've never
seen fetcher and indexer output together in one log or part of a log (in
that case it's running local).

>
> > In any case, if this is about evenly distributing files across fetch
> > lists, this cannot be based on file size as it is unknown beforehand.
> > That is only possible when recrawling large files with a modified
> > generator and updater that adds the Content-Length field as CrawlDatum
> > metadata.
>
> No, this is not related to evenly distributing files across fetch lists.
>
> > > Regards,
> > >
> > > Arkadi
> >
> > --
> > Markus Jelsma - CTO - Openindex

