Hi Markus,

> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Monday, 19 December 2011 9:24 PM
> To: [email protected]
> Subject: Re: Runaway fetcher threads
>
> On Monday 19 December 2011 08:32:53 [email protected] wrote:
> > Hi,
> >
> > I've observed an interesting phenomenon that is not hard to reproduce and
> > that I think should not be happening:
> >
> > If you have N fetcher threads, inject, say, 2xN URLs of VERY large files
> > plus a few smaller files to fetch and run something that uses
> > org.apache.nutch.crawl.Crawl. The big files will take forever to download
> > and the threads will be killed. The process then will proceed to the
> > indexing stage. However, you will see fetcher threads output in the logs
> > intermixed with the output of the indexer. This shows that they were not
> > terminated properly (or at all?).
>
> Hi, what version are you running? Sounds like an old one. Can you try
> with a more recent version if that is the case?

I am using 1.4, the latest release.

> In any case, if this is about evenly distributing files across fetch lists,
> this cannot be based on file size as it is unknown beforehand. That is only
> possible when recrawling large files with a modified generator and
> updater that adds the Content-Length field as CrawlDatum metadata.

No, this is not related to evenly distributing files across fetch lists.

> > Regards,
> >
> > Arkadi
>
> --
> Markus Jelsma - CTO - Openindex

