> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, 20 December 2011 10:08 AM
> To: [email protected]
> Subject: Re: Runaway fetcher threads
>
> Hi,
>
> > Hi Markus,
> >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Monday, 19 December 2011 9:24 PM
> > > To: [email protected]
> > > Subject: Re: Runaway fetcher threads
> > >
> > > On Monday 19 December 2011 08:32:53 [email protected] wrote:
> > > > Hi,
> > > >
> > > > I've observed an interesting phenomenon that is not hard to reproduce
> > > > and that I think should not be happening:
> > > >
> > > > If you have N fetcher threads, inject, say, 2xN URLs of VERY large files
> > > > plus a few smaller files to fetch and run something that uses
> > > > org.apache.nutch.crawl.Crawl. The big files will take forever to download
> > > > and the threads will be killed. The process then will proceed to the
> > > > indexing stage. However, you will see fetcher threads output in the logs
> > > > intermixed with the output of the indexer. This shows that they were not
> > > > terminated properly (or at all?).
> > >
> > > Hi, what version are you running? Sounds like a old one. Can you try
> > > with a more recent version if that is the case?
> >
> > I am using 1.4 latest release.
>
> Then how can fetcher logs be `intermixed` with indexer logs? Or is this a
> local instance where you run multiple local jobs concurrently?

Yes, I am running Nutch in local mode. All output goes to one log file. But in this file, fetcher records appear after, or mixed with, the indexer records. This is what looks abnormal. By the time the indexer starts, the fetcher call must have returned (see the Crawl class). Evidently, some fetcher threads were left running.

> I've never seen fetcher and indexer output together in one log or part of a
> log (in that case it's running local).
>
> > > In anyway, if this is about evenly distributing files across fetch lists, this
> > > cannot be based on file size as it is unknown beforehand. That is only
> > > possible when recrawling large files with a modified generator and and updater
> > > that adds the Content-Length field as CrawlDatum metadata.
> >
> > No, this is not related to evenly distributing files across fetch lists.
> >
> > > > Regards,
> > > >
> > > > Arkadi
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
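
To illustrate the symptom described in the reply (a fetch call returns, the next stage starts, and leftover worker threads still write to the same log), here is a minimal Java sketch. It is not Nutch code: `RunawayThreadDemo`, `fetch`, `run`, and the shared `LOG` list are hypothetical names, and the slow download is simulated with `Thread.sleep`. It only assumes that the caller stops waiting after a timeout without interrupting and joining its workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class RunawayThreadDemo {
    // Shared "log file": every stage appends here, as in local mode.
    static final List<String> LOG = new CopyOnWriteArrayList<>();

    // Hypothetical stand-in for a fetch stage: starts one worker, waits up to
    // timeoutMs, then returns whether or not the worker has finished.
    static Thread fetch(long workMs, long timeoutMs, boolean interruptOnTimeout)
            throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(workMs);              // simulates a very slow download
                LOG.add("fetcher: finished download");
            } catch (InterruptedException e) {
                LOG.add("fetcher: interrupted, stopping");
            }
        }, "fetcher-0");
        worker.start();
        worker.join(timeoutMs);                    // stop waiting after the timeout
        if (worker.isAlive() && interruptOnTimeout) {
            worker.interrupt();                    // ask the worker to stop
            worker.join();                         // and wait until it really has
        }
        return worker;
    }

    static List<String> run(boolean interruptOnTimeout) throws InterruptedException {
        LOG.clear();
        Thread worker = fetch(500, 50, interruptOnTimeout);
        LOG.add("indexer: started");               // next stage begins right away
        worker.join();                             // only so the demo sees the full log
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(false));            // worker left running past the timeout
        System.out.println(run(true));             // worker interrupted and joined first
    }
}
```

With `interruptOnTimeout` set to `false`, the fetcher line lands in the log after the indexer line, which is the same interleaving reported above. With `true`, the worker is stopped before the next stage starts. Whether the actual fetcher behaves like the first case is exactly the open question in this thread.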

