Hi,

> Hi Markus,
> 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Monday, 19 December 2011 9:24 PM
> > To: [email protected]
> > Subject: Re: Runaway fetcher threads
> > 
> > On Monday 19 December 2011 08:32:53 [email protected] wrote:
> > > Hi,
> > > 
> > > I've observed an interesting phenomenon that is not hard to reproduce
> > > and that I think should not be happening:
> > > 
> > > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> > > files plus a few smaller files to fetch and run something that uses
> > > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > > download and the threads will be killed. The process then will proceed
> > > to the indexing stage. However, you will see fetcher threads output in
> > > the logs intermixed with the output of the indexer. This shows that
> > > they were not terminated properly (or at all?).
> > 
> > Hi, what version are you running? Sounds like an old one. Can you try
> > with a more recent version if that is the case?
> 
> I am using 1.4, the latest release.

Then how can fetcher logs be `intermixed` with indexer logs? Or is this a 
local instance where you run multiple local jobs concurrently?

I've never seen fetcher and indexer output together in one log, or in part
of a log, unless the job is running in local mode.
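
For reference, here is a minimal sketch of one way such interleaving could
arise inside a single local JVM. It assumes the main thread merely stops
waiting for a slow worker after a timeout and moves on to the next stage
without actually stopping the worker. This is plain Java, not Nutch code;
the class name, method name, and log messages are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names come from Nutch.
class RunawayThreadDemo {

    // Runs a slow "fetcher" thread, waits at most timeoutMs for it, then
    // starts the "indexer" stage regardless. Returns the combined log.
    static List<String> runDemo(long workMs, long timeoutMs) {
        List<String> log = Collections.synchronizedList(new ArrayList<String>());

        Thread fetcher = new Thread(() -> {
            try {
                Thread.sleep(workMs);            // stands in for a very large download
                log.add("fetcher: finished");
            } catch (InterruptedException e) {
                log.add("fetcher: interrupted");
            }
        });
        fetcher.start();

        try {
            // join(timeout) only stops *waiting*; it does not stop the thread.
            fetcher.join(timeoutMs);
            log.add("indexer: started");         // next stage begins, fetcher still alive
            fetcher.join();                      // wait here only so the demo can show the order
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        log.add("indexer: done");
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        System.out.println(runDemo(300, 50));
    }
}
```

In this sketch the fetcher's message lands between the two indexer
messages, because nothing ever interrupted the worker. Calling
fetcher.interrupt() once the timeout expires would be one way to end it
before the next stage starts.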


> 
> > In any case, if this is about evenly distributing files across fetch
> > lists, this cannot be based on file size, as it is unknown beforehand.
> > That is only possible when recrawling large files with a modified
> > generator and updater that add the Content-Length field as CrawlDatum
> > metadata.
> 
> No, this is not related to evenly distributing files across fetch lists.
> 
> > > Regards,
> > > 
> > > Arkadi
> > 
> > --
> > Markus Jelsma - CTO - Openindex
