On Monday 19 December 2011 08:32:53 [email protected] wrote:
> Hi,
>
> I've observed an interesting phenomenon that is not hard to reproduce and
> that I think should not be happening:
>
> If you have N fetcher threads, inject, say, 2xN URLs of VERY large files
> plus a few smaller files to fetch, and run something that uses
> org.apache.nutch.crawl.Crawl. The big files will take forever to download
> and the threads will be killed. The process will then proceed to the
> indexing stage. However, you will see fetcher thread output in the logs
> intermixed with the output of the indexer. This shows that the threads were
> not terminated properly (or at all?).
Hi, what version are you running? It sounds like an old one. If that is the
case, can you try with a more recent version? In any case, if this is about
evenly distributing files across fetch lists, the distribution cannot be
based on file size, as the size is unknown beforehand. That is only possible
when recrawling large files with a modified generator and an updater that
adds the Content-Length field as CrawlDatum metadata.

> Regards,
> Arkadi

--
Markus Jelsma - CTO - Openindex
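For illustration, once a recrawl has stored each file's Content-Length as metadata, balancing fetch lists by size reduces to a partitioning problem. The sketch below is a hypothetical, self-contained example of one common approach (greedy longest-processing-time assignment), not actual Nutch generator code; the class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FetchListPartitioner {

    // Greedy LPT partition: sort sizes descending, then assign each file
    // to whichever of the n fetch lists currently has the smallest total.
    // Returns the n lists of file sizes.
    static List<List<Long>> partition(long[] sizes, int n) {
        long[] sorted = sizes.clone();
        Arrays.sort(sorted); // ascending; we iterate from the end
        List<List<Long>> lists = new ArrayList<>();
        long[] totals = new long[n];
        for (int i = 0; i < n; i++) {
            lists.add(new ArrayList<>());
        }
        for (int i = sorted.length - 1; i >= 0; i--) {
            int lightest = 0;
            for (int j = 1; j < n; j++) {
                if (totals[j] < totals[lightest]) {
                    lightest = j;
                }
            }
            lists.get(lightest).add(sorted[i]);
            totals[lightest] += sorted[i];
        }
        return lists;
    }

    public static void main(String[] args) {
        // Two very large files plus a few small ones, split over 2 lists:
        // the big files end up on different lists, so neither list is
        // dominated by both slow downloads.
        long[] sizes = {900_000_000L, 850_000_000L, 10_000L, 8_000L, 5_000L};
        System.out.println(partition(sizes, 2));
    }
}
```

In a real generator this would partition CrawlDatum entries rather than bare sizes, and only entries that already carry a Content-Length from a previous fetch could participate; everything else would still need a fallback such as round-robin.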

