Any thoughts or explanations? This also happens in large fetch jobs on a 
Hadoop cluster. 300,000 URLs are downloaded within 30 minutes (including job 
setup), but then the job keeps `doing something` for another hour! Sometimes 
tasks simply fail after 10 minutes because they didn't report progress; the 
map task is then still busy merging intermediate segments, which takes a 
long time.
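If the long merge phase is what kills tasks, one workaround is to tune the Hadoop job configuration rather than the fetcher itself. The sketch below uses the standard Hadoop 0.20-era property names (the generation Nutch 1.4 runs on); the values are hypothetical starting points, not recommendations from this thread:

```xml
<!-- mapred-site.xml (or nutch-site.xml overrides): a hedged sketch.
     Values are illustrative and should be sized to your cluster. -->

<!-- Give tasks longer before they are killed for not reporting
     progress; the merge of intermediate spill files reports little. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes, in milliseconds -->
</property>

<!-- A larger in-memory sort buffer means fewer spill files are
     written, so there is less to merge when the map finishes. -->
<property>
  <name>io.sort.mb</name>
  <value>256</value>
</property>

<!-- Merge more spill files per pass, reducing merge rounds. -->
<property>
  <name>io.sort.factor</name>
  <value>100</value>
</property>
```

Raising `io.sort.mb` costs heap in each map task, so it has to fit inside the configured child JVM size; the timeout increase only hides the symptom but stops otherwise-healthy tasks from being killed mid-merge.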

> Hi,
> 
> When a large fetch finally finishes we see the typical  -activeThreads=0
> for a long time with slightly increased RAM consumption (relative to
> during the fetch) and extremely high IO-wait time.
> 
> At first it looks like the fetch job is writing out the files it
> downloaded, but that cannot be the case, since the total data size is much
> greater than the used and available RAM. After a while the IO-wait drops
> to almost zero and CPU time increases again while the fetch job is still
> finishing. At this point RAM consumption drops back to the usual level
> seen during fetching.
> 
> My question: can anyone please explain this behaviour or at least explain
> what's happening when the fetcher finishes?
> 
> Since IO-wait simply stalls the process, the non-IO-wait time is the
> interesting part, as it may be a point of improvement. Why not do this
> work while fetching?
> 
> In this specific case it's 1.4-dev running a local job with a fetcher
> limited by time. The crawl is limited to a big TLD and takes only a few
> pages per host. Linux has been tuned to handle a high volume of packets
> (syslog no longer mentions dropped packets) and a very large list of
> hosts (the TLD).
> 
> Thanks,
> M.
