Any thoughts or explanations? This also happens in large fetch jobs on a Hadoop cluster: 300,000 URLs are downloaded within 30 minutes (including job setup), but then the job keeps "doing something" for another hour. Sometimes tasks simply fail after 10 minutes because they didn't report progress. The map task is by then still busy merging intermediate segments, which takes a long time.
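The long tail after the threads finish is consistent with the map-side sort/merge: map output is spilled to disk in chunks during the run, and at the end the spills are merged into one sorted output file. A sketch of the kind of tuning that may help here (property names are from the Hadoop 1.x generation that Nutch 1.4 runs on; the values are illustrative, not recommendations, and must fit your task heap and disks):

```xml
<!-- mapred-site.xml: illustrative values only -->
<property>
  <name>io.sort.mb</name>
  <value>400</value>
  <!-- larger in-memory sort buffer: fewer spill files to merge later -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>
  <!-- merge more spill segments per pass: fewer merge passes over disk -->
</property>
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
  <!-- 30 minutes, so tasks busy merging (but not reporting progress) aren't killed -->
</property>
```

Fewer, larger spills trade task memory for less merge I/O at the end of the map, which is where the post-fetch hour seems to go.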
> Hi,
>
> When a large fetch finally finishes we see the typical -activeThreads=0
> for a long time, with slightly increased RAM consumption (relative to
> during the fetch) and extremely high IO-wait time.
>
> At first it looks like the fetch job is writing out the files it
> downloaded, but it cannot be, since the total data size is much greater
> than the used and available RAM. After a while the IO-wait drops to
> almost zero and process time increases again while the fetch job is
> still finishing. At this point RAM consumption drops back to the usual
> level seen during a fetch.
>
> My question: can anyone please explain this behaviour, or at least
> explain what happens when the fetcher finishes?
>
> Since IO-wait simply stalls the process, the non-IO-wait time is
> interesting, since it may be a point of improvement. Why not do these
> tasks while fetching?
>
> In this specific case it's 1.4-dev running a local job with a fetcher
> limited by time. The crawl is limited to a big TLD and takes only a
> few pages per host. Linux has been tuned to allow a high number of
> packets (syslog no longer mentions dropped packets) and a very large
> list of hosts (the TLD).
>
> Thanks,
> M.

