Ah, I seem to have found what I'm looking for: io.sort.factor=K is the lever and Finished Spill=N is the indicator.
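For reference, the knobs involved live in hadoop-site.xml / mapred-site.xml on a 1.x-era cluster. The values below are purely illustrative starting points, not recommendations:

```xml
<!-- Illustrative values only: tune for your own workload. -->
<property>
  <name>io.sort.mb</name>
  <value>200</value><!-- bigger sort buffer => fewer spill files (N) -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value><!-- more segments merged per pass (K) -->
</property>
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value><!-- 30 min in ms, if the merge still overruns -->
</property>
```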
With large map output, such as the fetcher produces, we need to tune the system to get N down as far as possible by increasing K. There's probably a sweet spot for our scenario but I haven't figured it all out yet. The number of hanging threads is something I can tune in Nutch, but when Nutch finishes, Hadoop takes over, and the merge passes with a high number of segments take (on my systems) longer than 10 minutes (the map task timeout). During this phase I cannot send progress information to the tracker to prevent it from killing the task, so you either increase the timeout or reduce the time it spends merging: the io.sort.factor and io.sort.mb magic. And watch your `ulimit -n`.

> Any thoughts or explanations? This also happens in large fetch jobs on a
> Hadoop cluster. 300,000 URLs are downloaded within 30 minutes (incl. job
> set up) but then it keeps `doing something` for another hour! Sometimes
> tasks just fail after 10 minutes because they didn't report progress. The
> map is then still busy merging intermediate segments, which takes a long
> time.
>
> > Hi,
> >
> > When a large fetch finally finishes we see the typical -activeThreads=0
> > for a long time with slightly increased RAM consumption (relative to
> > during the fetch) and extremely high IO-wait time.
> >
> > At first it would look like the fetch job is writing away the files it
> > downloaded, but it cannot be, since the sum of data size is much greater
> > than the used and available RAM. After a while the IO-wait drops to
> > almost zero and process time increases again while it's still finishing
> > the fetch job. At this time RAM consumption drops back to the usual
> > level during fetch.
> >
> > My question: can anyone please explain this behaviour, or at least explain
> > what's happening when the fetcher finishes?
> >
> > Since IO-wait just stops the process, the non-IO-wait time is interesting,
> > since it may be a point of improvement. Why not do the tasks it's doing
> > while fetching?
> >
> > In this specific case it's about 1.4-dev running a local job and a
> > fetcher being limited by time. The crawl is limited to a big TLD and
> > only takes a few pages per host. Linux has been tuned to allow a high
> > number of packets (syslog doesn't mention dropping packets anymore) and
> > a very large list of hosts (the TLD).
> >
> > Thanks,
> > M.
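To see why raising K shrinks the post-fetch merge phase: each merge pass combines up to io.sort.factor spill segments into one, so roughly ceil(log_K(N)) passes over the data are needed. This is an idealized round-based sketch (the real Hadoop merger is smarter about partial first passes), but the trend holds:

```java
// Sketch: estimated number of full merge passes for N spill files
// with io.sort.factor = K, assuming a simple repeated K-way merge.
public class MergePasses {

    static int passes(int spills, int sortFactor) {
        int passes = 0;
        int segments = spills;
        while (segments > 1) {
            // One pass merges groups of up to sortFactor segments each.
            segments = (segments + sortFactor - 1) / sortFactor;
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // 400 spills: factor 10 needs 3 passes, factor 100 only 2.
        System.out.println(passes(400, 10));  // 3
        System.out.println(passes(400, 100)); // 2
    }
}
```

Each extra pass rereads and rewrites the whole map output, which is exactly the IO-wait-heavy "doing something" phase described above; fewer, wider passes trade open file descriptors (hence `ulimit -n`) for wall-clock time.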

