I've been completely baffled by the slowness of the reduce step during parsing and would really appreciate some insights.
The job runs on Hadoop, and one of the reduce tasks always takes ages to finish, stalling the reduce progress at 99%. The slowness seems to be a function of the topN parameter: parsing gets stuck at 99% for 2-3 hours when topN is 2,000,000.

This is reproducible with several segments. Here's a link to one segment where the issue is consistently reproducible:
https://dl.dropbox.com/u/4027616/segment.tar.gz

This issue has been discussed inconclusively in several threads before:
http://lucene.472066.n3.nabble.com/Parse-reduce-slow-as-a-snail-td3296865.html
http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-td3758053.html
http://lucene.472066.n3.nabble.com/ParseSegment-slow-reduce-phase-td612119.html

One suggestion from those threads is that the slowness comes from normalizing and filtering URLs before writing to disk, especially long URLs. In my setup, regex-normalize.xml is the default one, and regex-urlfilter.txt removes URLs longer than 350 chars:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(PDF|pdf|mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# skip URLS longer than a certain length
-^.{350,}$

# accept anything else
+.

I'm using the release version of Nutch 1.4.
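To check whether URL filtering is really what eats the time, a rough sketch like the one below could replay the active rules from my regex-urlfilter.txt against a list of URLs and report the slowest ones. It is Python, not Nutch's own Java code. The helper names `accepts` and `slowest` are made up for illustration. It also assumes the rules are tried in file order, that the first rule that matches anywhere in the URL decides, and that a URL matching no rule is rejected, which is how I understand the regex URL filter to behave.

```python
import re
import time

# Active rules transcribed from the regex-urlfilter.txt quoted above, in file order.
# True marks an accept rule ("+"), False a reject rule ("-").
RULES = [
    (False, re.compile(r"^(file|ftp|mailto):")),
    (False, re.compile(
        r"\.(PDF|pdf|mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT"
        r"|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM"
        r"|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$")),
    (False, re.compile(r"^.{350,}$")),
    (True, re.compile(r".")),
]


def accepts(url):
    """Hypothetical helper: the first matching rule decides; no match rejects.

    Assumption: this mirrors the ordering semantics of the regex URL filter.
    """
    for accept, pattern in RULES:
        if pattern.search(url):
            return accept
    return False


def slowest(urls, limit=10):
    """Hypothetical helper: time accepts() per URL, slowest first."""
    timings = []
    for url in urls:
        start = time.perf_counter()
        accepts(url)
        timings.append((time.perf_counter() - start, url))
    timings.sort(reverse=True)
    return timings[:limit]


if __name__ == "__main__":
    # Made-up sample input; real URLs would come from the segment's outlinks.
    sample = ["http://example.com/", "http://example.com/" + "a" * 5000]
    for seconds, url in slowest(sample):
        print(f"{seconds:.6f}s  len={len(url)}")
```

The two commented-out rules are left out because they are disabled in my config. Python's regex engine is not the one Nutch uses, so the timings would only be indicative. If no URL stands out here, the filter rules are probably not the culprit and the default regex-normalize.xml would be the next thing to time.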

