I've been completely baffled by the slowness of the reduce step during parsing
and would really appreciate some insights.

The job runs on Hadoop, and one of the reduce tasks always takes ages to
finish, stalling the reduce progress at 99%.

The slowness seems to be a function of the topN parameter: parsing gets
stuck at 99% for 2-3 hours when topN is 2,000,000.



This is reproducible with several segments. Here is a link to one segment
where the issue occurs consistently:
https://dl.dropbox.com/u/4027616/segment.tar.gz

This issue has been discussed in several earlier threads, without a conclusion:
http://lucene.472066.n3.nabble.com/Parse-reduce-slow-as-a-snail-td3296865.html
http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-td3758053.html
http://lucene.472066.n3.nabble.com/ParseSegment-slow-reduce-phase-td612119.html

One suggestion in those threads is that the slowness comes from normalizing
and filtering URLs before writing to disk, especially long URLs.


regex-normalize.xml is the default one.

regex-urlfilter.txt rejects URLs of 350 characters or more:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(PDF|pdf|mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# skip URLS longer than a certain length
-^.{350,}$

# accept anything else
+.
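
To check whether these filter rules alone could account for the time, they can be exercised outside Hadoop. What follows is a minimal Python sketch, not Nutch code: it copies the active (uncommented) rules above and assumes the filter applies them in file order, lets the first matching pattern decide, and rejects a URL that matches nothing. The names `accepts` and `time_filter` are made up for illustration, and Python's `re` engine is not the Java regex engine Nutch runs on, so any timings are only a rough indication.

```python
import re
import time

# Active rules copied from the regex-urlfilter.txt above, in file order.
# '-' rejects and '+' accepts. Assumption: the first matching pattern decides.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(PDF|pdf|mp3|MP3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT"
          r"|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM"
          r"|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"^.{350,}$"),
    ("+", r"."),
]
COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in COMPILED:
        # search() looks for the pattern anywhere in the URL.
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


def time_filter(urls):
    """Filter a list of URLs and return (kept_urls, elapsed_seconds)."""
    start = time.perf_counter()
    kept = [u for u in urls if accepts(u)]
    return kept, time.perf_counter() - start
```

Feeding `time_filter` the outlinks of a slow segment (extracted by whatever means is convenient) would show whether a handful of very long URLs dominates the elapsed time. Note that the length rule comes after the suffix rule, so under the ordering assumed here a long URL is still run through the suffix pattern before it is rejected.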

I'm using the release version of Nutch 1.4.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-Parse-Step-Bafflingly-Slow-in-Reduce-Step-with-example-tp3988820.html
Sent from the Nutch - User mailing list archive at Nabble.com.
