I am running Nutch 1.6 on Hadoop 1.1.2 on a 10-machine cluster.
I have set topN to 10,000.
The reduce phase of the parsing stage takes a very long time: the map phase
of parsing takes about 3 hours, but the reduce phase takes 15 hours for
115 million URLs. Even fetching takes 17 hours.
The last part of the reduce-phase log file is included at the end of this
message. I searched the internet and found a suggestion to limit URL length
by adding this to regex-urlfilter.txt:
# skip URLS longer than a certain length :350 or above
-^.{350,}$
But this did not help.
Any help will be appreciated.
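
To double-check what the rule itself matches, here is a minimal Python sketch. It assumes Python's `re` module treats this simple pattern the same way as the Java regex engine Nutch uses; the function name and the example URLs are made up for illustration and are not part of Nutch.

```python
import re

# Same pattern as the "-^.{350,}$" rule, without the leading "-"
# (the "-" marks the rule as an exclusion in regex-urlfilter.txt).
TOO_LONG = re.compile(r"^.{350,}$")

def is_rejected(url: str) -> bool:
    """Return True if the URL matches the exclusion pattern (hypothetical helper)."""
    return TOO_LONG.search(url) is not None

# Made-up URLs: one well under 350 characters, one well over.
short_url = "http://example.com/" + "a" * 100
long_url = "http://example.com/" + "a" * 400

print(len(short_url), is_rejected(short_url))
print(len(long_url), is_rejected(long_url))
```

So the pattern only rejects URLs of 350 characters or more; shorter URLs fall through to the remaining rules in the file.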
2013-06-24 10:17:55,150 INFO org.apache.hadoop.mapred.Merger: Down to the
last merge-pass, with 16 segments left of total size: 6802884159 bytes
2013-06-24 10:17:55,158 INFO org.apache.nutch.plugin.PluginRepository:
Plugins: looking in:
/media/sdb/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201306231247_0004/jars/classes/plugins
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Plugin Auto-activation mode: [true]
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Registered Plugins:
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository: the
nutch core extension points (nutch-extensionpoints)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Basic URL Normalizer (urlnormalizer-basic)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Language Identification Parser/Filter (language-identifier)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Basic Indexing Filter (index-basic)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository: Html
Parse Plug-in (parse-html)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository: HTTP
Framework (lib-http)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Regex URL Filter (urlfilter-regex)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Pass-through URL Normalizer (urlnormalizer-pass)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository: Http
Protocol Plug-in (protocol-http)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Regex URL Normalizer (urlnormalizer-regex)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository: Tika
Parser Plug-in (parse-tika)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository: OPIC
Scoring Plug-in (scoring-opic)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
CyberNeko HTML Parser (lib-nekohtml)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Anchor Indexing Filter (index-anchor)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Regex URL Filter Framework (lib-regex-filter)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Registered Extension-Points:
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository: HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-06-24 10:17:55,298 INFO org.apache.hadoop.conf.Configuration: found
resource regex-urlfilter.txt at
file:/media/sdb/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201306231247_0004/jars/regex-urlfilter.txt
2013-06-24 10:17:55,316 INFO org.apache.hadoop.conf.Configuration: found
resource regex-normalize.xml at
file:/media/sdb/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201306231247_0004/jars/regex-normalize.xml
2013-06-24 10:17:55,360 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory:
Successfully loaded & initialized native-zlib library
2013-06-24 10:17:55,376 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2013-06-24 10:17:55,382 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2013-06-24 10:17:55,407 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2013-06-24 10:17:55,413 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2013-06-24 10:17:55,454 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2013-06-24 10:17:55,506 INFO
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer: can't find
rules for scope 'outlink', using default
2013-06-24 10:18:11,599 INFO
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer: can't find
rules for scope 'fetcher', using default
--
View this message in context:
http://lucene.472066.n3.nabble.com/Parse-reduce-stage-take-forver-tp4072755.html
Sent from the Nutch - User mailing list archive at Nabble.com.