Hi,
I once observed a similar problem:
a few thousand docs per cycle, among them a few hundred
with very many and very long outlinks. Parsing was done
in the Fetcher (to avoid storing the raw content) and
the reduce step took hours. The segments (namely the subdirs
containing outlinks) grew to several GB. The number of URLs
finally added to the CrawlDb was comparatively small (only a few
thousand per cycle).
db.max.outlinks.per.page was set to -1 because I must not miss any
outlink. That's one reason. I didn't have the time to take a closer
look at what exactly caused the performance issue in time and disk space.
There could be some optimization by folding the outlinks before they
reach the reduce step:
- either in o.a.n.parse.ParseOutputFormat.getRecordWriter.write()
- or in Fetcher.FetcherThread.output
- or by using a combiner.
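
To illustrate what I mean by "folding", here is a minimal sketch in
Python rather than Nutch code. It assumes each outlink record is just a
(target URL, score) pair and that records for the same target can be
merged by summing their scores; both the record shape and the merge rule
are assumptions for illustration, and fold_outlinks is a hypothetical
name, not part of Nutch or Hadoop. A combiner would apply the same idea
to the map output before it is shuffled to the reducers.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (target_url, score). Not a Nutch class.
Outlink = Tuple[str, float]


def fold_outlinks(outlinks: Iterable[Outlink]) -> List[Outlink]:
    """Merge records that share a target URL into a single record.

    Assumption: merging means summing the scores. The order of first
    appearance is preserved (dicts keep insertion order in Python 3.7+).
    """
    folded: Dict[str, float] = {}
    for url, score in outlinks:
        folded[url] = folded.get(url, 0.0) + score
    return list(folded.items())


if __name__ == "__main__":
    raw = [
        ("http://example.com/a", 0.5),
        ("http://example.com/b", 0.25),
        ("http://example.com/a", 0.25),  # duplicate target, gets folded
    ]
    for url, score in fold_outlinks(raw):
        print(url, score)
```

If many pages link to the same few targets, this cuts the number of
records written to the segment and sorted in the reduce step; if nearly
all targets are distinct, it will not help much.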
Sebastian
On 06/24/2013 05:40 PM, eakarsu wrote:
> I am running Nutch 1.6 on Hadoop 1.1.2 on a cluster of 10 machines.
>
> I have set topN 10,000
>
> The reduce phase of Parsing stage takes a lot of time: map stage of parsing
> is about 3 hours but reduce phase is taking 15 hours for 115 million urls.
> Even fetching time is 17 hours.
>
> The last part of the reduce-phase log file is shown below. I have
> searched the internet, where it is suggested to limit URL length in
> regex-urlfilter.txt by adding this rule:
>
> # skip URLS longer than a certain length :350 or above
> -^.{350,}$
>
> But this did not help.
>
> Any help will be appreciated.
>
> 2013-06-24 10:17:55,150 INFO org.apache.hadoop.mapred.Merger: Down to the
> last merge-pass, with 16 segments left of total size: 6802884159 bytes
> 2013-06-24 10:17:55,158 INFO org.apache.nutch.plugin.PluginRepository:
> Plugins: looking in:
> /media/sdb/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201306231247_0004/jars/classes/plugins
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Plugin Auto-activation mode: [true]
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Registered Plugins:
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> the nutch core extension points (nutch-extensionpoints)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Basic URL Normalizer (urlnormalizer-basic)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Language Identification Parser/Filter (language-identifier)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Basic Indexing Filter (index-basic)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Html Parse Plug-in (parse-html)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> HTTP Framework (lib-http)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Regex URL Filter (urlfilter-regex)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Pass-through URL Normalizer (urlnormalizer-pass)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Http Protocol Plug-in (protocol-http)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Regex URL Normalizer (urlnormalizer-regex)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Tika Parser Plug-in (parse-tika)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> OPIC Scoring Plug-in (scoring-opic)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> CyberNeko HTML Parser (lib-nekohtml)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Anchor Indexing Filter (index-anchor)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Regex URL Filter Framework (lib-regex-filter)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Registered Extension-Points:
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2013-06-24 10:17:55,193 INFO org.apache.nutch.plugin.PluginRepository:
> Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2013-06-24 10:17:55,298 INFO org.apache.hadoop.conf.Configuration: found
> resource regex-urlfilter.txt at
> file:/media/sdb/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201306231247_0004/jars/regex-urlfilter.txt
> 2013-06-24 10:17:55,316 INFO org.apache.hadoop.conf.Configuration: found
> resource regex-normalize.xml at
> file:/media/sdb/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201306231247_0004/jars/regex-normalize.xml
> 2013-06-24 10:17:55,360 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory:
> Successfully loaded & initialized native-zlib library
> 2013-06-24 10:17:55,376 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new compressor
> 2013-06-24 10:17:55,382 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new compressor
> 2013-06-24 10:17:55,407 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new compressor
> 2013-06-24 10:17:55,413 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new compressor
> 2013-06-24 10:17:55,454 INFO org.apache.hadoop.io.compress.CodecPool: Got
> brand-new compressor
> 2013-06-24 10:17:55,506 INFO
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer: can't find
> rules for scope 'outlink', using default
> 2013-06-24 10:18:11,599 INFO
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer: can't find
> rules for scope 'fetcher', using default
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Parse-reduce-stage-take-forver-tp4072755.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>