Markus, coming back to your response: my depth is 1000 as well; it's the topN that is 30000. Could you please mention again what values you use for depth and topN?
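For anyone following along, here is roughly how I understand those two options in the one-step Crawl job: -depth is the number of generate/fetch/parse/updatedb rounds, and -topN caps how many URLs the generator selects per round, with the crawl ending early once the generator produces nothing new. A minimal sketch of the equivalent loop (the paths and the use of separate bin/nutch steps are my own illustration, not taken from this thread):

    # Illustrative only: roughly what -depth and -topN control in org.apache.nutch.crawl.Crawl
    CRAWLDB=crawldirectory/crawldb          # assumed layout, matching -dir crawldirectory
    SEGMENTS=crawldirectory/segments
    for round in $(seq 1 1000); do          # -depth 1000 => at most 1000 rounds
      # -topN 30000 => generate at most 30000 URLs per round; the real Crawl class
      # stops early when the generator produces no new segment (|| break is a stand-in)
      bin/nutch generate $CRAWLDB $SEGMENTS -topN 30000 || break
      SEGMENT=$SEGMENTS/$(ls $SEGMENTS | sort | tail -1)   # newest segment
      bin/nutch fetch $SEGMENT
      bin/nutch parse $SEGMENT
      bin/nutch updatedb $CRAWLDB $SEGMENT
    done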
In my single-node YARN cluster I see that Nutch does not crawl any more documents after a certain point, while the job is still executing on YARN, and I keep getting logs like the ones below. I am not sure what to infer from them.

2014-03-11 14:34:21,890 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-03-11 14:34:21,917 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-03-11 14:34:21,939 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-03-11 14:34:22,171 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-03-11 14:34:22,249 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-03-11 14:34:22,249 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2014-03-11 14:34:22,260 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2014-03-11 14:34:22,304 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1394515173627_0047, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@6226d537)
2014-03-11 14:34:22,386 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2014-03-11 14:34:22,908 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /tmp/hadoop-df/nm-local-dir/usercache/df/appcache/application_1394515173627_0047
2014-03-11 14:34:23,133 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-03-11 14:34:23,148 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-03-11 14:34:23,545 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2014-03-11 14:34:24,041 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2014-03-11 14:34:24,253 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://localhost:9000/user/df/crawldirectory/crawldb/current/part-00000/data:134217728+134217728
2014-03-11 14:34:24,335 INFO [main] org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2014-03-11 14:34:24,343 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2014-03-11 14:34:24,380 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2014-03-11 14:34:24,381 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 100
2014-03-11 14:34:24,381 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 83886080
2014-03-11 14:34:24,381 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 104857600
2014-03-11 14:34:24,381 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396; length = 6553600
2014-03-11 14:34:24,398 INFO [main] org.apache.nutch.plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-df/nm-local-dir/usercache/df/appcache/application_1394515173627_0047/filecache/10/job.jar/classes/plugins
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Plugin Auto-activation mode: [true]
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Registered Plugins:
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Html Parse Plug-in (parse-html)
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Basic Indexing Filter (index-basic)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: HTTP Framework (lib-http)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: More Indexing Filter (index-more)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Registered Extension-Points:
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2014-03-11 14:34:24,775 INFO [main] org.apache.hadoop.conf.Configuration: found resource regex-urlfilter.txt at jar:file:/tmp/hadoop-df/nm-local-dir/usercache/df/appcache/application_1394515173627_0047/filecache/10/job.jar/job.jar!/regex-urlfilter.txt
2014-03-11 14:34:24,841 INFO [main] org.apache.hadoop.conf.Configuration: found resource regex-normalize.xml at jar:file:/tmp/hadoop-df/nm-local-dir/usercache/df/appcache/application_1394515173627_0047/filecache/10/job.jar/job.jar!/regex-normalize.xml
2014-03-11 14:34:24,871 INFO [main] org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer: can't find rules for scope 'crawldb', using default
2014-03-11 14:36:19,628 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2014-03-11 14:36:19,628 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 79079178; bufvoid = 104857600
2014-03-11 14:36:19,628 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 25012644(100050576); length = 1201753/6553600
2014-03-11 14:36:19,628 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 80281834 kvi 20070452(80281808)
2014-03-11 14:36:20,995 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 0
2014-03-11 14:36:20,996 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 80281834 kv 20070452(80281808) kvi 20042844(80171376)

On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma <[email protected]> wrote:

> Yes, the console shows you what it is doing, stdout as well.
> In your case it is the depth that makes it take so long; it does 30.000
> crawl cycles. We do cycles of around 1000-2000 and that takes between 10
> and 15 minutes, and we skip the indexing job (we index in the Fetcher).
> In the end we do around 90-110 cycles every day, so 30.000 would take us
> almost a year! :)
>
> If your crawler does not finish all its records before the default or
> adaptive interval, it won't stop for a long time! :)
>
> -----Original message-----
> > From: S.L <[email protected]>
> > Sent: Tuesday 4th March 2014 8:09
> > To: [email protected]
> > Subject: When can the Nutch MapReduce job be considered complete?
> >
> > Hi All,
> >
> > I have set up a pseudo-distributed cluster using Hadoop 2.3 and am
> > running Nutch 1.7 on it as a MapReduce job, and I use the following
> > command to submit the job.
> >
> > /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000 -topN 30000
> >
> > I notice that the crawl is continuing even after 72 hours. I am only
> > crawling 4 websites and have disabled outlinks to external domains.
> > Most of the pages are crawled in the first few hours, but then the
> > crawl keeps on running and only very few pages are crawled in those
> > extended crawl sessions. Is my high topN value causing this seemingly
> > never-ending crawl?
> >
> > How can I track the status (from the Hadoop console or otherwise)?
> >
> > Thanks.
> >
>
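On the question quoted above about tracking status: besides watching the job in the YARN ResourceManager web UI, one way to check progress is to read the crawldb statistics directly. A sketch only, using the same job file and -dir as the command above; the exact output labels vary between versions, so verify against your install:

    # CrawlDB statistics: how many URLs are fetched vs. still unfetched
    /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
        org.apache.nutch.crawl.CrawlDbReader crawldirectory/crawldb -stats

    # Running YARN applications and their reported progress
    /mnt/hadoop-2.3.0/bin/yarn application -list

If the db_unfetched count in the stats output stops shrinking between rounds, the remaining rounds are doing very little work.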
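Regarding Markus's point about the default or adaptive fetch interval: as far as I understand it, pages that have already been fetched only become due again after db.fetch.interval.default expires (30 days by default), so later rounds tend to generate only newly discovered or retried URLs. That could be one reading of why very few pages are crawled in the extended sessions. To see which interval and schedule are in effect (just a grep against the conf/ directory the job file was built from):

    # db.fetch.interval.default and db.fetch.schedule.class live in nutch-default.xml
    # and can be overridden in nutch-site.xml
    grep -A 2 -E 'db.fetch.interval.default|db.fetch.schedule.class' conf/nutch-default.xml conf/nutch-site.xml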

