Markus, coming back to your response: my depth is 1000 as well; it's the topN that is 30000. Could you please mention again what values you use for depth and topN?
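For anyone following along, here is roughly how I understand those two options in the one-step Crawl job: -depth is the number of generate/fetch/parse/updatedb rounds, and -topN caps how many URLs the generator selects per round, with the crawl ending early once the generator produces nothing new. A minimal sketch of the equivalent loop (the paths and the use of separate bin/nutch steps are my own illustration, not taken from this thread):

    # Illustrative only: roughly what -depth and -topN control in org.apache.nutch.crawl.Crawl
    CRAWLDB=crawldirectory/crawldb          # assumed layout, matching -dir crawldirectory
    SEGMENTS=crawldirectory/segments
    for round in $(seq 1 1000); do          # -depth 1000 => at most 1000 rounds
      # -topN 30000 => generate at most 30000 URLs per round; the real Crawl class
      # stops early when the generator produces no new segment (|| break is a stand-in)
      bin/nutch generate $CRAWLDB $SEGMENTS -topN 30000 || break
      SEGMENT=$SEGMENTS/$(ls $SEGMENTS | sort | tail -1)   # newest segment
      bin/nutch fetch $SEGMENT
      bin/nutch parse $SEGMENT
      bin/nutch updatedb $CRAWLDB $SEGMENT
    done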
In my single-node YARN cluster I see that Nutch does not crawl any more documents after a certain point, while the job is still executing on YARN, and I keep getting logs like the ones below. I am not sure what to infer from them.

2014-03-11 14:34:21,890 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-03-11 14:34:21,917 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-03-11 14:34:21,939 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-03-11 14:34:22,171 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-03-11 14:34:22,249 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-03-11 14:34:22,249 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2014-03-11 14:34:22,260 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2014-03-11 14:34:22,304 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1394515173627_0047, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@6226d537)
2014-03-11 14:34:22,386 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2014-03-11 14:34:22,908 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /tmp/hadoop-df/nm-local-dir/usercache/df/appcache/application_1394515173627_0047
2014-03-11 14:34:23,133 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-03-11 14:34:23,148 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-03-11 14:34:23,545 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2014-03-11 14:34:24,041 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2014-03-11 14:34:24,253 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://localhost:9000/user/df/crawldirectory/crawldb/current/part-00000/data:134217728+134217728
2014-03-11 14:34:24,335 INFO [main] org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2014-03-11 14:34:24,343 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2014-03-11 14:34:24,380 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2014-03-11 14:34:24,381 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 100
2014-03-11 14:34:24,381 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 83886080
2014-03-11 14:34:24,381 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 104857600
2014-03-11 14:34:24,381 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396; length = 6553600
2014-03-11 14:34:24,398 INFO [main] org.apache.nutch.plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-df/nm-local-dir/usercache/df/appcache/application_1394515173627_0047/filecache/10/job.jar/classes/plugins
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Plugin Auto-activation mode: [true]
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Registered Plugins:
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Html Parse Plug-in (parse-html)
2014-03-11 14:34:24,754 INFO [main] org.apache.nutch.plugin.PluginRepository: Basic Indexing Filter (index-basic)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: HTTP Framework (lib-http)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: More Indexing Filter (index-more)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Registered Extension-Points:
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
2014-03-11 14:34:24,755 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2014-03-11 14:34:24,775 INFO [main] org.apache.hadoop.conf.Configuration: found resource regex-urlfilter.txt at jar:file:/tmp/hadoop-df/nm-local-dir/usercache/df/appcache/application_1394515173627_0047/filecache/10/job.jar/job.jar!/regex-urlfilter.txt
2014-03-11 14:34:24,841 INFO [main] org.apache.hadoop.conf.Configuration: found resource regex-normalize.xml at jar:file:/tmp/hadoop-df/nm-local-dir/usercache/df/appcache/application_1394515173627_0047/filecache/10/job.jar/job.jar!/regex-normalize.xml
2014-03-11 14:34:24,871 INFO [main] org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer: can't find rules for scope 'crawldb', using default
2014-03-11 14:36:19,628 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2014-03-11 14:36:19,628 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 79079178; bufvoid = 104857600
2014-03-11 14:36:19,628 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 25012644(100050576); length = 1201753/6553600
2014-03-11 14:36:19,628 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 80281834 kvi 20070452(80281808)
2014-03-11 14:36:20,995 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 0
2014-03-11 14:36:20,996 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 80281834 kv 20070452(80281808) kvi 20042844(80171376)

On Tue, Mar 4, 2014 at 9:58 AM, Markus Jelsma <[email protected]> wrote:

> Yes, the console shows you what it is doing, stdout as well.
> In your case it is the depth that makes it take so long; it does 30.000
> crawl cycles. We do cycles of around 1000-2000 and that takes between 10
> and 15 minutes, and we skip the indexing job (we index in the Fetcher).
> In the end we do around 90-110 cycles every day, so 30.000 would take us
> almost a year! :)
>
> If your crawler does not finish all its records before the default or
> adaptive interval, it won't stop for a long time! :)
>
> -----Original message-----
> > From: S.L <[email protected]>
> > Sent: Tuesday 4th March 2014 8:09
> > To: [email protected]
> > Subject: When can the Nutch MapReduce job be considered complete?
> >
> > Hi All,
> >
> > I have set up a pseudo-distributed cluster using Hadoop 2.3 and am
> > running Nutch 1.7 on it as a MapReduce job, and I use the following
> > command to submit the job.
> >
> > /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.Crawl /urls -dir crawldirectory -depth 1000 -topN 30000
> >
> > I notice that the crawl is continuing even after 72 hours. I am only
> > crawling 4 websites and have disabled outlinks to external domains.
> > Most of the pages are crawled in the first few hours, but then the
> > crawl keeps on running and only very few pages are crawled in those
> > extended crawl sessions. Is my high topN value causing this seemingly
> > never-ending crawl?
> >
> > How can I track the status (from the Hadoop console or otherwise)?
> >
> > Thanks.
> >
>
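On the question quoted above about tracking status: besides watching the job in the YARN ResourceManager web UI, one way to check progress is to read the crawldb statistics directly. A sketch only, using the same job file and -dir as the command above; the exact output labels vary between versions, so verify against your install:

    # CrawlDB statistics: how many URLs are fetched vs. still unfetched
    /mnt/hadoop-2.3.0/bin/hadoop jar /opt/dfconfig/nutch/apache-nutch-1.8-SNAPSHOT.job \
        org.apache.nutch.crawl.CrawlDbReader crawldirectory/crawldb -stats

    # Running YARN applications and their reported progress
    /mnt/hadoop-2.3.0/bin/yarn application -list

If the db_unfetched count in the stats output stops shrinking between rounds, the remaining rounds are doing very little work.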
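Regarding Markus's point about the default or adaptive fetch interval: as far as I understand it, pages that have already been fetched only become due again after db.fetch.interval.default expires (30 days by default), so later rounds tend to generate only newly discovered or retried URLs. That could be one reading of why very few pages are crawled in the extended sessions. To see which interval and schedule are in effect (just a grep against the conf/ directory the job file was built from):

    # db.fetch.interval.default and db.fetch.schedule.class live in nutch-default.xml
    # and can be overridden in nutch-site.xml
    grep -A 2 -E 'db.fetch.interval.default|db.fetch.schedule.class' conf/nutch-default.xml conf/nutch-site.xml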

