Hi Tejas,

If I set a larger value for topN, such as 1000, I get a "job failed" error at the end of the fetching. 500 seems to be the optimal value, and the fetching completes with this value without any issue.
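For reference, topN only caps how many URLs go into one fetchlist; how those URLs are spread across hosts or domains is governed by the properties discussed in the quoted messages below. Here is roughly how they would sit together in conf/nutch-site.xml. This is only a sketch: the values are the ones mentioned in this thread, not a configuration known to fix the problem, and the file location assumes a default Nutch 1.x layout.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Generator: count URLs per domain instead of per host -->
  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>
  <!-- Generator: cap on URLs per domain in one fetchlist (-1 = no cap) -->
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <!-- Generator: how the fetchlist is partitioned -->
  <property>
    <name>partition.url.mode</name>
    <value>byDomain</value>
  </property>
  <!-- Fetcher: how URLs are grouped into fetch queues -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byDomain</value>
  </property>
  <!-- Fetcher: threads allowed to work on one queue at a time -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
  </property>
</configuration>
```

As far as I can tell, the "Using queue mode" line in the fetcher log reflects fetcher.queue.mode rather than generate.count.mode, so if it still reports byHost it may be worth checking that this file is the one the running Nutch instance actually loads.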
I am using Nutch 1.6 right now; I will also be installing 2.1 after I have installed HBase on my Windows machine.

Below is some of the content of the log file:

2013-01-29 08:44:21,902 INFO crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb
2013-01-29 08:44:25,338 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb
2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:44:35,016 INFO crawl.CrawlDbReader - retry 0: 96030
2013-01-29 08:44:35,016 INFO crawl.CrawlDbReader - retry 1: 293
2013-01-29 08:44:35,016 INFO crawl.CrawlDbReader - retry 2: 80
2013-01-29 08:44:35,016 INFO crawl.CrawlDbReader - retry 3: 1
2013-01-29 08:44:35,017 INFO crawl.CrawlDbReader - min score: 0.0
2013-01-29 08:44:35,017 INFO crawl.CrawlDbReader - avg score: 2.8775778E-4
2013-01-29 08:44:35,017 INFO crawl.CrawlDbReader - max score: 3.071
2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 85672
2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 2 (db_fetched): 7598
2013-01-29 08:44:35,019 INFO crawl.CrawlDbReader - status 3 (db_gone): 17
2013-01-29 08:44:35,020 INFO crawl.CrawlDbReader - status 4 (db_redir_temp): 449
2013-01-29 08:44:35,021 INFO crawl.CrawlDbReader - status 5 (db_redir_perm): 1115
2013-01-29 08:44:35,024 INFO crawl.CrawlDbReader - status 6 (db_notmodified): 1553
2013-01-29 08:44:35,055 INFO crawl.CrawlDbReader - CrawlDb statistics: done
2013-01-29 08:48:09,474 INFO crawl.Generator - Generator: starting at 2013-01-29 08:48:09
2013-01-29 08:48:09,475 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2013-01-29 08:48:09,475 INFO crawl.Generator - Generator: filtering: true
2013-01-29 08:48:09,476 INFO crawl.Generator - Generator: normalizing: true
2013-01-29 08:48:09,476 INFO crawl.Generator - Generator: topN: 50
2013-01-29 08:48:09,478 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2013-01-29 08:48:10,646 INFO plugin.PluginRepository - Plugins: looking in: C:\apache-nutch-1.6\plugins
2013-01-29 08:48:11,273 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Registered Plugins:
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Registered Extension-Points:
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-01-29 08:48:11,274 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-01-29 08:48:11,275 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-01-29 08:48:11,275 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-01-29 08:48:11,275 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-01-29 08:48:11,275 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-01-29 08:48:11,502 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-29 08:48:11,502 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-01-29 08:48:11,502 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-29 08:48:26,968 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2013-01-29 08:49:00,769 INFO crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb
2013-01-29 08:49:01,292 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-01-29 08:49:04,221 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-29 08:49:04,221 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-01-29 08:49:04,221 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-29 08:49:04,223 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-01-29 08:49:05,395 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
2013-01-29 08:49:05,594 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb
2013-01-29 08:49:05,595 INFO crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:49:05,595 INFO crawl.CrawlDbReader - retry 0: 96030
2013-01-29 08:49:05,595 INFO crawl.CrawlDbReader - retry 1: 293
2013-01-29 08:49:05,595 INFO crawl.CrawlDbReader - retry 2: 80
2013-01-29 08:49:05,595 INFO crawl.CrawlDbReader - retry 3: 1
2013-01-29 08:49:05,596 INFO crawl.CrawlDbReader - min score: 0.0
2013-01-29 08:49:05,596 INFO crawl.CrawlDbReader - avg score: 2.8775778E-4
2013-01-29 08:49:05,596 INFO crawl.CrawlDbReader - max score: 3.071
2013-01-29 08:49:05,596 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 85672
2013-01-29 08:49:05,596 INFO crawl.CrawlDbReader - status 2 (db_fetched): 7598
2013-01-29 08:49:05,596 INFO crawl.CrawlDbReader - status 3 (db_gone): 17
2013-01-29 08:49:05,597 INFO crawl.CrawlDbReader - status 4 (db_redir_temp): 449
2013-01-29 08:49:05,601 INFO crawl.CrawlDbReader - status 5 (db_redir_perm): 1115
2013-01-29 08:49:05,604 INFO crawl.CrawlDbReader - status 6 (db_notmodified): 1553
2013-01-29 08:49:05,622 INFO crawl.CrawlDbReader - CrawlDb statistics: done
2013-01-29 08:49:06,396 INFO crawl.Generator - Generator: segment: crawl/segments/20130129084906
2013-01-29 08:49:07,350 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2013-01-29 08:49:08,163 INFO crawl.Generator - Generator: finished at 2013-01-29 08:49:08, elapsed: 00:00:58
2013-01-29 08:49:32,971 INFO fetcher.Fetcher - Fetcher: starting at 2013-01-29 08:49:32
2013-01-29 08:49:32,972 INFO fetcher.Fetcher - Fetcher: segment: crawl/segments/20130129084906
2013-01-29 08:49:34,341 INFO fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,341 INFO fetcher.Fetcher - Fetcher: threads: 10
2013-01-29 08:49:34,342 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
2013-01-29 08:49:34,357 INFO plugin.PluginRepository - Plugins: looking in: C:\apache-nutch-1.6\plugins
2013-01-29 08:49:34,361 INFO fetcher.Fetcher - QueueFeeder finished: total 50 records + hit by time limit :0
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Registered Plugins:
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Registered Extension-Points:
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-01-29 08:49:34,546 INFO fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO fetcher.Fetcher - fetching http://www.example.com
2013-01-29 08:49:34,549 INFO fetcher.Fetcher - Using queue mode : byHost

Tejas Patil wrote
> Hey Peter,
>
> Give a bigger value for topN parameter. Also, use:
> <property>
>   <name>generate.max.count</name>
>   <value>-1</value>
> </property>
>
> <property>
>   <name>generate.count.mode</name>
>   <value>domain</value>
> </property>
> Not sure why you see queue mode as byhost and not by domain. Did it print
> that in the logs?
> I should have asked you this before: Are you using nutch 1.X or 2.x?
>
> thanks,
> Tejas Patil
>
>
> On Tue, Jan 29, 2013 at 12:08 AM, peterbarretto <peterbarretto08@> wrote:
>
>> Hi Tejas,
>>
>> I changed the generate.count.mode to domain and generate.max.count to 100
>> but still it shows queue mode as byhost and not by domain.
>>
>> peterbarretto wrote
>> > Hi Tejas
>> >
>> > The fetcher.threads.per.host property has been deprecated and replaced
>> > with fetcher.threads.per.queue.
>> > I am not sure if fetcher.threads.per.queue will help the fetching, as the
>> > generator only generates the fetchlist from 2 or 3 domains. How can I tell
>> > the generator to create a fetchlist with an equal number of urls from all
>> > domains?
>> >
>> > I am sure there are urls from the other domains, but I guess since the url
>> > score is less it fetches from only 2 domains.
>> >
>> > I will try increasing fetcher.threads.per.queue to 5 and see if the fetch
>> > speed is increased and let you know.
>> > Tejas Patil wrote
>> >> Hey Peter,
>> >>
>> >> I am guessing that you have just increased the global thread count. Have
>> >> you even increased "fetcher.threads.per.host"? This will improve the crawl
>> >> rate as multiple threads can attack the same site. Don't make it too high
>> >> or else the system will get overloaded. The nutch wiki has an article [0]
>> >> about the potential reasons for slow crawls and some good suggestions.
>> >>
>> >> [0] : https://wiki.apache.org/nutch/OptimizingCrawls
>> >>
>> >> Thanks,
>> >> Tejas Patil
>> >>
>> >> On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <peterbarretto08@> wrote:
>> >>
>> >>> I tried increasing the number of threads to 50 but the speed is not
>> >>> affected.
>> >>>
>> >>> I tried changing the partition.url.mode value to byDomain and
>> >>> fetcher.queue.mode to byDomain but still it does not help the speed.
>> >>> It seems to get urls from 2 domains now and the other domains are not
>> >>> getting crawled. Is this due to the url score? If so, how do I crawl urls
>> >>> from all the domains?
>> >>>
>> >>> lewis john mcgibbney wrote
>> >>> > Increase number of threads when fetching.
>> >>> > Also please see nutch-default.xml for partitioning of urls; if you know
>> >>> > your target domains you may wish to adapt the policy.
>> >>> > Lewis
>> >>> >
>> >>> > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
>> >>> >> I want to increase the number of urls fetched at a time in nutch. I have
>> >>> >> around 10 websites to crawl, so how can I crawl all the sites at a time?
>> >>> >> Right now I am fetching 1 site with a fetch delay of 2 seconds but it is
>> >>> >> too slow. How to concurrently fetch from different domains?
>> >>> >>
>> >>> >> --
>> >>> >> View this message in context:
>> >>> >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
>> >>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >>> >
>> >>> > --
>> >>> > *Lewis*
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
>> >>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036976.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4037282.html
Sent from the Nutch - User mailing list archive at Nabble.com.

