Re: increase the number of fetches at a given time on nutch 1.6 or 2.1

peterbarretto Wed, 30 Jan 2013 19:53:05 -0800

Hi Tejas,

If I set topN to a larger value such as 1000, I get a "job failed" error at the
end of the fetch.
500 seems to be the optimal value; the fetch completes with this value
without any issue.


I am using Nutch 1.6 right now; I will also install 2.1 once I have
installed HBase on my Windows machine.

Below is part of the log file:

2013-01-29 08:44:21,902 INFO  crawl.CrawlDbReader - CrawlDb statistics
start: crawl/crawldb
2013-01-29 08:44:25,338 WARN  mapred.JobClient - Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.
2013-01-29 08:44:35,014 INFO  crawl.CrawlDbReader - Statistics for CrawlDb:
crawl/crawldb
2013-01-29 08:44:35,014 INFO  crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:44:35,016 INFO  crawl.CrawlDbReader - retry 0:    96030
2013-01-29 08:44:35,016 INFO  crawl.CrawlDbReader - retry 1:    293
2013-01-29 08:44:35,016 INFO  crawl.CrawlDbReader - retry 2:    80
2013-01-29 08:44:35,016 INFO  crawl.CrawlDbReader - retry 3:    1
2013-01-29 08:44:35,017 INFO  crawl.CrawlDbReader - min score:  0.0
2013-01-29 08:44:35,017 INFO  crawl.CrawlDbReader - avg score:  2.8775778E-4
2013-01-29 08:44:35,017 INFO  crawl.CrawlDbReader - max score:  3.071
2013-01-29 08:44:35,018 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):
85672
2013-01-29 08:44:35,018 INFO  crawl.CrawlDbReader - status 2 (db_fetched):
7598
2013-01-29 08:44:35,019 INFO  crawl.CrawlDbReader - status 3 (db_gone): 17
2013-01-29 08:44:35,020 INFO  crawl.CrawlDbReader - status 4
(db_redir_temp):        449
2013-01-29 08:44:35,021 INFO  crawl.CrawlDbReader - status 5
(db_redir_perm):        1115
2013-01-29 08:44:35,024 INFO  crawl.CrawlDbReader - status 6
(db_notmodified):       1553
2013-01-29 08:44:35,055 INFO  crawl.CrawlDbReader - CrawlDb statistics: done
2013-01-29 08:48:09,474 INFO  crawl.Generator - Generator: starting at
2013-01-29 08:48:09
2013-01-29 08:48:09,475 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2013-01-29 08:48:09,475 INFO  crawl.Generator - Generator: filtering: true
2013-01-29 08:48:09,476 INFO  crawl.Generator - Generator: normalizing: true
2013-01-29 08:48:09,476 INFO  crawl.Generator - Generator: topN: 50
2013-01-29 08:48:09,478 INFO  crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2013-01-29 08:48:10,646 INFO  plugin.PluginRepository - Plugins: looking in:
C:\apache-nutch-1.6\plugins
2013-01-29 08:48:11,273 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - Registered Plugins:
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Regex URL
Normalizer (urlnormalizer-regex)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Basic URL
Normalizer (urlnormalizer-basic)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Tika Parser
Plug-in (parse-tika)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Anchor Indexing
Filter (index-anchor)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         HTTP Framework
(lib-http)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Regex URL Filter
(urlfilter-regex)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Regex URL Filter
Framework (lib-regex-filter)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Pass-through URL
Normalizer (urlnormalizer-pass)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Http Protocol
Plug-in (protocol-http)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository - Registered
Extension-Points:
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-01-29 08:48:11,274 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2013-01-29 08:48:11,275 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2013-01-29 08:48:11,275 INFO  plugin.PluginRepository -         HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-01-29 08:48:11,275 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2013-01-29 08:48:11,275 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2013-01-29 08:48:11,502 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-29 08:48:11,502 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2013-01-29 08:48:11,502 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2013-01-29 08:48:26,968 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'partition', using default
2013-01-29 08:49:00,769 INFO  crawl.CrawlDbReader - CrawlDb statistics
start: crawl/crawldb
2013-01-29 08:49:01,292 WARN  mapred.JobClient - Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.
2013-01-29 08:49:04,221 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-29 08:49:04,221 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2013-01-29 08:49:04,221 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2013-01-29 08:49:04,223 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'generate_host_count', using default
2013-01-29 08:49:05,395 INFO  crawl.Generator - Generator: Partitioning
selected urls for politeness.
2013-01-29 08:49:05,594 INFO  crawl.CrawlDbReader - Statistics for CrawlDb:
crawl/crawldb
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - retry 0:    96030
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - retry 1:    293
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - retry 2:    80
2013-01-29 08:49:05,595 INFO  crawl.CrawlDbReader - retry 3:    1
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - min score:  0.0
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - avg score:  2.8775778E-4
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - max score:  3.071
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):
85672
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - status 2 (db_fetched):
7598
2013-01-29 08:49:05,596 INFO  crawl.CrawlDbReader - status 3 (db_gone): 17
2013-01-29 08:49:05,597 INFO  crawl.CrawlDbReader - status 4
(db_redir_temp):        449
2013-01-29 08:49:05,601 INFO  crawl.CrawlDbReader - status 5
(db_redir_perm):        1115
2013-01-29 08:49:05,604 INFO  crawl.CrawlDbReader - status 6
(db_notmodified):       1553
2013-01-29 08:49:05,622 INFO  crawl.CrawlDbReader - CrawlDb statistics: done
2013-01-29 08:49:06,396 INFO  crawl.Generator - Generator: segment:
crawl/segments/20130129084906
2013-01-29 08:49:07,350 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'partition', using default
2013-01-29 08:49:08,163 INFO  crawl.Generator - Generator: finished at
2013-01-29 08:49:08, elapsed: 00:00:58
2013-01-29 08:49:32,971 INFO  fetcher.Fetcher - Fetcher: starting at
2013-01-29 08:49:32
2013-01-29 08:49:32,972 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20130129084906
2013-01-29 08:49:34,341 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,341 INFO  fetcher.Fetcher - Fetcher: threads: 10
2013-01-29 08:49:34,342 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
2013-01-29 08:49:34,357 INFO  plugin.PluginRepository - Plugins: looking in:
C:\apache-nutch-1.6\plugins
2013-01-29 08:49:34,361 INFO  fetcher.Fetcher - QueueFeeder finished: total
50 records + hit by time limit :0
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - Registered Plugins:
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Regex URL
Normalizer (urlnormalizer-regex)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Basic URL
Normalizer (urlnormalizer-basic)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Tika Parser
Plug-in (parse-tika)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Anchor Indexing
Filter (index-anchor)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         HTTP Framework
(lib-http)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Regex URL Filter
(urlfilter-regex)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Regex URL Filter
Framework (lib-regex-filter)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Pass-through URL
Normalizer (urlnormalizer-pass)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Http Protocol
Plug-in (protocol-http)
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository - Registered
Extension-Points:
2013-01-29 08:49:34,476 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository -         Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository -         HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2013-01-29 08:49:34,477 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2013-01-29 08:49:34,546 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO  fetcher.Fetcher - fetching
http://www.example.com
2013-01-29 08:49:34,549 INFO  fetcher.Fetcher - Using queue mode : byHost
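
To put the topN values in perspective, here is a rough back-of-the-envelope
sketch in Python. It only uses the counts from the CrawlDb statistics in the
log above and assumes, unrealistically, that every generate/fetch round fetches
exactly topN URLs and that no new URLs are discovered along the way. The
function names are made up for this illustration.

```python
import math

# Counts copied from the CrawlDb statistics in the log above.
TOTAL_URLS = 96404
DB_UNFETCHED = 85672
DB_FETCHED = 7598


def rounds_needed(unfetched: int, top_n: int) -> int:
    """Generate/fetch rounds needed if each round fetches exactly top_n URLs.

    Assumption: no failures, no retries, no newly discovered URLs.
    """
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    return math.ceil(unfetched / top_n)


def fetched_share(fetched: int, total: int) -> float:
    """Fraction of the CrawlDb that is already in the db_fetched state."""
    return fetched / total


if __name__ == "__main__":
    print(f"fetched so far: {fetched_share(DB_FETCHED, TOTAL_URLS):.1%}")
    for top_n in (50, 500, 1000):
        print(f"topN={top_n}: about {rounds_needed(DB_UNFETCHED, top_n)} rounds")
```

Under those assumptions the number of rounds falls roughly in proportion to
topN, which is why getting a larger topN to complete would matter here.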



Tejas Patil wrote
> Hey Peter,
> 
> Give a bigger value for topN parameter. Also, use:
> <property>
>   <name>generate.max.count</name>
>   <value>-1</value>
> </property>
>
> <property>
>   <name>generate.count.mode</name>
>   <value>domain</value>
> </property>
> Not sure why you see the queue mode as byHost and not byDomain. Did it print
> that in the logs?
> I should have asked you this before: are you using Nutch 1.x or 2.x?
> 
> thanks,
> Tejas Patil
> 
> 
> On Tue, Jan 29, 2013 at 12:08 AM, peterbarretto <peterbarretto08@> wrote:
> 
>> Hi Tejas,
>>
>> I changed the generate.count.mode to domain and generate.max.count to 100
>> but still it shows queue mode as byhost and not by domain.
>>
>>
>>
>> peterbarretto wrote
>> > Hi Tejas
>> >
>> > The fetcher.threads.per.host property has been deprecated and replaced
>> > with fetcher.threads.per.queue.
>> > I am not sure if fetcher.threads.per.queue will help the fetching, as the
>> > generator only generates the fetchlist from 2 or 3 domains. How can I tell
>> > the generator to create a fetchlist with an equal number of URLs from all
>> > domains?
>> >
>> > I am sure there are URLs from the other domains, but I guess that since
>> > their score is lower it fetches from only 2 domains.
>> >
>> > I will try increasing fetcher.threads.per.queue to 5, see if the fetch
>> > speed increases, and let you know.
>> > Tejas Patil wrote
>> >> Hey Peter,
>> >>
>> >> I am guessing that you have just increased the global thread count. Have
>> >> you even increased "fetcher.threads.per.host"? This will improve the crawl
>> >> rate, as multiple threads can hit the same site. Don't make it too high,
>> >> or else the system will get overloaded. The Nutch wiki has an article [0]
>> >> about the potential reasons for slow crawls and some good suggestions.
>> >>
>> >> [0] : https://wiki.apache.org/nutch/OptimizingCrawls
>> >>
>> >> Thanks,
>> >> Tejas Patil
>> >>
>> >>
>> >> On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <peterbarretto08@> wrote:
>> >>
>> >>> I tried increasing the number of threads to 50, but the speed is not
>> >>> affected.
>> >>>
>> >>>
>> >>> I tried changing the partition.url.mode value to byDomain and
>> >>> fetcher.queue.mode to byDomain, but it still does not help the speed.
>> >>> It seems to get URLs from 2 domains now, and the other domains are not
>> >>> getting crawled. Is this due to the URL score? If so, how do I crawl URLs
>> >>> from all the domains?
>> >>>
>> >>>
>> >>> lewis john mcgibbney wrote
>> >>> > Increase the number of threads when fetching.
>> >>> > Also please see nutch-default.xml for partitioning of URLs; if you know
>> >>> > your target domains you may wish to adapt the policy.
>> >>> > Lewis
>> >>> >
>> >>> > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
>> >>> >> I want to increase the number of URLs fetched at a time in Nutch. I
>> >>> >> have around 10 websites to crawl, so how can I crawl all the sites at
>> >>> >> the same time? Right now I am fetching 1 site with a fetch delay of
>> >>> >> 2 seconds, but it is too slow. How can I fetch concurrently from
>> >>> >> different domains?
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> --
>> >>> >> View this message in context:
>> >>> >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
>> >>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >>> >>
>> >>> >
>> >>> > --
>> >>> > *Lewis*
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
>> >>> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >>>
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036976.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
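
The question in the quoted thread about getting URLs from all domains, and
the generate.max.count / generate.count.mode suggestion, can be illustrated
with a small Python toy model. This is only a sketch of the general idea
(select the best-scoring URLs, optionally capping how many come from one
site); it is not Nutch's Generator code. The function name select_fetchlist,
the a.example / b.example / c.example hosts and the scores are all made up,
and the model counts by host name rather than by registered domain to stay
short.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_fetchlist(scored_urls, top_n, max_per_host=-1):
    """Pick up to top_n URLs by descending score, with an optional per-host cap.

    max_per_host=-1 means "no cap" (assumed here to mirror the -1 convention
    of generate.max.count).
    """
    picked = []
    per_host = defaultdict(int)
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    for url, _score in ranked:
        if len(picked) >= top_n:
            break
        host = urlparse(url).hostname
        if max_per_host >= 0 and per_host[host] >= max_per_host:
            continue  # this host already has its share of the fetchlist
        per_host[host] += 1
        picked.append(url)
    return picked


# Hypothetical data: one high-scoring host and two lower-scoring ones.
SAMPLE = (
    [(f"http://a.example/{i}", 1.0 - i * 0.01) for i in range(5)]
    + [(f"http://b.example/{i}", 0.5 - i * 0.01) for i in range(5)]
    + [(f"http://c.example/{i}", 0.1 - i * 0.01) for i in range(5)]
)

if __name__ == "__main__":
    print(select_fetchlist(SAMPLE, top_n=6))                  # no cap
    print(select_fetchlist(SAMPLE, top_n=6, max_per_host=2))  # capped
```

In this toy model, the uncapped selection is dominated by the best-scoring
host, while a cap of 2 spreads the six slots evenly over the three hosts.
Whether the real generator behaves the same way with a given configuration
should be checked against nutch-default.xml and the logs.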





--
View this message in context: 
http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4037282.html
Sent from the Nutch - User mailing list archive at Nabble.com.
