Hello,

I think the problem is with the storage, not Nutch itself. It looks like generate
cannot read the status or fetch time (or gets null values) from mysql.
I had a bunch of issues with mysql storage and switched to HBase in the end.
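If it helps to confirm that, one way is to look at what actually lands in the SQL store after an update step. A rough sketch, assuming the Gora SQL backend's default `webpage` table and hypothetical database/user names — adjust the table and column names to whatever your schema actually uses:

```shell
# Hypothetical check: do the status/fetch-time columns come back NULL?
# Table and column names follow the Gora 0.2-era SQL mapping and may differ.
mysql -u nutch -p nutchdb \
  -e "SELECT id, status, fetchTime, fetchInterval FROM webpage LIMIT 10;"
```

If status or fetchTime show up as NULL here, generate has nothing to score against and will keep re-selecting the same URLs.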

Alex.
-----Original Message-----
From: Sebastian Nagel <[email protected]>
To: user <[email protected]>
Sent: Thu, Oct 18, 2012 12:08 pm
Subject: Re: Same pages crawled more than once and slow crawling


Hi Luca,

> I'm using Nutch 2.1 on Linux and I'm having a similar problem to
> http://goo.gl/nrDLV: my Nutch is fetching the same pages in each round.
Um... I failed to reproduce Pierre's problem with
- a simpler configuration
- HBase as back-end (Pierre and Luca both use mysql)
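For reference, switching Nutch 2.x from the SQL backend to HBase is mostly a Gora configuration change. A sketch of the two usual edits — the dependency revision shown is an assumption, check the actual line in your ivy.xml:

```
# conf/gora.properties -- point Gora at the HBase store
storage.data.store.class=org.apache.gora.hbase.store.HBaseStore

# ivy/ivy.xml -- uncomment the gora-hbase dependency (rev may differ)
# <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
```

After editing, rebuild with "ant runtime" so the HBase jars end up in runtime/local/lib.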

> Then I ran "bin/nutch crawl urls -threads 1"
>
> first.htm was fetched 5 times
> second.htm was fetched 4 times
> third.htm was fetched 3 times
But after the 5th cycle the crawler stopped?

> I tried doing each step separately (inject, generate, ...) with the same
> results.
For Pierre this has worked...
Any suggestions?
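In case it helps to compare notes: the individual steps of one cycle in Nutch 2.x look roughly like this (a sketch, not a definitive recipe — exact flags vary between 2.x releases, so check the usage output of each command):

```shell
bin/nutch inject urls        # seed the store with urls/seed.txt
bin/nutch generate -topN 10  # select a batch of URLs due for fetching
bin/nutch fetch -all         # fetch the generated batch
bin/nutch parse -all         # parse the fetched content
bin/nutch updatedb           # write status/fetch time back to the store
```

If the same URLs reappear in every generate, that again points at the status/fetch-time fields not being persisted by the backend.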

> Also the whole process takes about 2 minutes, am I missing something about
> some delay config or is this normal?
Well, Nutch (resp. Hadoop) is designed to process large amounts of data. Job
management has some overhead (and some artificial sleeps): 5 cycles * 4 jobs
(generate/fetch/parse/update) = 20 jobs. 6s per job seems roughly ok, though
it could be slightly faster.
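A quick sanity check of that estimate — 5 cycles at 4 jobs each, with roughly 6 seconds of overhead per job — lands right at your two minutes:

```shell
# back-of-the-envelope: cycles * jobs-per-cycle * ~seconds-per-job
echo $((5 * 4 * 6))   # prints 120 (seconds), i.e. about 2 minutes
```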

Sebastian

On 10/18/2012 05:55 PM, Luca Vasarelli wrote:
> Hello,
> 
> I'm using Nutch 2.1 on Linux and I'm having a similar problem to
> http://goo.gl/nrDLV: my Nutch is fetching the same pages in each round.
> 
> I've built a simple localhost site, with 3 pages linked to each other:
> first.htm -> second.htm -> third.htm
> 
> I did these steps:
> 
> - downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
> - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about the mysql
> backend (thanks to [1])
> - edited ${TEMP_NUTCH}/conf/gora.properties removing the default sql
> configuration and adding mysql properties (thanks to [1])
> - ran "ant runtime" from ${TEMP_NUTCH}
> - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
> - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name and
> http.robots.agents, and changing db.ignore.external.links to true and
> fetcher.server.delay to 0.0
> - created ${NUTCH_HOME}/urls/seed.txt with "http://localhost/test/first.htm"
> inside this file
> - created db table as [1]
> 
> Then I ran "bin/nutch crawl urls -threads 1"
> 
> first.htm was fetched 5 times
> second.htm was fetched 4 times
> third.htm was fetched 3 times
> 
> I tried doing each step separately (inject, generate, ...) with the same
> results.
> 
> Also the whole process takes about 2 minutes, am I missing something about
> some delay config or is this normal?
> 
> Some extra info:
> 
> - HTML of the pages: http://pastebin.com/dyDPJeZs
> - Hadoop log: http://pastebin.com/rwQQPnkE
> - nutch-site.xml: http://pastebin.com/0WArkvh5
> - Wireshark log: http://pastebin.com/g4Bg17Ls
> - MySQL table: http://pastebin.com/gD2SvGsy
> 
> [1] http://nlp.solutions.asia/?p=180
