Ok, I did the steps manually and it worked. So the problem did come from the
crawl command.
I did set fetcher.store.content = false because I'm only interested in backlink
crawling.
So you are telling me that there is no way to run Nutch in an automatic way? If I want to
crawl a small part of the web, am I supposed to repeat the steps manually or write a script that
loops over generate/fetch/parse/updatedb? That doesn't sound good...
Is it planned to have a script that already handles this generate-fetch-parse-updatedb loop, with some
tweaks like a maximum crawl depth or a maximum crawl time?
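In the meantime, I suppose something like the rough loop below is what such a script would do. This is only a sketch based on the step-by-step tutorials you linked; the number of rounds stands in for the "maximum depth" tweak, and the command flags may need adjusting for a given Nutch 2.x setup.

  #!/bin/bash
  # Sketch of an automated inject + generate/fetch/parse/updatedb loop for Nutch 2.x.
  # Assumes it is run from the Nutch runtime directory and that seed/ holds the seed URLs.

  ROUNDS=5       # maximum number of rounds, i.e. the crawl "depth"
  TOPN=10000     # URLs generated per round

  bin/nutch inject seed/

  for ((i=1; i<=ROUNDS; i++)); do
    echo "--- round $i of $ROUNDS ---"
    bin/nutch generate -topN "$TOPN" || break   # bail out if the generate step fails
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb
  done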
On 15/10/2012 22:11, Sebastian Nagel wrote:
Hi Pierre,
I tried almost the same with just the default settings
(only the http-agent is set in nutch-site.xml: it's not Googlebot :-O).
Everything went ok; no documents were crawled twice.
I don't know exactly what went wrong
and didn't find a definitive hint in your logs. Some suggestions:
- the crawl command is deprecated, see
https://issues.apache.org/jira/browse/NUTCH-1087
- you should try to perform the steps
inject
generate
fetch
parse
updatedb
"by hand". This gives you more insights what is going on.
Repeat the steps generate, fetch, parse, updatedb as many times as needed.
There are many tutorials out there on how to crawl step by step, e.g.
http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html
And finally, of course, the official tutorial, though (sorry) it's rather short:
http://wiki.apache.org/nutch/Nutch2Tutorial
- set fetcher.parse = false and fetcher.store.content = true
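For example, the two properties would look roughly like this in nutch-site.xml (just a sketch):

  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>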
Good luck,
Sebastian
On 10/15/2012 02:27 PM, Pierre wrote:
Hi Tejas,
So all URLs are affected by the problem: they are all fetched 3 or 4 times
during the crawl. I did not edit any fetch interval and I didn't see any exceptions.
I ran another test; before the test I deleted all the records from the webpage
table.
I ran : "bin/nutch crawl seed/ -depth 5 -topN 10000" with seed url
http://serphacker.com/crawltest/
The apache logs of the remote server : http://pastebin.com/tkMPmpuK
The hadoop.log : http://pastebin.com/xRCuKQ5g
The id,status of the webpage table at the end of the crawl :
http://pastebin.com/ZVUC5As5
The nutch-site.xml : http://pastebin.com/WD5Cyyin
The regex url filter : +https?://.*serphacker\.com/crawltest/
nutch-default.xml not edited
On 13/10/2012 20:50, Tejas Patil wrote:
Hi Pierre,
Can you supply some additional information:
1. What is the status of that URL now? If, say, it is unfetched in the first
round, then it will be considered again in the 2nd round and so on. Maybe there
is something about that URL which causes an exception, so Nutch retries it
in all subsequent rounds.
2. I guess you have not modified the fetch interval for the URLs. Typically it is
set to 30 days, but if a user changes it to, say, 4 seconds, that URL becomes
eligible to be fetched again in the very next round (see the snippet after this list).
3. Did you observe any exceptions in any logs? Please share those.
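For reference, the interval is controlled by db.fetch.interval.default; this is roughly how the stock entry looks (a sketch of the nutch-default.xml setting, 2592000 seconds = 30 days):

  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
    <!-- 30 days; a much smaller value makes URLs eligible for re-fetch in the next round -->
  </property>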
Thanks,
Tejas
On Sat, Oct 13, 2012 at 10:07 AM, Pierre Nogues <[email protected]> wrote:
Hello,
I'm using Nutch 2.1 with MySQL, and when I do a simple "bin/nutch crawl
seed/ -depth 5 -topN 10000", I notice that Nutch fetches the same URL 3 or 4
times during the crawl. Why?
I just configured Nutch to crawl a single website (restriction in
regex-urlfilter); everything else looks ok in MySQL.
nutch-site.xml: http://pastebin.com/Mx9s5Kfz