Hi Pierre,

I tried almost the same thing with just the default settings
(only http.agent.name is set in nutch-site.xml: it's not Googlebot :-O).
All went OK, no documents were crawled twice.
I don't know what exactly is going wrong on your side
and couldn't find a definitive hint in your logs. Some suggestions:

- the crawl command is deprecated, see 
https://issues.apache.org/jira/browse/NUTCH-1087

- you should try to perform the steps
    inject
    generate
    fetch
    parse
    updatedb
  "by hand". This gives you more insights what is going on.
  Repeat the steps generate, fetch, parse, updatedb as many times as needed.
  There are many tutorials out there how to crawl step-by-step, eg.
   http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html
  Finally, of course, but (sorry) it's rather short:
   http://wiki.apache.org/nutch/Nutch2Tutorial
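  For Nutch 2.x one round could look roughly like the sketch below. Take it
  as an illustration only: the exact options differ a bit between versions
  (fetch and parse also accept the batch id printed by generate instead of
  -all), so better check the usage output of each command. The seed dir and
  -topN are simply taken from your crawl call.
    # once: add the seed URLs to the webpage table
    bin/nutch inject seed/
    # one round; repeat generate .. updatedb for every further round
    bin/nutch generate -topN 10000   # mark a batch of URLs for fetching
    bin/nutch fetch -all             # fetch the generated batch
    bin/nutch parse -all             # parse the fetched content
    bin/nutch updatedb               # update the webpage table with new links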

- set fetcher.parse = false and fetcher.store.content = true
  (see the snippet below)
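  In nutch-site.xml this would be something like the following (just to show
  the property names, not tested against your setup):
    <property>
      <name>fetcher.parse</name>
      <value>false</value>
    </property>
    <property>
      <name>fetcher.store.content</name>
      <value>true</value>
    </property>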

Good luck,

Sebastian


On 10/15/2012 02:27 PM, Pierre wrote:
> Hi Tejas,
> 
> So all URLs are affected by the problem: they are all fetched 3 or 4 times
> during the crawl. I did not change any fetch interval and I didn't see any
> exceptions.
> 
> I did another test; before the test I deleted all the records from the
> webpage table.
> 
> I ran : "bin/nutch crawl seed/ -depth 5 -topN 10000" with seed url 
> http://serphacker.com/crawltest/
> 
> The Apache logs of the remote server: http://pastebin.com/tkMPmpuK
> The hadoop.log: http://pastebin.com/xRCuKQ5g
> The id and status columns of the webpage table at the end of the crawl:
> http://pastebin.com/ZVUC5As5
> The nutch-site.xml: http://pastebin.com/WD5Cyyin
> The regex URL filter: +https?://.*serphacker\.com/crawltest/
> nutch-default.xml was not edited.
> 
> 
> 
> On 13/10/2012 20:50, Tejas Patil wrote:
>> Hi Pierre,
>>
>> Can you supply some additional information:
>>
>> 1. What is the status of that URL now? If, say, it is unfetched in the first
>> round, it will be considered again in the 2nd round and so on. Maybe there
>> is something about that URL which causes an exception and makes Nutch
>> re-try it in all subsequent rounds.
>>
>> 2. I guess you have not modified the fetch interval for URLs. Typically it is
>> set to 30 days, but if it is changed to, say, 4 seconds, that URL becomes
>> eligible to be fetched again in the very next round.
>>
>> 3. Did you observe any exceptions in any of the logs? Please share those.
>>
>> Thanks,
>> Tejas
>>
>> On Sat, Oct 13, 2012 at 10:07 AM, Pierre Nogues <pi...@hotmail.it> wrote:
>>
>>>
>>> Hello,
>>>
>>> I'm using Nutch 2.1 with MySQL, and when I do a simple "bin/nutch crawl
>>> seed/ -depth 5 -topN 10000", I noticed that Nutch fetches the same URL
>>> 3 or 4 times during the crawl. Why?
>>>
>>> I just configured Nutch to crawl only one website (restriction in
>>> regex-urlfilter); everything else looks OK in MySQL.
>>>
>>> nutch-site.xml: http://pastebin.com/Mx9s5Kfz
>>>
>>>
>>>
>>
