> Ok, I did the steps manually and it worked. So the problem did come from the
> crawl command.
It's not the crawl command alone. It worked for me.
Can you try with a minimal nutch-site.xml?
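
For a first test something like this should do (the agent name below is
just a placeholder, and if you use a non-default storage backend keep its
settings as well -- this is only a sketch, not tested as pasted):

   <?xml version="1.0"?>
   <configuration>
     <property>
       <name>http.agent.name</name>
       <value>MyTestCrawler</value>
     </property>
   </configuration>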

> Is it planned to have a script that already handles this
> generate-fetch-parse-updatedb loop, with some tweaks like a maximum depth
> of the crawl or a maximum time for the crawl?
Have a look at the patches attached to NUTCH-1087; there is also a patch
for 2.x (but see Julien's comment: "needs testing"). If you could test it
and share your experience, that would help us a lot.
Of course, the script has an argument to limit the number of crawl cycles
(equivalent to -depth).
For a maximum crawl time, see the property fetcher.timelimit.mins (as a
rough equivalent).
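
Until NUTCH-1087 is committed, a small shell wrapper does the same job.
A rough sketch (not tested as pasted here; the seed directory, the cycle
limit and the exact sub-command options are assumptions -- check
"bin/nutch <command>" for the usage of your version):

   #!/bin/bash
   # Rough crawl loop for Nutch 2.x -- a sketch, not a drop-in script.
   SEEDDIR=seed/   # directory with the seed URL list (assumed)
   ROUNDS=5        # max. number of generate/fetch/parse/updatedb cycles
   TOPN=10000      # max. URLs generated per cycle

   bin/nutch inject $SEEDDIR

   for ((i=1; i<=ROUNDS; i++)); do
     echo "=== crawl cycle $i of $ROUNDS ==="
     bin/nutch generate -topN $TOPN
     bin/nutch fetch -all
     bin/nutch parse -all
     bin/nutch updatedb
   done

Note that fetcher.timelimit.mins only caps a single fetch cycle, not the
whole crawl; for a hard overall limit you would have to stop the loop from
outside (e.g. with timeout(1)).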

2012/10/16 Pierre <[email protected]>:
> Ok, I did the steps manually and it worked. So the problem did come from the
> crawl command.
>
> I did set fetch.store.content = false because I'm only interested in
> backlink crawling.
>
> So you are telling me that there is no way to run nutch in an automatic
> way? If I want to crawl a small part of the web, am I supposed to repeat
> the steps manually or write a script that loops over
> generate/fetch/parse/updatedb? It doesn't sound good...
>
> Is it planned to have a script that already handles this
> generate-fetch-parse-updatedb loop, with some tweaks like a maximum depth
> of the crawl or a maximum time for the crawl?
>
>
>
> On 15/10/2012 22:11, Sebastian Nagel wrote:
>>
>> Hi Pierre,
>>
>> I tried almost the same with just the default settings
>> (only the http-agent is set in nutch-site.xml: it's not Googlebot :-O).
>> All went ok; no documents were crawled twice.
>> I don't know what exactly went wrong,
>> and I didn't find a definitive hint in your logs. Some suggestions:
>>
>> - the crawl command is deprecated, see
>> https://issues.apache.org/jira/browse/NUTCH-1087
>>
>> - you should try to perform the steps
>>      inject
>>      generate
>>      fetch
>>      parse
>>      updatedb
>>    "by hand". This gives you more insights what is going on.
>>    Repeat the steps generate, fetch, parse, updatedb as many times as
>> needed.
>>    There are many tutorials out there how to crawl step-by-step, eg.
>>
>> http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html
>>    Finally, of course, but (sorry) it's rather short:
>>     http://wiki.apache.org/nutch/Nutch2Tutorial
>>
>> - set fetcher.parse = false and fetcher.store.content = true
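
In nutch-site.xml that would look roughly like this (property names as in
nutch-default.xml, values as suggested above):

      <property>
        <name>fetcher.parse</name>
        <value>false</value>
      </property>
      <property>
        <name>fetcher.store.content</name>
        <value>true</value>
      </property>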
>>
>> Good luck,
>>
>> Sebastian
>>
>>
>> On 10/15/2012 02:27 PM, Pierre wrote:
>>>
>>> Hi Tejas,
>>>
>>> So all URLs are affected by the problem: they are all fetched 3 or 4
>>> times during the crawl. I did not edit any fetch interval and I did not
>>> see any exception.
>>>
>>> I did another test; before the test I deleted all the records from the
>>> webpage table.
>>>
>>> I ran: "bin/nutch crawl seed/ -depth 5 -topN 10000" with the seed URL
>>> http://serphacker.com/crawltest/
>>>
>>> The Apache logs of the remote server: http://pastebin.com/tkMPmpuK
>>> The hadoop.log: http://pastebin.com/xRCuKQ5g
>>> The id,status of the webpage table at the end of the crawl:
>>> http://pastebin.com/ZVUC5As5
>>> The nutch-site.xml: http://pastebin.com/WD5Cyyin
>>> The regex URL filter: +https?://.*serphacker\.com/crawltest/
>>> nutch-default.xml not edited
>>>
>>>
>>>
>>> On 13/10/2012 20:50, Tejas Patil wrote:
>>>>
>>>> Hi Pierre,
>>>>
>>>> Can you supply some additional information:
>>>>
>>>> 1. What is the status of that URL now? If, say, it is unfetched in the
>>>> first round, then it will be considered again in the 2nd round, and so
>>>> on. Maybe there is something about that URL which causes an exception,
>>>> so it is retried by nutch in all subsequent rounds.
>>>>
>>>> 2. I guess you have not modified the fetch interval for URLs. Typically
>>>> it is set to 30 days, but if a user changes it to, say, 4 seconds, that
>>>> URL becomes eligible to be fetched again in the very next round.
>>>>
>>>> 3. Did you observe any exceptions in any of the logs? Please share those.
>>>>
>>>> Thanks,
>>>> Tejas
>>>>
>>>> On Sat, Oct 13, 2012 at 10:07 AM, Pierre Nogues <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> I'm using nutch 2.1 with mysql, and when I do a simple "bin/nutch crawl
>>>>> seed/ -depth 5 -topN 10000", I notice that nutch fetches the same URL
>>>>> 3 or 4 times during the crawl. Why?
>>>>>
>>>>> I just configured nutch to crawl one website locally (restricted in
>>>>> regex-urlfilter); everything else looks ok in mysql.
>>>>>
>>>>> nutch-site.xml: http://pastebin.com/Mx9s5Kfz
>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>>
>
