Ok, I did the steps manually and it worked. So the problem did come from the
crawl command.
I did set fetcher.store.content = false because I'm only interested in backlink
crawling.
So you are telling me that there is no way to run Nutch in an automatic way? If I want to
crawl a small part of the web, am I supposed to repeat the steps manually or write a script that
loops over generate/fetch/parse/updatedb? That doesn't sound good...
Is it planned to have a script that already handles this generate-fetch-parse-updatedb loop, with some
tweaks like a maximum crawl depth or a maximum crawl time?
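In the meantime, I suppose something like the rough loop below is what such a script would do. This is only a sketch based on the step-by-step tutorials you linked; the number of rounds stands in for the "maximum depth" tweak, and the command flags may need adjusting for a given Nutch 2.x setup.

  #!/bin/bash
  # Sketch of an automated inject + generate/fetch/parse/updatedb loop for Nutch 2.x.
  # Assumes it is run from the Nutch runtime directory and that seed/ holds the seed URLs.

  ROUNDS=5       # maximum number of rounds, i.e. the crawl "depth"
  TOPN=10000     # URLs generated per round

  bin/nutch inject seed/

  for ((i=1; i<=ROUNDS; i++)); do
    echo "--- round $i of $ROUNDS ---"
    bin/nutch generate -topN "$TOPN" || break   # bail out if the generate step fails
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb
  done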
On 15/10/2012 22:11, Sebastian Nagel wrote:
Hi Pierre,
I tried almost the same with just the default settings
(only the http-agent is set in nutch-site.xml: it's not Googlebot :-O).
Everything went ok; no documents were crawled twice.
I don't know exactly what went wrong
and didn't find a definitive hint in your logs. Some suggestions:
- the crawl command is deprecated, see
https://issues.apache.org/jira/browse/NUTCH-1087
- you should try to perform the steps
inject
generate
fetch
parse
updatedb
"by hand". This gives you more insights what is going on.
Repeat the steps generate, fetch, parse, updatedb as many times as needed.
There are many tutorials out there on how to crawl step by step, e.g.
http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html
And finally, of course, the official tutorial, though (sorry) it's rather short:
http://wiki.apache.org/nutch/Nutch2Tutorial
- set fetcher.parse = false and fetcher.store.content = true
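For example, the two properties would look roughly like this in nutch-site.xml (just a sketch):

  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>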
Good luck,
Sebastian
On 10/15/2012 02:27 PM, Pierre wrote:
Hi Tejas,
So all URLs are affected by the problem: they are all fetched 3 or 4 times
during the crawl. I did not edit any fetch interval and I didn't see any exceptions.
I ran another test; before the test I deleted all the records from the webpage
table.
I ran : "bin/nutch crawl seed/ -depth 5 -topN 10000" with seed url
http://serphacker.com/crawltest/
The apache logs of the remote server : http://pastebin.com/tkMPmpuK
The hadoop.log : http://pastebin.com/xRCuKQ5g
The id,status of the webpage table at the end of the crawl :
http://pastebin.com/ZVUC5As5
The nutch-site.xml : http://pastebin.com/WD5Cyyin
The regex url filter : +https?://.*serphacker\.com/crawltest/
nutch-default.xml not edited
On 13/10/2012 20:50, Tejas Patil wrote:
Hi Pierre,
Can you supply some additional information:
1. What is the status of that URL now? If, say, it is unfetched in the first
round, then it will be considered again in the 2nd round and so on. Maybe there
is something about that URL which causes an exception, so Nutch retries it
in all subsequent rounds.
2. I guess you have not modified the fetch interval for the URLs. Typically it is
set to 30 days, but if a user changes it to, say, 4 seconds, that URL becomes
eligible to be fetched again in the very next round (see the snippet after this list).
3. Did you observe any exceptions in any logs? Please share those.
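For reference, the interval is controlled by db.fetch.interval.default; this is roughly how the stock entry looks (a sketch of the nutch-default.xml setting, 2592000 seconds = 30 days):

  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
    <!-- 30 days; a much smaller value makes URLs eligible for re-fetch in the next round -->
  </property>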
Thanks,
Tejas
On Sat, Oct 13, 2012 at 10:07 AM, Pierre Nogues <[email protected]> wrote:
Hello,
I'm using Nutch 2.1 with MySQL, and when I do a simple "bin/nutch crawl
seed/ -depth 5 -topN 10000", I notice that Nutch fetches the same URL 3 or 4
times during the crawl. Why?
I just configured Nutch to crawl a single website (restriction in
regex-urlfilter); everything else looks ok in MySQL.
nutch-site.xml: http://pastebin.com/Mx9s5Kfz