> Ok, I did the steps manually and it worked. So the problem did come from
> the crawl command.
It's not the crawl command alone. It worked for me.
Can you try with a minimal nutch-site.xml?
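For reference, a minimal nutch-site.xml can be as small as this;
http.agent.name is the one property Nutch requires, and the value below is
just a placeholder:

  <?xml version="1.0"?>
  <configuration>
    <!-- The only mandatory setting: identify your crawler to the sites you fetch. -->
    <property>
      <name>http.agent.name</name>
      <value>MyTestCrawler</value>
    </property>
  </configuration>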
> Is it planned to have a script that already handles this
> generate-fetch-parse-updatedb loop, with some tweaks?
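In the meantime the loop is easy to script yourself. A rough sketch of the
Nutch 2.x cycle, with DEPTH and TOPN as placeholder values ("-all" simply
processes every pending batch; alternatively, capture the batch id that
generate prints and pass it to fetch and parse):

  #!/bin/bash
  # Inject the seeds once, then repeat the generate/fetch/parse/updatedb
  # cycle DEPTH times, fetching at most TOPN pages per round.
  DEPTH=5
  TOPN=1000
  bin/nutch inject seed/
  for ((i = 1; i <= DEPTH; i++)); do
    bin/nutch generate -topN $TOPN
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb
  done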
Ok, I did the steps manually and it worked. So the problem did come from
the crawl command.
I did set fetch.store.content = false because I'm only interested in
backlink crawling.
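In nutch-site.xml terms, that setting is:

  <property>
    <name>fetch.store.content</name>
    <value>false</value>
    <!-- don't store fetched page content in the webpage table -->
  </property>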
So you are telling me that there is no way to run nutch automatically? If I
want to do a crawl of a small par
Hi Pierre,
I tried almost the same with just the default settings
(only the http-agent is set in nutch-site.xml: it's not Googlebot :-O).
All went ok, no documents were crawled twice.
I don't know what exactly went wrong
and didn't find a definitive hint in your logs. Some suggestions:
- the craw
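One quick way to confirm the duplicate fetches is to count the "fetching"
lines in the fetcher log, assuming the default logs/hadoop.log location:

  # extract the "fetching <url>" fragments and count how often each
  # URL appears, most-fetched first
  grep -o "fetching .*" logs/hadoop.log | sort | uniq -c | sort -rn | head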
Hi Tejas,
All URLs are affected by the problem: they are all fetched 3 or 4 times
during the crawl. I did not edit any fetch interval, and I didn't see any
exceptions.
I did another test; before it I deleted all the records from the webpage
table.
I ran: "bin/nutch crawl seed/ -depth 5 -top
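Deleting the records amounts to something like this in MySQL, assuming the
default webpage table from the gora-sql mapping:

  -- wipe the crawl state so every URL starts out unfetched again
  TRUNCATE TABLE webpage;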
Hi Pierre,
Can you supply some additional information?
1. What is the status of that url now? If, say, it is un-fetched in the
first round, then it will be considered again in the 2nd round, and so on.
Maybe there is something with that url which causes some exception, so it is
re-tried by nutch in all s
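You can inspect the status with the readdb tool; the exact flags may differ
slightly between 2.x versions, and the URL below is a placeholder:

  # dump the stored row for one URL
  bin/nutch readdb -url http://www.example.com/page
  # or get the count of pages per status
  bin/nutch readdb -stats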
Hello,
I'm using nutch 2.1 with MySQL, and when I do a simple "bin/nutch crawl
seed/ -depth 5 -topN 1", I notice nutch fetches the same URL 3 or 4 times
during the crawl. Why?
I just configured nutch to locally crawl a website (restriction in
regex-urlfilter); everything else looks ok on mys
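For reference, a restrict-to-one-host setup in conf/regex-urlfilter.txt
typically looks like this, with www.example.com standing in for the real
site:

  # accept anything on the target host
  +^http://www\.example\.com/
  # reject everything else (rules are applied top-down, first match wins)
  -.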