Hello,
I'm using Nutch 2.1 on Linux and I'm having a problem similar to the one
described at http://goo.gl/nrDLV: Nutch keeps fetching the same pages at each round.
I've built a simple localhost site with 3 pages linked to each other:
first.htm -> second.htm -> third.htm
I did these steps:
- downloaded Nutch 2.1 (source) and untarred it to ${TEMP_NUTCH}
- edited ${TEMP_NUTCH}/ivy/ivy.xml, uncommenting the line for the MySQL
backend (thanks to [1]; the line is quoted after this list)
- edited ${TEMP_NUTCH}/conf/gora.properties, removing the default SQL
configuration and adding the MySQL properties (thanks to [1]; the
properties are quoted after this list)
- ran "ant runtime" from ${TEMP_NUTCH}
- moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
- edited ${NUTCH_HOME}/conf/nutch-site.xml, adding http.agent.name and
http.robots.agents, and setting db.ignore.external.links to true and
fetcher.server.delay to 0.0 (relevant excerpt after this list)
- created ${NUTCH_HOME}/urls/seed.txt containing only
"http://localhost/test/first.htm"
- created the db table as described in [1]
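
In case it matters, the ivy.xml change was only uncommenting the MySQL
connector dependency; from memory it looks roughly like this (the
revision number may differ in your copy):

    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18"
        conf="*->default"/>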
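
The gora.properties entries are essentially the ones from [1]; roughly
this (user and password are placeholders here, and the
createDatabaseIfNotExist flag is just what I happen to use):

    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
    gora.sqlstore.jdbc.user=<user>
    gora.sqlstore.jdbc.password=<password>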
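
The full nutch-site.xml is in the pastebin linked below; the relevant
part is roughly this (agent name value omitted here):

    <property>
      <name>http.agent.name</name>
      <value>...</value>
    </property>
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>0.0</value>
    </property>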
Then I ran "bin/nutch crawl urls -threads 1" and got:
first.htm was fetched 5 times
second.htm was fetched 4 times
third.htm was fetched 3 times
I also tried running each step separately (inject, generate, ...) and got
the same results; the sequence I used is sketched below.
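By "each step separately" I mean roughly this sequence (written from
memory, so the exact batch-id arguments may have been slightly different):

    bin/nutch inject urls
    bin/nutch generate -topN 10
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb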
Also, the whole process takes about 2 minutes; am I missing some delay
configuration, or is this normal?
Some extra info:
- HTML of the pages: http://pastebin.com/dyDPJeZs
- Hadoop log: http://pastebin.com/rwQQPnkE
- nutch-site.xml: http://pastebin.com/0WArkvh5
- Wireshark log: http://pastebin.com/g4Bg17Ls
- MySQL table: http://pastebin.com/gD2SvGsy
[1] http://nlp.solutions.asia/?p=180