Hello,

I'm using Nutch 2.1 on Linux and I'm having a problem similar to http://goo.gl/nrDLV: my Nutch fetches the same pages again at each round.

I've built a simple localhost site with 3 pages linked to each other:
first.htm -> second.htm -> third.htm

I did these steps:

- downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
- edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql backend (thanks to [1])
- edited ${TEMP_NUTCH}/conf/gora.properties removing default sql configuration and adding mysql properties (thanks to [1])
- ran "ant runtime" from ${TEMP_NUTCH}
- moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
- edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name, http.robots.agents and changing db.ignore.external.links to true and fetcher.server.delay to 0.0
- created ${NUTCH_HOME}/urls/seed.txt with "http://localhost/test/first.htm" inside this file
- created the db table as described in [1]
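
As a rough sketch, the nutch-site.xml edits listed above would look something like the fragment below. The agent name "MyTestCrawler" is only a placeholder for illustration, and the http.robots.agents value is an assumed typical form; the actual file is the one linked under the extra info.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder agent name (hypothetical, for illustration only) -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
  <!-- Assumed form: the agent name followed by the wildcard -->
  <property>
    <name>http.robots.agents</name>
    <value>MyTestCrawler,*</value>
  </property>
  <!-- Stay on the seed host: ignore links pointing to other hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <!-- No politeness delay between requests to the same server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
  </property>
</configuration>
```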

Then I ran "bin/nutch crawl urls -threads 1"

first.htm was fetched 5 times
second.htm was fetched 4 times
third.htm was fetched 3 times

I tried doing each step separately (inject, generate, ...) with the same results.

Also, the whole process takes about 2 minutes. Am I missing some delay setting, or is this normal?

Some extra info:

- HTML of the pages: http://pastebin.com/dyDPJeZs
- Hadoop log: http://pastebin.com/rwQQPnkE
- nutch-site.xml: http://pastebin.com/0WArkvh5
- Wireshark log: http://pastebin.com/g4Bg17Ls
- MySQL table: http://pastebin.com/gD2SvGsy

[1] http://nlp.solutions.asia/?p=180
