Hi,
I am a first-time user of Nutch. I installed Nutch 2.2, Solr 4.3 and
Hadoop 0.20, and started by trying to crawl a single web page.

I am running Nutch step by step. These are the problems I came across:

1. Inject did not work, i.e. the URL does not show up in the webdb
(gora-memstore). To verify this, I ran readdb with dump after running
inject. This created a directory in HDFS containing a part file of size 0.

2. Config files: this confused me a lot. When run from the deploy
directory, does Nutch use the config files from local/conf? Changes I made
to local/conf/nutch-site.xml did not take effect. I needed to edit this
file to get rid of the 'http.agent.name' error. I finally ended up
hard-coding the value in the code, then rebuilding and rerunning, just to
keep moving forward.

3. How to interpret readdb: running readdb -stats shows a lot of output,
but I do not see my URL from seed.txt in there. So I do not know whether
the webdb actually contains an entry for my seed.txt URL.

4. Logs: when Nutch is run from the deploy directory, logs/hadoop.log is
no longer generated, neither locally nor on the grid. I tried to make the
logging verbose by changing log4j.properties to DEBUG, but still no file
was generated.
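
For point 2, the override I was trying to apply in nutch-site.xml was
along these lines. This is only a sketch: the value "MyTestCrawler" is a
placeholder for illustration, not the exact agent name I used.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Sets the crawler's agent name; "MyTestCrawler" is a placeholder value -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
```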

Any help with these issues would let me move forward with Nutch.

Regards
Hemant
