Questions/issues with nutch

h b Wed, 26 Jun 2013 17:19:18 -0700

Hi,
I am a first-time user of Nutch. I installed Nutch (2.2), Solr (4.3), and
Hadoop (0.20) and started by crawling a single web page.


I am running Nutch step by step. These are the problems I came across:

1. Inject did not work, i.e. the URL does not show up in the
webdb (gora-memstore). To verify this, I ran readdb with dump after
running inject. This created a directory in HDFS containing a zero-size
part file.

2. Config files: this confused me a lot. When run from the deploy directory,
does Nutch use the config files from local/conf? Changes I made to
local/conf/nutch-site.xml did not take effect. I needed to edit this file
to get rid of the 'http.agent.name' error. I finally ended up hard-coding
the value in the code, rebuilding, and running that to keep going forward.

3. How to interpret readdb: running readdb -stats shows a lot of output,
but I do not see my URL from seed.txt in there. So I do not know whether
the entry in the webdb actually reflects my seed.txt.

4. Logs: when Nutch is run from the deploy directory, logs/hadoop.log
is no longer generated, neither locally nor on the grid. I tried to make
logging more verbose by changing log4j.properties to DEBUG, but still no
file was generated.
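
For reference on item 2, the 'http.agent.name' setting is normally supplied
as a property in nutch-site.xml. A minimal sketch, assuming the standard
Hadoop-style configuration layout; the value MyTestCrawler is a placeholder,
not the name actually used:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder agent name (an assumption for illustration only) -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
```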

Any help with this would let me move forward with Nutch.

Regards
Hemant
