Hi, I am a first-time user of Nutch. I installed Nutch (2.2), Solr (4.3), and Hadoop (0.20) and started by crawling a single web page.
I am running Nutch step by step. These are the problems I came across:

1. Inject did not work, i.e. the URL does not show up in the webdb (gora-memstore). To verify this, I run readdb with dump after running inject. This creates a directory in HDFS containing a part file of size 0.

2. Config files. This confused me a lot. When run from the deploy directory, does Nutch use the config files from local/conf? Changes I made to local/conf/nutch-site.xml did not take effect. I needed to edit this file to get rid of the 'http.agent.name' error. I finally ended up hard-coding the value in the code, rebuilding, and running again in order to keep moving forward.

3. How to interpret readdb. Running readdb -stats shows a lot of output, but I do not see my URL from seed.txt in it. So I do not know whether the entry in the webdb actually reflects my seed.txt or not.

4. Logs. When Nutch is run from the deploy directory, logs/hadoop.log is no longer generated, neither locally nor on the grid. I tried to make logging verbose by changing log4j.properties to DEBUG, but still no file was generated.

Any help with this would let me move forward with Nutch.

Regards,
Hemant

