Hi Hemant,

I strongly advise you to take some time to look through the Nutch tutorials for 1.x and 2.x:

http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Nutch2Tutorial

Also please see the FAQs, which you will find very useful:

http://wiki.apache.org/nutch/FAQ
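
On the 'http.agent.name' error from point 2 of your message: the tutorials set this as a property in conf/nutch-site.xml, so hard-coding it should not be necessary. A minimal sketch follows; the value "MyNutchSpider" is only a placeholder I made up, so substitute your own crawler name.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: overrides for the defaults in nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Placeholder value (an assumption, not from your setup). -->
    <value>MyNutchSpider</value>
  </property>
</configuration>
```

As far as I know, when running from the deploy directory the configuration is read from the packaged .job file and not from local/conf, so an edit to the source conf/ directory would need a rebuild (for example with `ant runtime`) before it takes effect. Please check this against your own build.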
Thanks
Lewis

On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
> Hi,
> I am a first-time user of Nutch. I installed
> nutch(2.2)/solr(4.3)/hadoop(0.20) and got started crawling a single
> webpage.
>
> I am running Nutch step by step. These are the problems I came across -
>
> 1. Inject did not work, i.e. the URL is not reflected in the
> webdb (gora-memstore). The way I verify this is, after running inject, I run
> readdb with dump. This created a directory in HDFS with a 0-size part file.
>
> 2. Config files - This confused me a lot. When run from the deploy directory,
> does Nutch use the config files from local/conf? Changes made to
> local/conf/nutch-site.xml did not take effect after editing this file. I
> had to edit this in order to get rid of the 'http.agent.name' error. I
> finally ended up hard-coding this in the code, rebuilding and running to
> keep going forward.
>
> 3. How to interpret readdb - Running readdb -stats shows a lot of output,
> but I do not see my URL from seed.txt in there. So I do not know whether the
> entry in the webdb actually reflects my seed.txt or not.
>
> 4. Logs - When Nutch is run from the deploy directory, logs/hadoop.log
> is not generated anymore, neither locally nor on the grid. I tried to make it
> verbose by changing log4j.properties to DEBUG, but still had no file
> generated.
>
> Any help with this would help me move forward with Nutch.
>
> Regards
> Hemant
>

-- 
*Lewis*

