Thanks for the response, Lewis. I did read these links. I mostly followed the first one and tried both sections 3.2 and 3.3. Using bin/crawl gave me a null pointer exception on Solr, so I figured I should first get the crawl part working and then deal with Solr indexing. Hence I went back to trying it stepwise.
As for the second link, it is more about using HBase as the store instead of Gora. This is not really an option for me yet, because my grid does not have HBase installed, and getting it installed is not much under my control. The FAQ link is the one I had not gone through until I saw your response, but I do not find answers to any of my questions (directly or indirectly) in it.

On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Hemant,
> I strongly advise you to take some time to look through the Nutch Tutorial
> for 1.x and 2.x.
> http://wiki.apache.org/nutch/NutchTutorial
> http://wiki.apache.org/nutch/Nutch2Tutorial
> Also please see the FAQ's, which you will find very very useful.
> http://wiki.apache.org/nutch/FAQ
>
> Thanks
> Lewis
>
>
> On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
>
> > Hi,
> > I am first time user of nutch. I installed
> > nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a single
> > webpage.
> >
> > I am running nutch step by step. These are the problems I came across -
> >
> > 1. Inject did not work, i..e the url does not reflect in the
> > webdb(gora-memstore). The way I verify this is after running inject, i run
> > readdb with dump. This created a directory in hdfs with 0 size part file.
> >
> > 2. config files - This confused me a lot. When run from deploy directory,
> > does nutch use the config files from local/conf? Changes made to
> > local/conf/nutch-site.xml did not take effect after editing this file. I
> > had to edit this in order to get rid of the 'http.agent.name' error. I
> > finally ended up hard-coding this in the code, rebuilding and running to
> > keep going forward.
> >
> > 3. how to interpret readdb - Running readdb -stats, shows a lot out output
> > but I do not see my url from seed.txt in there. So I do not know if the
> > entry in webdb actually reflects my seed.txt at all or not.
> >
> > 4. logs - When nutch is run from the deploy directory, the logs/hadoop.log
> > is not generated anymore, not locally, nor on the grid. I tried to make it
> > verbose by changing log4j.properties to DEBUG, but still had not file
> > generated.
> >
> > Any help with this would help me move forward with nutch.
> >
> > Regards
> > Hemant
> >
>
>
> --
> *Lewis*
>

