The quick responses are very encouraging. Thanks, Tejas. As I mentioned earlier, I did in fact run it step by step.
First I ran the inject command, then readdb with the dump option, and did not see anything in the dump files, which leads me to believe that the inject did not work. I verified regex-urlfilter and made sure that my URL is not getting filtered.

I agree that the second link is about configuring HBase as a storage DB. However, I do not have HBase installed and don't foresee getting it installed any time soon, so using HBase for storage is not an option. I am going to have to stick with Gora and the memory store.

On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:

> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
>
> > Thanks for the response Lewis.
> > I did read these links, I mostly followed the first link and tried both
> > the 3.2 and 3.3 sections. Using the bin/crawl gave me null pointer
> > exception on solr, so I figured that I should first deal with getting
> > the crawl part to work and then deal with solr indexing. Hence I went
> > back to trying it stepwise.
>
> You should try running the crawl using individual commands and see where
> the problem is. The nutch tutorial which Lewis pointed you to had those
> commands. Even peeking into the bin/crawl script would also help as it
> calls the nutch commands.
>
> > As for the second link, it is more about using HBase as store instead
> > of gora. This is not really an option for me yet, cause my grid does not
> > have hbase installed yet. Getting it done is not much under my control
>
> HBase is one of the datastores supported by Apache Gora. That tutorial
> speaks about how to configure Nutch (actually Gora) to use HBase as a
> backend. So, it's wrong to say that the tutorial was about HBase and not
> Gora.
>
> > the FAQ link is the one I had not gone through until I checked your
> > response, but I do not find answers to any of my questions
> > (directly/indirectly) in it.
>
> Ok
>
> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > Hi Hemant,
> > > I strongly advise you to take some time to look through the Nutch
> > > Tutorial for 1.x and 2.x.
> > > http://wiki.apache.org/nutch/NutchTutorial
> > > http://wiki.apache.org/nutch/Nutch2Tutorial
> > > Also please see the FAQ's, which you will find very very useful.
> > > http://wiki.apache.org/nutch/FAQ
> > >
> > > Thanks
> > > Lewis
> > >
> > > On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
> > >
> > > > Hi,
> > > > I am first time user of nutch. I installed
> > > > nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a single
> > > > webpage.
> > > >
> > > > I am running nutch step by step. These are the problems I came
> > > > across -
> > > >
> > > > 1. Inject did not work, i.e. the url does not reflect in the
> > > > webdb(gora-memstore). The way I verify this is after running
> > > > inject, I run readdb with dump. This created a directory in hdfs
> > > > with 0 size part file.
> > > >
> > > > 2. config files - This confused me a lot. When run from deploy
> > > > directory, does nutch use the config files from local/conf? Changes
> > > > made to local/conf/nutch-site.xml did not take effect after editing
> > > > this file. I had to edit this in order to get rid of the
> > > > 'http.agent.name' error. I finally ended up hard-coding this in the
> > > > code, rebuilding and running to keep going forward.
> > > >
> > > > 3. how to interpret readdb - Running readdb -stats, shows a lot of
> > > > output but I do not see my url from seed.txt in there. So I do not
> > > > know if the entry in webdb actually reflects my seed.txt at all or
> > > > not.
> > > >
> > > > 4. logs - When nutch is run from the deploy directory, the
> > > > logs/hadoop.log is not generated anymore, not locally, nor on the
> > > > grid. I tried to make it verbose by changing log4j.properties to
> > > > DEBUG, but still had no file generated.
> > > >
> > > > Any help with this would help me move forward with nutch.
> > > >
> > > > Regards
> > > > Hemant
> > >
> > > --
> > > *Lewis*
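
The dump check described in this thread (run inject, run readdb with dump, look at the size of the part files) could be scripted roughly as follows. This is only a sketch under assumptions: it presumes the dump directory has already been copied from HDFS to local disk, that the output files are named with a `part-` prefix, and the helper names `summarize_dump` and `dump_has_records` and the default directory `dump_out` are made up for illustration. They are not part of Nutch.

```python
from pathlib import Path
import sys


def summarize_dump(dump_dir):
    """Map each part file name in dump_dir to its size in bytes.

    Assumes Hadoop-style output names such as part-r-00000.
    """
    return {
        p.name: p.stat().st_size
        for p in sorted(Path(dump_dir).glob("part-*"))
        if p.is_file()
    }


def dump_has_records(dump_dir):
    """Return True if at least one part file is non-empty."""
    return any(size > 0 for size in summarize_dump(dump_dir).values())


if __name__ == "__main__":
    # "dump_out" is a hypothetical local copy of the readdb dump directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "dump_out"
    if not Path(target).is_dir():
        print(f"{target}: not a directory")
    else:
        for name, size in summarize_dump(target).items():
            print(f"{name}\t{size} bytes")
        print("dump has records:", dump_has_records(target))
```

An all-zero listing corresponds to the "0 size part file" symptom from problem 1. It only shows that the dump is empty, not why.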

