On Wed, Jun 26, 2013 at 10:26 PM, h b <[email protected]> wrote:

> The quick responses flowing are very encouraging. Thanks Tejas.
> Tejas, as I mentioned earlier, I actually ran it step by step.
>
> So first I ran the inject command and then readdb with the dump option and
> did not see anything in the dump files, which leads me to say that the
> inject did not work. I verified the regex-urlfilter and made sure that my
> URL is not getting filtered.
>

And you see nothing interesting in the logs. Oh boy... If this happens without any config changes over the distribution (apart from http.agent.name), it should have been reported by now. You might set the loggers to a lower level to get more details. My feeling is that the most likely cause is a buggy datastore.
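
For the logger suggestion, here is a minimal sketch of what could be added to conf/log4j.properties, assuming the stock log4j 1.x setup. The package-level logger names are my assumption, picked for illustration; narrow them to the classes you care about.

```properties
# Assumption: log4j 1.x syntax. Package-level loggers inherit the appenders
# already configured on the root logger in conf/log4j.properties.
log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.gora=DEBUG
```

As far as I know, when the job runs from the deploy directory the output ends up in the Hadoop task logs rather than in logs/hadoop.log, so that is probably the place to look after changing the level.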
> I agree that the second link is about configuring HBase as a storageDB.
> However, I do not have HBase installed and don't foresee getting it
> installed any sooner, hence using HBase for storage is not an option, so I
> am going to have to stick to Gora with memory store.
>

OK. There were JIRA issues logged about the memory store not working correctly (in reference to failing JUnit tests). Lewis or Renato might know more about it. To be honest, I doubt anybody out there is actually using MemStore. HBase seems to be the most popular backend.

> On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:
>
> > On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
> >
> > > Thanks for the response Lewis.
> > > I did read these links, I mostly followed the first link and tried both
> > > the 3.2 and 3.3 sections. Using the bin/crawl gave me a null pointer
> > > exception on solr, so I figured that I should first deal with getting
> > > the crawl part to work and then deal with solr indexing. Hence I went
> > > back to trying it stepwise.
> > >
> >
> > You should try running the crawl using individual commands and see where
> > the problem is. The nutch tutorial which Lewis pointed you to had those
> > commands. Even peeking into the bin/crawl script would also help as it
> > calls the nutch commands.
> >
> > >
> > > As for the second link, it is more about using HBase as store instead
> > > of gora. This is not really an option for me yet, because my grid does
> > > not have hbase installed yet. Getting it done is not much under my
> > > control.
> > >
> >
> > HBase is one of the datastores supported by Apache Gora. That tutorial
> > speaks about how to configure Nutch (actually Gora) to use HBase as a
> > backend. So, it's wrong to say that the tutorial was about HBase and not
> > Gora.
> >
> > >
> > > The FAQ link is the one I had not gone through until I checked your
> > > response, but I do not find answers to any of my questions
> > > (directly/indirectly) in it.
> > >
> >
> > Ok
> >
> > >
> > > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> > > [email protected]> wrote:
> > >
> > > > Hi Hemant,
> > > > I strongly advise you to take some time to look through the Nutch
> > > > Tutorial for 1.x and 2.x.
> > > > http://wiki.apache.org/nutch/NutchTutorial
> > > > http://wiki.apache.org/nutch/Nutch2Tutorial
> > > > Also please see the FAQs, which you will find very useful.
> > > > http://wiki.apache.org/nutch/FAQ
> > > >
> > > > Thanks
> > > > Lewis
> > > >
> > > >
> > > > On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > > I am a first-time user of nutch. I installed
> > > > > nutch(2.2)/solr(4.3)/hadoop(0.20) and got started to crawl a single
> > > > > webpage.
> > > > >
> > > > > I am running nutch step by step. These are the problems I came
> > > > > across -
> > > > >
> > > > > 1. Inject did not work, i.e. the url does not reflect in the
> > > > > webdb(gora-memstore). The way I verify this is after running
> > > > > inject, I run readdb with dump. This created a directory in hdfs
> > > > > with a 0 size part file.
> > > > >
> > > > > 2. config files - This confused me a lot. When run from the deploy
> > > > > directory, does nutch use the config files from local/conf? Changes
> > > > > made to local/conf/nutch-site.xml did not take effect after editing
> > > > > this file. I had to edit this in order to get rid of the
> > > > > 'http.agent.name' error. I finally ended up hard-coding this in the
> > > > > code, rebuilding and running to keep going forward.
> > > > >
> > > > > 3. how to interpret readdb - Running readdb -stats shows a lot of
> > > > > output but I do not see my url from seed.txt in there. So I do not
> > > > > know if the entry in webdb actually reflects my seed.txt at all or
> > > > > not.
> > > > >
> > > > > 4. logs - When nutch is run from the deploy directory, the
> > > > > logs/hadoop.log is not generated anymore, not locally, nor on the
> > > > > grid. I tried to make it verbose by changing log4j.properties to
> > > > > DEBUG, but still had no file generated.
> > > > >
> > > > > Any help with this would help me move forward with nutch.
> > > > >
> > > > > Regards
> > > > > Hemant
> > > > >
> > > >
> > > > --
> > > > *Lewis*
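
On item 2 of the original mail (which config files are in effect): a minimal sketch of one way to check, assuming that in deploy mode the configuration is read from the copy packed into the .job archive rather than from local/conf. I have not confirmed that for this setup, so treat it as an assumption. The helper names read_property and read_property_from_job are hypothetical, as is the default member name; the script only relies on the .job file being a zip archive and on nutch-site.xml using the usual property/name/value layout.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_property(site_xml_bytes, name):
    """Return the <value> of the named <property> in a Hadoop-style
    configuration document, or None if the property is not present."""
    root = ET.fromstring(site_xml_bytes)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def read_property_from_job(job_path, name, member="nutch-site.xml"):
    """Look up a property in a config file packed inside a .job (zip) archive.

    Assumption: the archive holds the config file at the path given by
    `member`. A KeyError is raised if that member does not exist.
    """
    with zipfile.ZipFile(job_path) as job:
        return read_property(job.read(member), name)
```

Calling read_property_from_job with the path to the job file in the deploy directory and "http.agent.name" should show whether the value edited in local/conf actually made it into the archive. If it returns None, rebuilding the job after editing the source conf directory is probably the step that is missing.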

