Hi Lewis,

Thanks for the details. One quick question: with memstore as the datastore, will the results be persisted across runs? That is, after injecting, where would the crawl datums get stored on disk so that the generate phase can pick them up? I believe that memstore won't do this and will give up everything once the process ends.
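For reference, the backend is selected in conf/gora.properties. A minimal sketch of the two relevant settings (class names taken from Gora's MemStore and HBase modules; double-check against the gora-*.jar shipped with your Nutch build):

```properties
# conf/gora.properties
# Default (volatile, in-memory only -- nothing survives the JVM exit):
gora.datastore.default=org.apache.gora.memory.store.MemStore

# Persistent alternative, requires a running HBase cluster and the
# gora-hbase dependency enabled in ivy/ivy.xml:
# gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```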
On Wed, Jun 26, 2013 at 11:06 PM, Tejas Patil <[email protected]> wrote:

> On Wed, Jun 26, 2013 at 10:26 PM, h b <[email protected]> wrote:
>
>> The quick responses flowing are very encouraging. Thanks Tejas.
>> Tejas, as I mentioned earlier, I did in fact run it step by step.
>>
>> So first I ran the inject command and then readdb with the dump option,
>> and did not see anything in the dump files; that leads me to say that the
>> inject did not work. I verified the regex-urlfilter and made sure that my
>> url is not getting filtered.
>>
> ... and you see nothing interesting in the logs. Oh boy... If this happens
> w/o any config changes over the distribution (apart from http.agent.name),
> then it should have been reported by now. You might set the loggers to a
> lower level to get more details. I have a feeling that the reason is most
> likely that the datastore used is buggy.
>
>> I agree that the second link is about configuring HBase as a storage DB.
>> However, I do not have HBase installed and don't foresee getting it
>> installed any time soon, hence using HBase for storage is not an option,
>> so I am going to have to stick to Gora with the memory store.
>>
> Ok. There were JIRAs logged regarding the memory store not working
> correctly (in reference to JUnit tests failing). Lewis / Renato might have
> more knowledge about it. Being honest, I doubt anybody out there is
> actually using memstore. HBase seems to be the most cheered backend.
>
>> On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:
>>
>>> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
>>>
>>>> Thanks for the response Lewis.
>>>> I did read these links; I mostly followed the first link and tried both
>>>> the 3.2 and 3.3 sections. Using bin/crawl gave me a null pointer
>>>> exception on Solr, so I figured that I should first deal with getting
>>>> the crawl part to work and then deal with Solr indexing. Hence I went
>>>> back to trying it stepwise.
>>>>
>>> You should try running the crawl using individual commands and see where
>>> the problem is. The Nutch tutorial which Lewis pointed you to has those
>>> commands. Even peeking into the bin/crawl script would also help, as it
>>> calls the nutch commands.
>>>
>>>> As for the second link, it is more about using HBase as the store
>>>> instead of Gora. This is not really an option for me yet, because my
>>>> grid does not have HBase installed yet. Getting it installed is not
>>>> really under my control.
>>>>
>>> HBase is one of the datastores supported by Apache Gora. That tutorial
>>> speaks about how to configure Nutch (actually Gora) to use HBase as a
>>> backend. So, it's wrong to say that the tutorial was about HBase and not
>>> Gora.
>>>
>>>> The FAQ link is the one I had not gone through until I checked your
>>>> response, but I do not find answers to any of my questions
>>>> (directly/indirectly) in it.
>>>>
>>> Ok
>>>
>>>> On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Hemant,
>>>>> I strongly advise you to take some time to look through the Nutch
>>>>> Tutorial for 1.x and 2.x.
>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>> http://wiki.apache.org/nutch/Nutch2Tutorial
>>>>> Also please see the FAQs, which you will find very useful.
>>>>> http://wiki.apache.org/nutch/FAQ
>>>>>
>>>>> Thanks
>>>>> Lewis
>>>>>
>>>>> On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am a first-time user of Nutch. I installed nutch (2.2), solr (4.3),
>>>>>> and hadoop (0.20) and got started crawling a single webpage.
>>>>>>
>>>>>> I am running Nutch step by step. These are the problems I came
>>>>>> across:
>>>>>>
>>>>>> 1. Inject did not work, i.e. the url does not show up in the webdb
>>>>>> (gora-memstore). The way I verify this is that after running inject,
>>>>>> I run readdb with dump. This created a directory in HDFS with a
>>>>>> zero-size part file.
>>>>>>
>>>>>> 2. Config files: this confused me a lot. When run from the deploy
>>>>>> directory, does nutch use the config files from local/conf? Changes
>>>>>> made to local/conf/nutch-site.xml did not take effect after editing
>>>>>> this file. I had to edit it in order to get rid of the
>>>>>> 'http.agent.name' error. I finally ended up hard-coding this in the
>>>>>> code, rebuilding, and running to keep going forward.
>>>>>>
>>>>>> 3. How to interpret readdb: running readdb -stats shows a lot of
>>>>>> output, but I do not see my url from seed.txt in there. So I do not
>>>>>> know whether the entry in the webdb actually reflects my seed.txt at
>>>>>> all.
>>>>>>
>>>>>> 4. Logs: when nutch is run from the deploy directory, logs/hadoop.log
>>>>>> is not generated anymore, neither locally nor on the grid. I tried to
>>>>>> make it verbose by changing log4j.properties to DEBUG, but still no
>>>>>> file was generated.
>>>>>>
>>>>>> Any help with this would help me move forward with nutch.
>>>>>>
>>>>>> Regards
>>>>>> Hemant
>>>>>
>>>>> --
>>>>> *Lewis*
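For anyone following along, the step-by-step cycle discussed above looks roughly like this. This is a sketch based on the 2.x tutorial; the crawl id "test1" and the "urls/" seed directory are example names, and the exact flags may differ between releases, so check the bin/crawl script in your own distribution for the authoritative invocation:

```shell
# Run from the runtime/local (or deploy) directory of a Nutch 2.x build.
# Assumes urls/seed.txt contains the seed URL and http.agent.name is set
# in conf/nutch-site.xml.
bin/nutch inject urls -crawlId test1
bin/nutch generate -topN 10 -crawlId test1
bin/nutch fetch -all -crawlId test1
bin/nutch parse -all -crawlId test1
bin/nutch updatedb -crawlId test1

# Inspect what actually landed in the web table after each step:
bin/nutch readdb -stats -crawlId test1
bin/nutch readdb -dump dump_dir -crawlId test1
```

Running readdb immediately after inject is also how to confirm whether the datastore kept anything; with MemStore, each bin/nutch invocation is a separate JVM, so data written by one command would not be visible to the next.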

