What is the datastore in gora.properties?
http://wiki.apache.org/nutch/Nutch2Tutorial
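For reference, in a Nutch 2.x checkout the default datastore is declared in conf/gora.properties. A minimal sketch (the `gora.datastore.default` key is the one Gora consults; the MemStore and AvroStore class names are the ones discussed in this thread):

```properties
# conf/gora.properties -- tells Gora which datastore Nutch persists to.
# Common default in 2.x setups (test-oriented, not thread safe):
gora.datastore.default=org.apache.gora.memory.store.MemStore

# To write through the Avro store instead, as suggested below:
# gora.datastore.default=org.apache.gora.avro.store.AvroStore
```

Note that storage.data.store.class in nutch-site.xml, when set, is what Nutch reports in the InjectorJob log line quoted below.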
On Wed, Jun 26, 2013 at 11:37 PM, h b <[email protected]> wrote:
> Here is an example of what I am saying about the config changes not
> taking effect.
>
> cd runtime/deploy
> cat ../local/conf/nutch-site.xml
> ......
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.apache.gora.avro.store.AvroStore</value>
> </property>
> .....
>
> cd ../..
> ant job
>
> cd runtime/deploy
> bin/nutch inject urls -crawlId crawl1
> .....
> 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
> org.apache.gora.memory.store.MemStore as the Gora storage class.
> .....
>
> So nutch-site.xml was changed to use AvroStore as the storage class, the
> job was rebuilt, and I reran inject, the output of which still shows that
> it is trying to use MemStore.
>
> On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > The Gora MemStore was introduced to deal predominantly with test
> > scenarios. This is justified as the 2.x code is pulled nightly and after
> > every commit and tested.
> > It is not thread safe and should not be used (until we fix some issues)
> > for any kind of serious deployment.
> > From your inject task on the job tracker, you will be able to see
> > 'urls_injected' counters which represent the number of urls actually
> > persisted through Gora into the datastore.
> > I understand that HBase is not an option. Gora should also support
> > writing the output into Avro sequence files... which can be pumped into
> > HDFS. We have done some work on this, so I suppose that right now is as
> > good a time as any for you to try it out.
> > Use org.apache.gora.avro.store.AvroStore as the default datastore, I
> > think. You can double-check by looking into gora.properties.
> > As a note, you should use nutch-site.xml within the top-level conf
> > directory for all your Nutch configuration.
> > You should then create a new job jar for use in hadoop by calling
> > 'ant job' after the changes are made.
> > hth
> > Lewis
> >
> > On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> > > The quick responses flowing are very encouraging. Thanks, Tejas.
> > > Tejas, as I mentioned earlier, I did in fact run it step by step.
> > >
> > > So first I ran the inject command and then readdb with the dump
> > > option, and did not see anything in the dump files; that leads me to
> > > say that the inject did not work. I verified the regex-urlfilter and
> > > made sure that my url is not getting filtered.
> > >
> > > I agree that the second link is about configuring HBase as a storage
> > > DB. However, I do not have HBase installed and don't foresee getting
> > > it installed any time soon, hence using HBase for storage is not an
> > > option, so I am going to have to stick to Gora with the memory store.
> > >
> > > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <
> > > [email protected]> wrote:
> > >
> > > > On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
> > > >
> > > > > Thanks for the response, Lewis.
> > > > > I did read these links; I mostly followed the first link and tried
> > > > > both the 3.2 and 3.3 sections. Using bin/crawl gave me a null
> > > > > pointer exception on solr, so I figured that I should first deal
> > > > > with getting the crawl part to work and then deal with solr
> > > > > indexing. Hence I went back to trying it stepwise.
> > > >
> > > > You should try running the crawl using individual commands and see
> > > > where the problem is. The nutch tutorial which Lewis pointed you to
> > > > had those commands. Even peeking into the bin/crawl script would
> > > > also help, as it calls the nutch commands.
> > > >
> > > > > As for the second link, it is more about using HBase as the store
> > > > > instead of gora.
> > > > > This is not really an option for me yet, because my grid does not
> > > > > have HBase installed yet. Getting that done is not much under my
> > > > > control.
> > > >
> > > > HBase is one of the datastores supported by Apache Gora. That
> > > > tutorial speaks about how to configure Nutch (actually Gora) to use
> > > > HBase as a backend. So, it's wrong to say that the tutorial was
> > > > about HBase and not Gora.
> > > >
> > > > > The FAQ link is the one I had not gone through until I checked
> > > > > your response, but I do not find answers to any of my questions
> > > > > (directly/indirectly) in it.
> > > >
> > > > Ok
> > > >
> > > > > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Hi Hemant,
> > > > > > I strongly advise you to take some time to look through the
> > > > > > Nutch Tutorial for 1.x and 2.x.
> > > > > > http://wiki.apache.org/nutch/NutchTutorial
> > > > > > http://wiki.apache.org/nutch/Nutch2Tutorial
> > > > > > Also please see the FAQs, which you will find very, very useful.
> > > > > > http://wiki.apache.org/nutch/FAQ
> > > > > >
> > > > > > Thanks
> > > > > > Lewis
> > > > > >
> > > > > > On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > I am a first-time user of nutch. I installed nutch (2.2),
> > > > > > > solr (4.3), and hadoop (0.20), and got started crawling a
> > > > > > > single webpage.
> > > > > > >
> > > > > > > I am running nutch step by step. These are the problems I came
> > > > > > > across:
> > > > > > >
> > > > > > > 1. Inject did not work, i.e. the url does not show up in the
> > > > > > > webdb (gora memstore). The way I verify this is: after running
> > > > > > > inject, I run readdb with dump. This created a directory in
> > > > > > > hdfs with a 0-size part file.
> > > > > > >
> > > > > > > 2. config files - This confused me a lot.
> > > > > > > When run from the deploy directory, does nutch use the config
> > > > > > > files from local/conf? Changes made to
> > > > > > > local/conf/nutch-site.xml did not take effect after editing
> > > > > > > this file. I had to edit it in order to get rid of the
> > > > > > > 'http.agent.name' error. I finally ended up hard-coding this
> > > > > > > in the code, rebuilding, and running to keep going forward.
> > > > > > >
> > > > > > > 3. how to interpret readdb - Running readdb -stats shows a lot
> > > > > > > of output, but I do not see my url from seed.txt in there. So
> > > > > > > I do not know whether the entry in webdb actually reflects my
> > > > > > > seed.txt at all or not.
> > > > > > >
> > > > > > > 4. logs - When nutch is run from the deploy directory,
> > > > > > > logs/hadoop.log is not generated anymore, not locally, nor on
> > > > > > > the grid. I tried to make

--
*Lewis*
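[Editorial note] The symptom reported above (rebuilding with 'ant job' yet inject still logging MemStore) ties back to Lewis's advice: the deploy job jar is built from the top-level conf directory, so edits under runtime/local/conf are not what gets baked in. A quick standalone sketch for checking which storage class a given nutch-site.xml actually declares (plain Python, not part of Nutch; the property layout mirrors the fragment quoted earlier in the thread):

```python
import xml.etree.ElementTree as ET

def storage_class(conf_xml):
    """Return the <value> of the storage.data.store.class property
    from a Hadoop-style <configuration> document, or None if absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "storage.data.store.class":
            return prop.findtext("value")
    return None

# Sample mirroring the nutch-site.xml fragment quoted in the thread:
sample = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.avro.store.AvroStore</value>
  </property>
</configuration>"""

print(storage_class(sample))  # org.apache.gora.avro.store.AvroStore
```

Running this against the nutch-site.xml extracted from the built job jar (rather than the one you edited) shows which setting Hadoop will actually see.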

