What is the datastore in gora.properties?
http://wiki.apache.org/nutch/Nutch2Tutorial
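For reference, in a Nutch 2.x checkout the default datastore is declared in conf/gora.properties. A minimal sketch (the `gora.datastore.default` key is the one Gora consults; the MemStore and AvroStore class names are the ones discussed in this thread):

```properties
# conf/gora.properties -- tells Gora which datastore Nutch persists to.
# Common default in 2.x setups (test-oriented, not thread safe):
gora.datastore.default=org.apache.gora.memory.store.MemStore

# To write through the Avro store instead, as suggested below:
# gora.datastore.default=org.apache.gora.avro.store.AvroStore
```

Note that storage.data.store.class in nutch-site.xml, when set, is what Nutch reports in the InjectorJob log line quoted below.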
On Wed, Jun 26, 2013 at 11:37 PM, h b <[email protected]> wrote:
> Here is an example of what I am saying about the config changes not
> taking effect.
>
> cd runtime/deploy
> cat ../local/conf/nutch-site.xml
> ......
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.apache.gora.avro.store.AvroStore</value>
> </property>
> .....
>
> cd ../..
> ant job
>
> cd runtime/deploy
> bin/nutch inject urls -crawlId crawl1
> .....
> 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
> org.apache.gora.memory.store.MemStore as the Gora storage class.
> .....
>
> So nutch-site.xml was changed to use AvroStore as the storage class, the
> job was rebuilt, and I reran inject, the output of which still shows that
> it is trying to use MemStore.
>
> On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > The Gora MemStore was introduced to deal predominantly with test
> > scenarios. This is justified as the 2.x code is pulled nightly and after
> > every commit and tested.
> > It is not thread safe and should not be used (until we fix some issues)
> > for any kind of serious deployment.
> > From your inject task on the job tracker, you will be able to see
> > 'urls_injected' counters which represent the number of urls actually
> > persisted through Gora into the datastore.
> > I understand that HBase is not an option. Gora should also support
> > writing the output into Avro sequence files... which can be pumped into
> > HDFS. We have done some work on this, so I suppose that right now is as
> > good a time as any for you to try it out.
> > Use org.apache.gora.avro.store.AvroStore as the default datastore, I
> > think. You can double-check by looking into gora.properties.
> > As a note, you should use nutch-site.xml within the top-level conf
> > directory for all your Nutch configuration.
> > You should then create a new job jar for use in hadoop by calling
> > 'ant job' after the changes are made.
> > hth
> > Lewis
> >
> > On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> > > The quick responses flowing are very encouraging. Thanks, Tejas.
> > > Tejas, as I mentioned earlier, I did in fact run it step by step.
> > >
> > > So first I ran the inject command and then readdb with the dump
> > > option, and did not see anything in the dump files; that leads me to
> > > say that the inject did not work. I verified the regex-urlfilter and
> > > made sure that my url is not getting filtered.
> > >
> > > I agree that the second link is about configuring HBase as a storage
> > > DB. However, I do not have HBase installed and don't foresee getting
> > > it installed any time soon, hence using HBase for storage is not an
> > > option, so I am going to have to stick to Gora with the memory store.
> > >
> > > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <
> > > [email protected]> wrote:
> > >
> > > > On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
> > > >
> > > > > Thanks for the response, Lewis.
> > > > > I did read these links; I mostly followed the first link and tried
> > > > > both the 3.2 and 3.3 sections. Using bin/crawl gave me a null
> > > > > pointer exception on solr, so I figured that I should first deal
> > > > > with getting the crawl part to work and then deal with solr
> > > > > indexing. Hence I went back to trying it stepwise.
> > > >
> > > > You should try running the crawl using individual commands and see
> > > > where the problem is. The nutch tutorial which Lewis pointed you to
> > > > had those commands. Even peeking into the bin/crawl script would
> > > > also help, as it calls the nutch commands.
> > > >
> > > > > As for the second link, it is more about using HBase as the store
> > > > > instead of gora.
> > > > > This is not really an option for me yet, because my grid does not
> > > > > have HBase installed yet. Getting that done is not much under my
> > > > > control.
> > > >
> > > > HBase is one of the datastores supported by Apache Gora. That
> > > > tutorial speaks about how to configure Nutch (actually Gora) to use
> > > > HBase as a backend. So, it's wrong to say that the tutorial was
> > > > about HBase and not Gora.
> > > >
> > > > > The FAQ link is the one I had not gone through until I checked
> > > > > your response, but I do not find answers to any of my questions
> > > > > (directly/indirectly) in it.
> > > >
> > > > Ok
> > > >
> > > > > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Hi Hemant,
> > > > > > I strongly advise you to take some time to look through the
> > > > > > Nutch Tutorial for 1.x and 2.x.
> > > > > > http://wiki.apache.org/nutch/NutchTutorial
> > > > > > http://wiki.apache.org/nutch/Nutch2Tutorial
> > > > > > Also please see the FAQs, which you will find very, very useful.
> > > > > > http://wiki.apache.org/nutch/FAQ
> > > > > >
> > > > > > Thanks
> > > > > > Lewis
> > > > > >
> > > > > > On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > I am a first-time user of nutch. I installed nutch (2.2),
> > > > > > > solr (4.3), and hadoop (0.20), and got started crawling a
> > > > > > > single webpage.
> > > > > > >
> > > > > > > I am running nutch step by step. These are the problems I came
> > > > > > > across:
> > > > > > >
> > > > > > > 1. Inject did not work, i.e. the url does not show up in the
> > > > > > > webdb (gora memstore). The way I verify this is: after running
> > > > > > > inject, I run readdb with dump. This created a directory in
> > > > > > > hdfs with a 0-size part file.
> > > > > > >
> > > > > > > 2. config files - This confused me a lot.
> > > > > > > When run from the deploy directory, does nutch use the config
> > > > > > > files from local/conf? Changes made to
> > > > > > > local/conf/nutch-site.xml did not take effect after editing
> > > > > > > this file. I had to edit it in order to get rid of the
> > > > > > > 'http.agent.name' error. I finally ended up hard-coding this
> > > > > > > in the code, rebuilding, and running to keep going forward.
> > > > > > >
> > > > > > > 3. how to interpret readdb - Running readdb -stats shows a lot
> > > > > > > of output, but I do not see my url from seed.txt in there. So
> > > > > > > I do not know whether the entry in webdb actually reflects my
> > > > > > > seed.txt at all or not.
> > > > > > >
> > > > > > > 4. logs - When nutch is run from the deploy directory,
> > > > > > > logs/hadoop.log is not generated anymore, not locally, nor on
> > > > > > > the grid. I tried to make

--
*Lewis*
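[Editorial note] The symptom reported above (rebuilding with 'ant job' yet inject still logging MemStore) ties back to Lewis's advice: the deploy job jar is built from the top-level conf directory, so edits under runtime/local/conf are not what gets baked in. A quick standalone sketch for checking which storage class a given nutch-site.xml actually declares (plain Python, not part of Nutch; the property layout mirrors the fragment quoted earlier in the thread):

```python
import xml.etree.ElementTree as ET

def storage_class(conf_xml):
    """Return the <value> of the storage.data.store.class property
    from a Hadoop-style <configuration> document, or None if absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "storage.data.store.class":
            return prop.findtext("value")
    return None

# Sample mirroring the nutch-site.xml fragment quoted in the thread:
sample = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.avro.store.AvroStore</value>
  </property>
</configuration>"""

print(storage_class(sample))  # org.apache.gora.avro.store.AvroStore
```

Running this against the nutch-site.xml extracted from the built job jar (rather than the one you edited) shows which setting Hadoop will actually see.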

