Here is an example of what I am saying about the config changes not taking
effect.
cd runtime/deploy
cat ../local/conf/nutch-site.xml
......
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.avro.store.AvroStore</value>
</property>
.....
cd ../..
ant job
cd runtime/deploy
bin/nutch inject urls -crawlId crawl1
.....
13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
org.apache.gora.memory.store.MemStore as the Gora storage class.
.....
So nutch-site.xml was changed to use AvroStore as the storage class, the job
was rebuilt, and I reran inject; its output still shows that it is using
MemStore.
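For what it's worth, one way to double-check which nutch-site.xml actually gets
packed into the job jar (a sketch; the .job file name depends on your build,
e.g. apache-nutch-2.2.job):
cd runtime/deploy
unzip -p apache-nutch-2.2.job nutch-site.xml | grep -A 1 storage.data.store.class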
On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
[email protected]> wrote:
> The Gora MemStore was introduced to deal predominantly with test scenarios.
> This is justified, as the 2.x code is pulled and tested nightly and after
> every commit.
> It is not thread safe and should not be used (until we fix some issues)
> for any kind of serious deployment.
> From your inject task on the job tracker, you will be able to see
> 'urls_injected' counters which represent the number of urls actually
> persisted through Gora into the datastore.
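> For example, you can look at the JobTracker web UI (typically port 50030 on
> Hadoop 0.20), or roughly from the command line (the job id below is just a
> placeholder):
> hadoop job -status job_201306270630_0001
> which should print the job counters along with the completion status.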
> I understand that HBase is not an option. Gora should also support writing
> the output into Avro sequence files... which can be pumped into hdfs. We
> have done some work on this so I suppose that right now is as good a time
> as any for you to try it out.
> Set the default datastore to org.apache.gora.avro.store.AvroStore, I think.
> You can double-check by looking in gora.properties.
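> For example, the relevant line in gora.properties would look something like
> this (any AvroStore-specific output path settings are version dependent):
> gora.datastore.default=org.apache.gora.avro.store.AvroStore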
> As a note, you should use nutch-site.xml within the top-level conf
> directory for all your Nutch configuration. You should then create a new
> job jar for use in hadoop by calling 'ant job' after the changes are made.
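> Roughly, from the top of the source checkout (not from runtime/):
> vi conf/nutch-site.xml      # set storage.data.store.class here
> ant job
> cd runtime/deploy
> bin/nutch inject urls -crawlId crawl1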
> hth
> Lewis
>
> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> > The quick responses flowing in are very encouraging. Thanks, Tejas.
> > Tejas, as I mentioned earlier, I did in fact run it step by step.
> >
> > So first I ran the inject command and then readdb with the dump option,
> > and did not see anything in the dump files, which leads me to say that
> > the inject did not work. I verified regex-urlfilter and made sure that
> > my url is not getting filtered.
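> > For reference, the check was roughly this (the exact readdb flags may
> > differ by version, and dump_out is just an example output dir):
> > bin/nutch readdb -dump dump_out -crawlId crawl1
> > hadoop fs -cat dump_out/part-* | head
> > grep -v '^#' ../local/conf/regex-urlfilter.txt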
> >
> > I agree that the second link is about configuring HBase as a storageDB.
> > However, I do not have Hbase installed and dont foresee getting it
> > installed any sooner, hence using HBase for storage is not a option, so I
> > am going to have to stick to Gora with memory store.
> >
> >
> >
> >
> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]>
> > wrote:
> >
> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
> >>
> >> > Thanks for the response Lewis.
> >> > I did read these links; I mostly followed the first link and tried both
> >> > the 3.2 and 3.3 sections. Using bin/crawl gave me a null pointer
> >> > exception on solr, so I figured that I should first deal with getting
> >> > the crawl part to work and then deal with solr indexing. Hence I went
> >> > back to trying it stepwise.
> >> >
> >>
> >> You should try running the crawl using the individual commands and see
> >> where the problem is. The nutch tutorial which Lewis pointed you to has
> >> those commands. Peeking into the bin/crawl script would also help, as it
> >> calls the nutch commands, for example:
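> >> (roughly, for a single round; check bin/crawl for the exact flags in your
> >> version, and crawl1 is just an example id)
> >> bin/nutch inject urls -crawlId crawl1
> >> bin/nutch generate -topN 10 -crawlId crawl1
> >> bin/nutch fetch -all -crawlId crawl1
> >> bin/nutch parse -all -crawlId crawl1
> >> bin/nutch updatedb -crawlId crawl1
> >> bin/nutch readdb -stats -crawlId crawl1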
> >>
> >> >
> >> > As for the second link, it is more about using HBase as the store
> >> > instead of gora. This is not really an option for me yet, because my
> >> > grid does not have hbase installed yet. Getting it installed is not
> >> > really under my control.
> >> >
> >>
> >> HBase is one of the datastores supported by Apache Gora. That tutorial
> >> speaks about how to configure Nutch (actually Gora) to use HBase as a
> >> backend. So it's wrong to say that the tutorial was about HBase and not
> >> Gora.
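> >> For what it's worth, switching to that backend later would mostly be a
> >> matter of configuration, something like:
> >> <property>
> >>   <name>storage.data.store.class</name>
> >>   <value>org.apache.gora.hbase.store.HBaseStore</value>
> >> </property>
> >> in nutch-site.xml, plus the matching default datastore entry in
> >> gora.properties.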
> >>
> >> >
> >> > The FAQ link is the one I had not gone through until I checked your
> >> > response, but I do not find answers to any of my questions
> >> > (directly/indirectly) in it.
> >> >
> >>
> >> Ok
> >>
> >> >
> >> >
> >> >
> >> >
> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> >> > [email protected]> wrote:
> >> >
> >> > > Hi Hemant,
> >> > > I strongly advise you to take some time to look through the Nutch
> >> > Tutorial
> >> > > for 1.x and 2.x.
> >> > > http://wiki.apache.org/nutch/NutchTutorial
> >> > > http://wiki.apache.org/nutch/Nutch2Tutorial
> >> > > Also please see the FAQs, which you will find very useful.
> >> > > http://wiki.apache.org/nutch/FAQ
> >> > >
> >> > > Thanks
> >> > > Lewis
> >> > >
> >> > >
> >> > > On Wed, Jun 26, 2013 at 5:18 PM, h b <[email protected]> wrote:
> >> > >
> >> > > > Hi,
> >> > > > I am a first-time user of nutch. I installed
> >> > > > nutch(2.2)/solr(4.3)/hadoop(0.20) and got started crawling a single
> >> > > > webpage.
> >> > > >
> >> > > > I am running nutch step by step. These are the problems I came
> >> > > > across -
> >> > > >
> >> > > > 1. Inject did not work, i.e. the url does not reflect in the
> >> > > > webdb (gora memstore). The way I verify this is: after running
> >> > > > inject, I run readdb with dump. This created a directory in hdfs
> >> > > > with a 0-size part file.
> >> > > >
> >> > > > 2. Config files - this confused me a lot. When run from the deploy
> >> > > > directory, does nutch use the config files from local/conf? Changes
> >> > > > made to local/conf/nutch-site.xml did not take effect after editing
> >> > > > this file. I had to edit it in order to get rid of the
> >> > > > 'http.agent.name' error. I finally ended up hard-coding this in the
> >> > > > code, rebuilding and running to keep going forward.
> >> > > >
> >> > > > 3. How to interpret readdb - running readdb -stats shows a lot of
> >> > > > output, but I do not see my url from seed.txt in there. So I do not
> >> > > > know if the entry in webdb actually reflects my seed.txt at all or
> >> > > > not.
> >> > > >
> >> > > > 4. Logs - when nutch is run from the deploy directory,
> >> > > > logs/hadoop.log is not generated anymore, not locally nor on the
> >> > > > grid. I tried to make
> >> > >
>
> --
> *Lewis*
>