Hi Lewis,
Sorry for missing that one. So I updated the top-level conf and rebuilt the
job.
cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
......
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.avro.store.AvroStore</value>
</property>
......
cd ~/nutch/apache-nutch-2.2/
ant job
cd ~/nutch/apache-nutch-2.2/runtime/deploy/
bin/nutch inject urls -crawlId crawl1
13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at
2013-06-27 17:12:01
13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir:
urls
13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class
org.apache.gora.memory.store.MemStore as the Gora storage class.
It still shows me MemStore.
In the jobtracker, the [crawl1]inject urls job does not have a
urls_injected property.
I do see *db.score.injected* set to 1.0, but I don't think that says anything
about the number of urls injected.
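
A quick way to check whether the rebuilt job archive actually picked up the change is to compare the value in the top-level conf with the copy packed into the archive. The following is only a minimal sketch: the helper name store_class is hypothetical, and the archive name and the location of nutch-site.xml inside it are assumptions that should be adjusted to whatever 'ant job' produced.

```shell
# Print the <value> configured for storage.data.store.class in a Nutch XML config.
# Assumes <name> and <value> sit on adjacent lines, as in the snippet above.
store_class() {
  grep -A1 '<name>storage.data.store.class</name>' "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Example usage (paths and archive name are assumptions, not verified):
#   store_class conf/nutch-site.xml
#   unzip -p runtime/deploy/apache-nutch-2.2.job nutch-site.xml > /tmp/packed-site.xml
#   store_class /tmp/packed-site.xml
```

If the two values differ, the job archive was probably built from a stale or different conf directory.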
On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <
[email protected]> wrote:
> Hi,
> Please re-read my mail.
> If you are using the deploy directory e.g. running on a hadoop cluster,
> then make sure to edit nutch-site.xml from within the top level conf
> directory _not_ the conf directory in runtime/local.
> If you look at the ant runtime target in the build script you will see the
> code which generates the runtime directory structure.
> Make changes to conf/nutch-site.xml, build the job jar, navigate to
> runtime/deploy, run the code.
> It's easier to make the job jar and scripts in deploy available to the job
> tracker.
> You also didn't comment on the counters for the inject job. Do you see any?
> Best
> Lewis
>
> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> > Here is an example of what I am saying about the config changes not taking
> > effect.
> >
> > cd runtime/deploy
> > cat ../local/conf/nutch-site.xml
> > ......
> >
> > <property>
> > <name>storage.data.store.class</name>
> > <value>org.apache.gora.avro.store.AvroStore</value>
> > </property>
> > .....
> >
> > cd ../..
> >
> > ant job
> >
> > cd runtime/deploy
> > bin/nutch inject urls -crawlId crawl1
> > .....
> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
> > org.apache.gora.memory.store.MemStore as the Gora storage class.
> > .....
> >
> > So the nutch-site.xml was changed to use AvroStore as storage class and the
> > job was rebuilt, and I reran inject, the output of which still shows that it
> > is trying to use MemStore.
> >
> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> The Gora MemStore was introduced to deal predominantly with test scenarios.
> >> This is justified as the 2.x code is pulled nightly and after every commit
> >> and tested.
> >> It is not thread safe and should not be used (until we fix some issues)
> >> for any kind of serious deployment.
> >> From your inject task on the job tracker, you will be able to see
> >> 'urls_injected' counters which represent the number of urls actually
> >> persisted through Gora into the datastore.
> >> I understand that HBase is not an option. Gora should also support writing
> >> the output into Avro sequence files... which can be pumped into hdfs. We
> >> have done some work on this so I suppose that right now is as good a time
> >> as any for you to try it out.
> >> Use the default datastore as org.apache.gora.avro.store.AvroStore I think.
> >> You can double check by looking into gora.properties.
> >> As a note, you should use nutch-site.xml within the top level conf
> >> directory for all your Nutch configuration. You should then create a new
> >> job jar for use in hadoop by calling 'ant job' after the changes are made.
> >> hth
> >> Lewis
> >>
> >> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> >> > The quick responses flowing are very encouraging. Thanks Tejas.
> >> > Tejas, as I mentioned earlier, in fact I actually ran it step by step.
> >> >
> >> > So first I ran the inject command and then the readdb with dump option and
> >> > did not see anything in the dump files, that leads me to say that the
> >> > inject did not work. I verified the regex-urlfilter and made sure that my
> >> > url is not getting filtered.
> >> >
> >> > I agree that the second link is about configuring HBase as a storageDB.
> >> > However, I do not have HBase installed and don't foresee getting it
> >> > installed any time soon, hence using HBase for storage is not an option, so I
> >> > am going to have to stick to Gora with memory store.
> >> >
> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]>
> >> > wrote:
> >> >
> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
> >> >>
> >> >> > Thanks for the response Lewis.
> >> >> > I did read these links, I mostly followed the first link and tried both the
> >> >> > 3.2 and 3.3 sections. Using the bin/crawl gave me null pointer exception on
> >> >> > solr, so I figured that I should first deal with getting the crawl part to
> >> >> > work and then deal with solr indexing. Hence I went back to trying it
> >> >> > stepwise.
> >> >> >
> >> >>
> >> >> You should try running the crawl using individual commands and see where
> >> >> the problem is. The nutch tutorial which Lewis pointed you to had those
> >> >> commands. Even peeking into the bin/crawl script would also help as it
> >> >> calls the nutch commands.
> >> >>
> >> >> >
> >> >> > As for the second link, it is more about using HBase as store instead of
> >> >> > gora. This is not really an option for me yet, because my grid does not have
> >> >> > hbase installed yet. Getting it done is not much under my control.
> >> >> >
> >> >>
> >> >> HBase is one of the datastores supported by Apache Gora. That tutorial
> >> >> speaks about how to configure Nutch (actually Gora) to use HBase as a
> >> >> backend. So, it's wrong to say that the tutorial was about HBase and not
> >> >> Gora.
> >> >>
> >> >> >
> >> >> > the FAQ link is the one I had not gone through until I checked your
> >> >> > response, but I do not find answers to any of my questions
> >> >> > (directly/indirectly) in it.
> >> >> >
> >> >>
> >> >> Ok
> >> >>
> >> >> >
> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> >> >> > > *Lewis*
> >>
> >
>
> --
> *Lewis*
>