Re: Questions/issues with nutch

Tejas Patil Fri, 28 Jun 2013 09:53:38 -0700

The "storage.schema.webpage" seems messed up but I don't have ample time
now to look into it. Here is what I would suggest to get things working:
*[1] Remove all the old data from HBase*


(I assume that HBase is running while you do this)
*cd $HBASE_HOME*
*./bin/hbase shell*
In the HBase shell, use "list" to see all the tables, then delete every
table related to Nutch (those whose names match *webpage).
Remove each one with the "disable" and "drop" commands.

e.g. if one of the tables is "webpage", you would run this:
*disable 'webpage'*
*drop 'webpage'*
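If there are several tables to remove, the same disable/drop pairs can be
generated rather than typed. A minimal sketch, assuming a POSIX shell; the
function name hbase_drop_commands is made up for illustration, and it only
prints text:

```shell
# Print HBase shell commands that disable and then drop each named table.
# Pure text generation: nothing is sent to HBase by this function.
# hbase_drop_commands is a hypothetical helper, not part of HBase or Nutch.
hbase_drop_commands() {
  for table in "$@"; do
    printf "disable '%s'\n" "$table"
    printf "drop '%s'\n" "$table"
  done
}
```

The output could then be piped into the shell, e.g.
hbase_drop_commands webpage crawl2_webpage | ./bin/hbase shell
assuming your HBase version's shell reads commands from standard input.
Check the output of "list" first so that only Nutch tables are named.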

*[2] Run crawl*
I assume that you have not changed "storage.schema.webpage" in
nutch-site.xml or nutch-default.xml. If you have, revert it to:

*<property>*
*  <name>storage.schema.webpage</name>*
*  <value>webpage</value>*
*  <description>This value holds the schema name used for Nutch web db.*
*  Note that Nutch ignores the value in the gora mapping files, and uses*
*  this as the webpage schema name.*
*  </description>*
*</property>*
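
To check which value a config file actually holds, something like the
following may help. It is a rough sketch that assumes grep supports -A and
that <name> and <value> sit on consecutive lines as above; config_value is
a hypothetical helper, not part of Nutch:

```shell
# Print the <value> that follows a given <name> in a Hadoop-style config file.
# Assumes <name> and <value> are on consecutive lines; prints nothing if the
# property is not found. config_value is a hypothetical helper name.
# Usage: config_value PROPERTY_NAME FILE
config_value() {
  grep -A1 "<name>$1</name>" "$2" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}
```

For example, config_value storage.schema.webpage conf/nutch-site.xml should
print webpage if the property is set as shown, and nothing if that file does
not define it.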

Run crawl commands:
*bin/nutch inject urls/*
*bin/nutch generate -topN 50000 -noFilter -adddays 0*
*bin/nutch fetch -all -threads 5*
*bin/nutch parse -all*
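
The four crawl commands can also be wrapped so that a failing step stops the
run instead of silently feeding the next one. A sketch only: run_crawl_steps
and the NUTCH_BIN override are hypothetical names, and the arguments are
copied from the commands above:

```shell
# Run the crawl steps in order and stop at the first one that fails.
# NUTCH_BIN is a hypothetical override for the launcher; it defaults to
# bin/nutch, so call this from the Nutch runtime directory.
run_crawl_steps() {
  nutch="${NUTCH_BIN:-bin/nutch}"
  "$nutch" inject urls/ || return 1
  "$nutch" generate -topN 50000 -noFilter -adddays 0 || return 1
  "$nutch" fetch -all -threads 5 || return 1
  "$nutch" parse -all || return 1
}
```

Setting NUTCH_BIN=echo gives a dry run that prints the arguments of each
step without starting any job.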

*[3] Perform indexing*
I assume that you have Solr set up and NUTCH_HOME/conf/schema.xml copied into
${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0] for details.
Start Solr and run the indexing command:
*bin/nutch solrindex $SOLR_URL -all*
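
Once indexing finishes, one way to confirm that documents reached Solr is a
match-all query with rows=0 and a look at numFound in the response. A small
sketch; solr_count_url is a hypothetical helper that only builds the URL, and
it assumes $SOLR_URL is the same base URL passed to solrindex:

```shell
# Build a Solr select URL that matches every document but returns no rows,
# so the response carries only the total count (numFound).
# solr_count_url is a hypothetical helper; it performs no network access.
solr_count_url() {
  # ${1%/} drops one trailing slash from the base URL, if present.
  printf '%s/select?q=*:*&rows=0\n' "${1%/}"
}
```

For example: curl "$(solr_count_url "$SOLR_URL")". A numFound of 0 would mean
nothing was indexed.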

[0] : http://wiki.apache.org/nutch/NutchTutorial

Thanks,
Tejas

On Thu, Jun 27, 2013 at 1:47 PM, h b <[email protected]> wrote:

> Ok, so avro did not work quite well for me, I got a test grid with hbase,
> and I started using that for now. All steps ran without errors and I see my
> crawled doc in hbase.
> However, after running the solr integration, and querying solr, I get back
> nothing. Index files look very tiny. The one thing I noted is a message
> during almost every step
>
> 13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass match but
> mismatching table names  mappingfile schema is 'webpage' vs actual schema
> 'crawl2_webpage' , assuming they are the same.
>
> This looks suspicious and I think this is the one causing the solr index to
> be empty. Googling suggested I should edit nutch-default.xml; I tried that
> and rebuilt the job, but the message is still there.
>
>
>
> On Thu, Jun 27, 2013 at 10:30 AM, h b <[email protected]> wrote:
>
> > Ok, I ran ant, ant jar and ant job and that seems to have picked up the
> > config changes.
> > Now, the inject output shows that it is using AvroStore as Gora storage.
> >
> > Now I am getting a NullPointerException:
> >
> > java.lang.NullPointerException
> >         at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
> >         at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
> >         at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:264)
> >
> > which does not look Nutch-related. I will work on this and write back
> > if I get stuck on something else, or if I succeed.
> >
> >
> > On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]> wrote:
> >
> >> Hi Lewis,
> >>
> >> Sorry for missing that one. So I updated the top level conf and rebuilt
> >> the job.
> >>
> >> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
> >>
> >> ......
> >>   <property>
> >>     <name>storage.data.store.class</name>
> >>     <value>org.apache.gora.avro.store.AvroStore</value>
> >>   </property>
> >> ......
> >>
> >> cd ~/nutch/apache-nutch-2.2/
> >> ant job
> >> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
> >>
> >>
> >> bin/nutch inject urls -crawlId crawl1
> >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at
> >> 2013-06-27 17:12:01
> >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir:
> >> urls
> >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class
> >> org.apache.gora.memory.store.MemStore as the Gora storage class.
> >>
> >> It still shows me MemStore.
> >>
> >> In the jobtracker I see that the "[crawl1]inject urls" job does not have
> >> a urls_injected property.
> >> I do have *db.score.injected* set to 1.0, but I don't think that says
> >> anything about urls injected.
> >>
> >>
> >>
> >> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <
> >> [email protected]> wrote:
> >>
> >>> Hi,
> >>> Please re-read my mail.
> >>> If you are using the deploy directory e.g. running on a hadoop cluster,
> >>> then make sure to edit nutch-site.xml from within the top level conf
> >>> directory _not_ the conf directory in runtime/local.
> >>> If you look at the ant runtime target in the build script you will see
> >>> the code which generates the runtime directory structure.
> >>> Make changes to conf/nutch-site.xml, build the job jar, navigate to
> >>> runtime/deploy, run the code.
> >>> It's easier to make the job jar and scripts in deploy available to the
> >>> job tracker.
> >>> You also didn't comment on the counters for the inject job. Do you see
> >>> any?
> >>> Best
> >>> Lewis
> >>>
> >>> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> >>> > Here is an example of what I am saying about the config changes not
> >>> > taking effect.
> >>> >
> >>> > cd runtime/deploy
> >>> > cat ../local/conf/nutch-site.xml
> >>> > ......
> >>> >
> >>> >   <property>
> >>> >     <name>storage.data.store.class</name>
> >>> >     <value>org.apache.gora.avro.store.AvroStore</value>
> >>> >   </property>
> >>> > .....
> >>> >
> >>> > cd ../..
> >>> >
> >>> > ant job
> >>> >
> >>> > cd runtime/deploy
> >>> > bin/nutch inject urls -crawlId crawl1
> >>> > .....
> >>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
> >>> > org.apache.gora.memory.store.MemStore as the Gora storage class.
> >>> > .....
> >>> >
> >>> > So the nutch-site.xml was changed to use AvroStore as storage class,
> >>> > the job was rebuilt, and I reran inject, the output of which still
> >>> > shows that it is trying to use MemStore.
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
> >>> > [email protected]> wrote:
> >>> >
> >>> >> The Gora MemStore was introduced to deal predominantly with test
> >>> >> scenarios. This is justified as the 2.x code is pulled nightly and
> >>> >> after every commit and tested.
> >>> >> It is not thread safe and should not be used (until we fix some
> >>> >> issues) for any kind of serious deployment.
> >>> >> From your inject task on the job tracker, you will be able to see
> >>> >> 'urls_injected' counters which represent the number of urls actually
> >>> >> persisted through Gora into the datastore.
> >>> >> I understand that HBase is not an option. Gora should also support
> >>> >> writing the output into Avro sequence files... which can be pumped
> >>> >> into hdfs. We have done some work on this so I suppose that right now
> >>> >> is as good a time as any for you to try it out.
> >>> >> Use org.apache.gora.avro.store.AvroStore as the default datastore, I
> >>> >> think. You can double check by looking into gora.properties.
> >>> >> As a note, you should use nutch-site.xml within the top level conf
> >>> >> directory for all your Nutch configuration. You should then create a
> >>> >> new job jar for use in hadoop by calling 'ant job' after the changes
> >>> >> are made.
> >>> >> hth
> >>> >> Lewis
> >>> >>
> >>> >> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> >>> >> > The quick responses flowing are very encouraging. Thanks Tejas.
> >>> >> > Tejas, as I mentioned earlier, in fact I actually ran it step by
> >>> >> > step.
> >>> >> >
> >>> >> > So first I ran the inject command and then the readdb with dump
> >>> >> > option and did not see anything in the dump files, which leads me to
> >>> >> > say that the inject did not work. I verified the regex-urlfilter and
> >>> >> > made sure that my url is not getting filtered.
> >>> >> >
> >>> >> > I agree that the second link is about configuring HBase as a
> >>> >> > storage DB. However, I do not have HBase installed and don't foresee
> >>> >> > getting it installed any time soon, hence using HBase for storage is
> >>> >> > not an option, so I am going to have to stick to Gora with memory
> >>> >> > store.
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:
> >>> >> >
> >>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
> >>> >> >>
> >>> >> >> > Thanks for the response Lewis.
> >>> >> >> > I did read these links, I mostly followed the first link and
> >>> >> >> > tried both the 3.2 and 3.3 sections. Using bin/crawl gave me a
> >>> >> >> > null pointer exception on solr, so I figured that I should first
> >>> >> >> > deal with getting the crawl part to work and then deal with solr
> >>> >> >> > indexing. Hence I went back to trying it stepwise.
> >>> >> >> >
> >>> >> >>
> >>> >> >> You should try running the crawl using individual commands and see
> >>> >> >> where the problem is. The nutch tutorial which Lewis pointed you to
> >>> >> >> has those commands. Peeking into the bin/crawl script would also
> >>> >> >> help, as it calls the nutch commands.
> >>> >> >>
> >>> >> >> >
> >>> >> >> > As for the second link, it is more about using HBase as store
> >>> >> >> > instead of gora. This is not really an option for me yet, because
> >>> >> >> > my grid does not have hbase installed yet. Getting it done is not
> >>> >> >> > much under my control.
> >>> >> >> >
> >>> >> >>
> >>> >> >> HBase is one of the datastores supported by Apache Gora. That
> >>> >> >> tutorial speaks about how to configure Nutch (actually Gora) to use
> >>> >> >> HBase as a backend. So, it's wrong to say that the tutorial was
> >>> >> >> about HBase and not Gora.
> >>> >> >>
> >>> >> >> >
> >>> >> >> > The FAQ link is the one I had not gone through until I checked
> >>> >> >> > your response, but I do not find answers to any of my questions
> >>> >> >> > (directly/indirectly) in it.
> >>> >> >> >
> >>> >> >>
> >>> >> >> Ok
> >>> >> >>
> >>> >> >> >
> >>> >> >> >
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> >>> >> >> > > *Lewis*
> >>> >>
> >>> >
> >>>
> >>> --
> >>> *Lewis*
> >>>
> >>
> >>
> >
>
