Re: Questions/issues with nutch

h b Thu, 27 Jun 2013 10:31:48 -0700

OK, I ran ant, ant jar and ant job, and that seems to have picked up the
config changes.
Now, the inject output shows that it is using AvroStore as Gora storage.


Now I am getting a NullPointerException:

java.lang.NullPointerException
        at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
        at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)

which does not look Nutch-related. I will work on this and write back
if I get stuck on something else, or if I succeed.
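
For anyone hitting the same "config change not picked up" problem, here is a
minimal sketch of how one might check which value a Hadoop-style config file
actually holds for storage.data.store.class before rebuilding the job. The
helper name read_property and the inline SAMPLE text are my own illustration,
not part of Nutch; in practice you would pass in the contents of the top-level
conf/nutch-site.xml. It only parses the XML and does not reproduce how Nutch
or Hadoop merge default and site configuration.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the <value> of the named <property>, or None if it is absent.

    Hypothetical helper for illustration; it only reads the
    <property><name>/<value> layout shown in this thread.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Stand-in for the contents of conf/nutch-site.xml (assumed layout).
SAMPLE = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.avro.store.AvroStore</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(read_property(SAMPLE, "storage.data.store.class"))
```

Running the same check against both the top-level conf/nutch-site.xml and the
copy under runtime/local/conf would show whether the two files disagree.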


On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]> wrote:

> Hi Lewis,
>
> Sorry for missing that one. So I updated the top-level conf and rebuilt the
> job.
>
> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
>
> ......
>   <property>
>     <name>storage.data.store.class</name>
>     <value>org.apache.gora.avro.store.AvroStore</value>
>   </property>
> ......
>
> cd ~/nutch/apache-nutch-2.2/
> ant job
> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
>
>
> bin/nutch inject urls -crawlId crawl1
> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at 2013-06-27 17:12:01
> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls
> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
>
> It still shows me MemStore.
>
> In the jobtracker I see that the [crawl1]inject urls job does not have a
> urls_injected property.
> I see a *db.score.injected* of 1.0, but I don't think that says anything
> about the URLs injected.
>
>
>
> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi,
>> Please re-read my mail.
>> If you are using the deploy directory e.g. running on a hadoop cluster,
>> then make sure to edit nutch-site.xml from within the top level conf
>> directory _not_ the conf directory in runtime/local.
>> If you look at the ant runtime target in the build script you will see the
>> code which generates the runtime directory structure.
>> Make changes to conf/nutch-site.xml, build the job jar, navigate to
>> runtime/deploy, run the code.
>> It's easier to make the job jar and scripts in deploy available to the job
>> tracker.
>> You also didn't comment on the counters for the inject job. Do you see
>> any?
>> Best
>> Lewis
>>
>> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>> > Here is an example of what I am saying about the config changes not
>> taking
>> > effect.
>> >
>> > cd runtime/deploy
>> > cat ../local/conf/nutch-site.xml
>> > ......
>> >
>> >   <property>
>> >     <name>storage.data.store.class</name>
>> >     <value>org.apache.gora.avro.store.AvroStore</value>
>> >   </property>
>> > .....
>> >
>> > cd ../..
>> >
>> > ant job
>> >
>> > cd runtime/deploy
>> > bin/nutch inject urls -crawlId crawl1
>> > .....
>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
>> > .....
>> >
>> > So nutch-site.xml was changed to use AvroStore as the storage class, the
>> > job was rebuilt, and I reran inject, but the output still shows that it
>> > is trying to use MemStore.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> >> The Gora MemStore was introduced to deal predominantly with test
>> scenarios.
>> >> This is justified as the 2.x code is pulled nightly and after every
>> commit
>> >> and tested.
>> >> It is not thread safe and should not be used (until we fix some
>> issues)
>> >> for any kind of serious deployment.
>> >> From your inject task on the job tracker, you will be able to see
>> >> 'urls_injected' counters which represent the number of urls actually
>> >> persisted through Gora into the datastore.
>> >> I understand that HBase is not an option. Gora should also support
>> writing
>> >> the output into Avro sequence files... which can be pumped into hdfs.
>> We
>> >> have done some work on this so I suppose that right now is as good a
>> time
>> >> as any for you to try it out.
>> >> Use org.apache.gora.avro.store.AvroStore as the default datastore, I
>> >> think.
>> >> You can double-check by looking into gora.properties.
>> >> As a note, you should use nutch-site.xml within the top level conf
>> >> directory for all your Nutch configuration. You should then create a
>> new
>> >> job jar for use in hadoop by calling 'ant job' after the changes are
>> made.
>> >> hth
>> >> Lewis
>> >>
>> >> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>> >> > The quick responses flowing are very encouraging. Thanks Tejas.
>> >> > Tejas, as I mentioned earlier, in fact I actually ran it step by
>> step.
>> >> >
>> >> > So first I ran the inject command and then the readdb with dump
>> option
>> >> and
>> >> > did not see anything in the dump files, that leads me to say that the
>> >> > inject did not work. I verified the regex-urlfilter and made sure that
>> my
>> >> > url is not getting filtered.
>> >> >
>> >> > I agree that the second link is about configuring HBase as a
>> storageDB.
>> >> > However, I do not have HBase installed and don't foresee getting it
>> >> > installed any time soon, hence using HBase for storage is not an
>> >> > option, so I am going to have to stick to Gora with memory store.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <
>> [email protected]
>> >> >wrote:
>> >> >
>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
>> >> >>
>> >> >> > Thanks for the response Lewis.
>> >> >> > I did read these links, I mostly followed the first link and tried
>> >> both
>> >> >> the
>> >> >> > 3.2 and 3.3 sections. Using the bin/crawl gave me null pointer
>> >> exception
>> >> >> on
>> >> >> > solr, so I figured that I should first deal with getting the crawl
>> >> part
>> >> >> to
>> >> >> > work and then deal with solr indexing. Hence I went back to trying
>> it
>> >> >> > stepwise.
>> >> >> >
>> >> >>
>> >> >> You should try running the crawl using individual commands and see
>> where
>> >> >> the problem is. The nutch tutorial which Lewis pointed you to had
>> those
>> >> >> commands. Even peeking into the bin/crawl script would also help as
>> it
>> >> >> calls the nutch commands.
>> >> >>
>> >> >> >
>> >> >> > As for the second link, it is more about using HBase as store
>> instead
>> >> of
>> >> >> > gora. This is not really an option for me yet, because my grid does
>> not
>> >> have
>> >> >> > hbase installed yet. Getting it done is not much under my control
>> >> >> >
>> >> >>
>> >> >> HBase is one of the datastores supported by Apache Gora. That
>> tutorial
>> >> >> speaks about how to configure Nutch (actually Gora) to use HBase as
>> a
>> >> >> backend. So, it's wrong to say that the tutorial was about HBase and
>> not
>> >> >> Gora.
>> >> >>
>> >> >> >
>> >> >> > the FAQ link is the one I had not gone through until I checked
>> your
>> >> >> > response, but I do not find answers to any of my questions
>> >> >> > (directly/indirectly) in it.
>> >> >> >
>> >> >>
>> >> >> Ok
>> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
>> >> >> > > *Lewis*
>> >>
>> >
>>
>> --
>> *Lewis*
>>
>
>
