Re: Questions/issues with nutch

h b Thu, 27 Jun 2013 13:48:24 -0700

OK, so Avro did not work well for me. I got a test grid with HBase and have
started using that for now. All steps ran without errors, and I see my
crawled doc in HBase.
However, after running the Solr integration and querying Solr, I get back
nothing. The index files look very tiny. The one thing I noticed is this
message during almost every step:


13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass match but
mismatching table names  mappingfile schema is 'webpage' vs actual schema
'crawl2_webpage' , assuming they are the same.

This looks suspicious, and I think it is what is causing the Solr index to
be empty. Googling suggested I should edit nutch-default.xml. I tried that
and rebuilt the job, but the message still appears.



On Thu, Jun 27, 2013 at 10:30 AM, h b <[email protected]> wrote:

> OK, I ran ant, ant jar and ant job, and that seems to have picked up the
> config changes.
> Now, the inject output shows that it is using AvroStore as Gora storage.
>
> Now I am getting a NullPointerException:
>
> java.lang.NullPointerException
>         at
> org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
>         at
> org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
>         at
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
>         at org.apache.hadoop.mapred.Child.main(Child.java:264)
>
> which does not look Nutch-related. I will work on this and write back if I
> get stuck on something else, or if I succeed.
>
>
> On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]> wrote:
>
>> Hi Lewis,
>>
>> Sorry for missing that one. So I updated the top-level conf and rebuilt
>> the job.
>>
>> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
>>
>> ......
>>   <property>
>>     <name>storage.data.store.class</name>
>>     <value>org.apache.gora.avro.store.AvroStore</value>
>>   </property>
>> ......
>>
>> cd ~/nutch/apache-nutch-2.2/
>> ant job
>> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
>>
>>
>> bin/nutch inject urls -crawlId crawl1
>> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at
>> 2013-06-27 17:12:01
>> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir:
>> urls
>> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class
>> org.apache.gora.memory.store.MemStore as the Gora storage class.
>>
>> It still shows me MemStore.
>>
>> In the jobtracker I see that the [crawl1]inject urls job does not have a
>> urls_injected property.
>> I have a *db.score.injected* of 1.0, but I don't think that says anything
>> about urls injected.
>>
>>
>>
>> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <
>> [email protected]> wrote:
>>
>>> Hi,
>>> Please re-read my mail.
>>> If you are using the deploy directory e.g. running on a hadoop cluster,
>>> then make sure to edit nutch-site.xml from within the top level conf
>>> directory _not_ the conf directory in runtime/local.
>>> If you look at the ant runtime target in the build script you will see
>>> the code which generates the runtime directory structure.
>>> Make changes to conf/nutch-site.xml, build the job jar, navigate to
>>> runtime/deploy, run the code.
>>> It's easier to make the job jar and scripts in deploy available to the
>>> job tracker.
>>> You also didn't comment on the counters for the inject job. Do you see
>>> any?
>>> Best
>>> Lewis
>>>
>>> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>>> > Here is an example of what I am saying about the config changes not
>>> > taking effect.
>>> >
>>> > cd runtime/deploy
>>> > cat ../local/conf/nutch-site.xml
>>> > ......
>>> >
>>> >   <property>
>>> >     <name>storage.data.store.class</name>
>>> >     <value>org.apache.gora.avro.store.AvroStore</value>
>>> >   </property>
>>> > .....
>>> >
>>> > cd ../..
>>> >
>>> > ant job
>>> >
>>> > cd runtime/deploy
>>> > bin/nutch inject urls -crawlId crawl1
>>> > .....
>>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
>>> > org.apache.gora.memory.store.MemStore as the Gora storage class.
>>> > .....
>>> >
>>> > So the nutch-site.xml was changed to use AvroStore as the storage class
>>> > and the job was rebuilt. I reran inject, and the output still shows that
>>> > it is trying to use MemStore.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
>>> > [email protected]> wrote:
>>> >
>>> >> The Gora MemStore was introduced to deal predominantly with test
>>> >> scenarios. This is justified as the 2.x code is pulled nightly and after
>>> >> every commit and tested.
>>> >> It is not thread safe and should not be used (until we fix some issues)
>>> >> for any kind of serious deployment.
>>> >> From your inject task on the job tracker, you will be able to see
>>> >> 'urls_injected' counters which represent the number of urls actually
>>> >> persisted through Gora into the datastore.
>>> >> I understand that HBase is not an option. Gora should also support
>>> >> writing the output into Avro sequence files... which can be pumped into
>>> >> hdfs. We have done some work on this so I suppose that right now is as
>>> >> good a time as any for you to try it out.
>>> >> Use the default datastore as org.apache.gora.avro.store.AvroStore, I
>>> >> think. You can double check by looking into gora.properties.
>>> >> As a note, you should use nutch-site.xml within the top level conf
>>> >> directory for all your Nutch configuration. You should then create a new
>>> >> job jar for use in hadoop by calling 'ant job' after the changes are
>>> >> made.
>>> >> hth
>>> >> Lewis
>>> >>
>>> >> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>>> >> > The quick responses flowing are very encouraging. Thanks Tejas.
>>> >> > Tejas, as I mentioned earlier, I actually ran it step by step.
>>> >> >
>>> >> > So first I ran the inject command and then readdb with the dump option,
>>> >> > and did not see anything in the dump files, which leads me to say that
>>> >> > the inject did not work. I verified the regex-urlfilter and made sure
>>> >> > that my url is not getting filtered.
>>> >> >
>>> >> > I agree that the second link is about configuring HBase as a storage DB.
>>> >> > However, I do not have HBase installed and don't foresee getting it
>>> >> > installed any time soon, hence using HBase for storage is not an option,
>>> >> > so I am going to have to stick to Gora with the memory store.
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]>
>>> >> > wrote:
>>> >> >
>>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
>>> >> >>
>>> >> >> > Thanks for the response Lewis.
>>> >> >> > I did read these links. I mostly followed the first link and tried
>>> >> >> > both the 3.2 and 3.3 sections. Using bin/crawl gave me a null
>>> >> >> > pointer exception on solr, so I figured that I should first deal
>>> >> >> > with getting the crawl part to work and then deal with solr
>>> >> >> > indexing. Hence I went back to trying it stepwise.
>>> >> >> >
>>> >> >>
>>> >> >> You should try running the crawl using individual commands and see
>>> >> >> where the problem is. The nutch tutorial which Lewis pointed you to
>>> >> >> had those commands. Even peeking into the bin/crawl script would
>>> >> >> help as it calls the nutch commands.
>>> >> >>
>>> >> >> >
>>> >> >> > As for the second link, it is more about using HBase as store
>>> >> >> > instead of gora. This is not really an option for me yet, because
>>> >> >> > my grid does not have hbase installed yet. Getting it done is not
>>> >> >> > much under my control.
>>> >> >> >
>>> >> >>
>>> >> >> HBase is one of the datastores supported by Apache Gora. That
>>> >> >> tutorial speaks about how to configure Nutch (actually Gora) to use
>>> >> >> HBase as a backend. So, it's wrong to say that the tutorial was
>>> >> >> about HBase and not Gora.
>>> >> >>
>>> >> >> >
>>> >> >> > The FAQ link is the one I had not gone through until I checked
>>> >> >> > your response, but I do not find answers to any of my questions
>>> >> >> > (directly/indirectly) in it.
>>> >> >> >
>>> >> >>
>>> >> >> Ok
>>> >> >>
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
>>> >> >> > > *Lewis*
>>> >>
>>> >
>>>
>>> --
>>> *Lewis*
>>>
>>
>>
>
