Re: Questions/issues with nutch

Lewis John Mcgibbney Thu, 27 Jun 2013 07:11:12 -0700

Hi,
Please re-read my mail.
If you are using the deploy directory (e.g. running on a Hadoop cluster),
then make sure to edit nutch-site.xml in the top-level conf directory,
_not_ the conf directory in runtime/local.
If you look at the ant runtime target in the build script you will see the
code which generates the runtime directory structure.
Make changes to conf/nutch-site.xml, build the job jar, navigate to
runtime/deploy, and run the code.
It's easier to make the job jar and scripts in deploy available to the job
tracker.
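
Before rebuilding, it can help to confirm which copy of nutch-site.xml
actually carries the change. The following is a minimal sketch, not part of
Nutch: get_prop is a hypothetical helper, and it assumes <name> and <value>
sit on separate lines as in the snippet quoted below. The two paths are the
ones discussed in this thread, relative to the top of the source tree.

```shell
#!/bin/sh
# Hypothetical helper (not a Nutch command): print the <value> configured
# for a property in a Hadoop-style XML config file such as nutch-site.xml.
# Assumes <name> and <value> are on separate lines.
get_prop() {
  # $1 = config file, $2 = property name
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")      # drop everything up to the opening tag
      sub(/<\/value>.*/, "")    # drop the closing tag and anything after
      print
      exit
    }
  ' "$1"
}

# Compare the top-level conf with the copy under runtime/local.
# Files that do not exist are skipped silently.
for f in conf/nutch-site.xml runtime/local/conf/nutch-site.xml; do
  if [ -f "$f" ]; then
    echo "$f: $(get_prop "$f" storage.data.store.class)"
  fi
done
```

If only the runtime/local copy shows the new storage class, the change was
made in the wrong file: move it to conf/nutch-site.xml and rebuild before
running from runtime/deploy.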
You also didn't comment on the counters for the inject job. Do you see any?
Best
Lewis


On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> Here is an example of what I am saying about the config changes not taking
> effect.
>
> cd runtime/deploy
> cat ../local/conf/nutch-site.xml
> ......
>
>   <property>
>     <name>storage.data.store.class</name>
>     <value>org.apache.gora.avro.store.AvroStore</value>
>   </property>
> .....
>
> cd ../..
>
> ant job
>
> cd runtime/deploy
> bin/nutch inject urls -crawlId crawl1
> .....
> 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
> org.apache.gora.memory.store.MemStore as the Gora storage class.
> .....
>
> So nutch-site.xml was changed to use AvroStore as the storage class and the
> job was rebuilt. I then reran inject, and its output still shows that it is
> trying to use MemStore.
>
> On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> The Gora MemStore was introduced predominantly for test scenarios. This is
>> justified as the 2.x code is pulled and tested nightly and after every
>> commit.
>> It is not thread safe and should not be used (until we fix some issues)
>> for any kind of serious deployment.
>> From your inject task on the job tracker, you will be able to see
>> 'urls_injected' counters which represent the number of urls actually
>> persisted through Gora into the datastore.
>> I understand that HBase is not an option. Gora should also support writing
>> the output into Avro sequence files... which can be pumped into hdfs. We
>> have done some work on this so I suppose that right now is as good a time
>> as any for you to try it out.
>> Use org.apache.gora.avro.store.AvroStore as the default datastore, I think.
>> You can double check by looking into gora.properties.
>> As a note, you should use nutch-site.xml within the top level conf
>> directory for all your Nutch configuration. You should then create a new
>> job jar for use in hadoop by calling 'ant job' after the changes are made.
>> hth
>> Lewis
>>
>> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>> > The quick responses flowing are very encouraging. Thanks Tejas.
>> > Tejas, as I mentioned earlier, in fact I actually ran it step by step.
>> >
>> > So first I ran the inject command and then readdb with the dump option,
>> > and did not see anything in the dump files, which leads me to say that
>> > the inject did not work. I verified the regex-urlfilter and made sure
>> > that my url is not getting filtered.
>> >
>> > I agree that the second link is about configuring HBase as a storageDB.
>> > However, I do not have HBase installed and don't foresee getting it
>> > installed any time soon, hence using HBase for storage is not an option,
>> > so I am going to have to stick to Gora with the memory store.
>> >
>> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:
>> >
>> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
>> >>
>> >> > Thanks for the response Lewis.
>> >> > I did read these links. I mostly followed the first link and tried
>> >> > both the 3.2 and 3.3 sections. Using bin/crawl gave me a null pointer
>> >> > exception on solr, so I figured that I should first get the crawl
>> >> > part to work and then deal with solr indexing. Hence I went back to
>> >> > trying it stepwise.
>> >> >
>> >>
>> >> You should try running the crawl using individual commands and see
>> >> where the problem is. The nutch tutorial which Lewis pointed you to has
>> >> those commands. Peeking into the bin/crawl script would also help, as
>> >> it calls the nutch commands.
>> >>
>> >> >
>> >> > As for the second link, it is more about using HBase as the store
>> >> > instead of gora. This is not really an option for me yet, because my
>> >> > grid does not have hbase installed yet. Getting it done is not much
>> >> > under my control.
>> >> >
>> >>
>> >> HBase is one of the datastores supported by Apache Gora. That tutorial
>> >> speaks about how to configure Nutch (actually Gora) to use HBase as a
>> >> backend. So, it's wrong to say that the tutorial was about HBase and
>> >> not Gora.
>> >>
>> >> >
>> >> > The FAQ link is the one I had not gone through until I checked your
>> >> > response, but I do not find answers to any of my questions
>> >> > (directly/indirectly) in it.
>> >> >
>> >>
>> >> Ok
>> >>
>> >> >
>> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
>> >> > > *Lewis*
>>
>

-- 
*Lewis*
