Hi, I'd like to add that 'gora.buffer.read.limit' has no significance with regard to the HBaseStore. What this property means is that after every N records, Gora closes and reopens the store's scanners (continuing at the last processed row). I'm not sure how this relates to other stores, but for HBase you might as well set it to a very high number, so that just one HBase scanner is used for a MapReduce task. (Just do not set it too low, or scanners are closed and reopened unnecessarily.)
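As a sketch of that advice, you could set the property like this in nutch-site.xml (the value here is illustrative, not a recommendation from the post):

```xml
<!-- Illustrative: make Gora's read buffer so large that its scanners are
     effectively never closed and reopened mid-task, so one HBase scanner
     serves the whole MapReduce task. -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>1000000</value>
</property>
```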
For HBase, the property that decides how many rows are read at once is hbase.client.scanner.caching (a client-side property). Setting this too high means that the regionservers are overloaded by the responses. (Check for responseTooLarge errors in the regionserver logs.) Because how fat the input rows are depends on the type of job, it is difficult to pick a single value for all Nutch jobs. For example, the GeneratorJob rows are slim (just a few small columns are read), but the ParserJob rows are fat, because of the content field that is read. What I have done is set hbase.client.scanner.caching to a high number, e.g. 1000, but set a limit on the SIZE in bytes of how big responses can be. This is determined by the property 'hbase.client.scanner.max.result.size', which you can set to 50100100 (50MB) or something like that. This property should be set both server side (regionserver) and client side (so defined within the properties of the submitted job), otherwise you get missing rows: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/27919

Ferdy.

On Wed, Oct 3, 2012 at 9:36 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Matt,
>
> I know there is a pile of stuff to add to this but for the time being
> (until I dive into your response in detail) please see below
>
> On Tue, Oct 2, 2012 at 11:17 PM, Matt MacDonald <[email protected]> wrote:
> > Hi,
> ...
> >
> > 5) What value should I set for gora.buffer.read.limit? Currently it's
> > set to the default of 10000. During fetch steps #6-#12 nearly 50% of
> > the time was spent reading from HBase. I was seeing
> > gora.buffer.read.limit=10000 show up for several minutes in the logs.
> >
>
> Oddly enough we have little documentation on configuration properties
> for Gora, however for the time being please see the link below for an
> indication of buffered reads and writes in Gora
>
> http://techvineyard.blogspot.co.uk/2011/02/gora-orm-framework-for-hadoop-jobs.html#I_O_Frequency
>
> hth
> Lewis
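[Editor's sketch of the scanner settings Ferdy describes above, as an hbase-site.xml fragment; the property names are from the post, the caching value of 1000 and the 50100100-byte cap are the examples he gives:]

```xml
<!-- Set on the regionservers AND in the configuration of the submitted
     job; if hbase.client.scanner.max.result.size differs between the two
     sides, rows can go missing (see the gmane thread linked above). -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>1000</value>
</property>
<property>
  <name>hbase.client.scanner.max.result.size</name>
  <!-- roughly 50MB cap on the size of a single scanner response -->
  <value>50100100</value>
</property>
```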

