Hi, I'd like to add that 'gora.buffer.read.limit' has no significance with regard to the HBaseStore. What this property means is that after every N records, Gora closes and reopens the store's scanners (continuing at the last processed row). I'm not sure how this relates to other stores, but for HBase you might as well set it to a very high number, so that just one HBase scanner is used for a MapReduce task. (Just do not set it too low, or scanners are closed and reopened unnecessarily.)
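As a sketch of that advice, you could set the property like this in nutch-site.xml (the value here is illustrative, not a recommendation from the post):

```xml
<!-- Illustrative: make Gora's read buffer so large that its scanners are
     effectively never closed and reopened mid-task, so one HBase scanner
     serves the whole MapReduce task. -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>1000000</value>
</property>
```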
For HBase, the property that decides how many rows are read at once is hbase.client.scanner.caching (a client-side property). Setting this too high means that the regionservers are overloaded by the responses. (Check for responseTooLarge errors in the regionserver logs.) Because how fat the input rows are depends on the type of job, it is difficult to pick a single value for all Nutch jobs. For example, the GeneratorJob rows are slim (just a few small columns are read), but the ParserJob rows are fat, because of the content field that is read. What I have done is set hbase.client.scanner.caching to a high number, e.g. 1000, but set a limit on the SIZE in bytes of how big responses can be. This is determined by the property 'hbase.client.scanner.max.result.size', which you can set to 50100100 (50MB) or something like that. This property should be set both server side (regionserver) and client side (so defined within the properties of the submitted job), otherwise you get missing rows: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/27919

Ferdy.

On Wed, Oct 3, 2012 at 9:36 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Matt,
>
> I know there is a pile of stuff to add to this but for the time being
> (until I dive into your response in detail) please see below
>
> On Tue, Oct 2, 2012 at 11:17 PM, Matt MacDonald <[email protected]> wrote:
> > Hi,
> ...
> >
> > 5) What value should I set for gora.buffer.read.limit? Currently it's
> > set to the default of 10000. During fetch steps #6-#12 nearly 50% of
> > the time was spent reading from HBase. I was seeing
> > gora.buffer.read.limit=10000 show up for several minutes in the logs.
> >
>
> Oddly enough we have little documentation on configuration properties
> for Gora, however for the time being please see the link below for an
> indication of buffered reads and writes in Gora
>
> http://techvineyard.blogspot.co.uk/2011/02/gora-orm-framework-for-hadoop-jobs.html#I_O_Frequency
>
> hth
> Lewis
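[Editor's sketch of the scanner settings Ferdy describes above, as an hbase-site.xml fragment; the property names are from the post, the caching value of 1000 and the 50100100-byte cap are the examples he gives:]

```xml
<!-- Set on the regionservers AND in the configuration of the submitted
     job; if hbase.client.scanner.max.result.size differs between the two
     sides, rows can go missing (see the gmane thread linked above). -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>1000</value>
</property>
<property>
  <name>hbase.client.scanner.max.result.size</name>
  <!-- roughly 50MB cap on the size of a single scanner response -->
  <value>50100100</value>
</property>
```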

