Sweet! Thanks for the tip :)
On Mon, Nov 4, 2013 at 5:10 PM, Dhaval Shah <[email protected]> wrote:

> You can use scan.setBatch() to limit the number of columns returned. Note
> that it will split up a row into multiple rows from a client's perspective,
> and client code might need to be modified to make use of the setBatch
> feature.
>
> Regards,
> Dhaval
>
>
> ________________________________
> From: Patrick Schless <[email protected]>
> To: user <[email protected]>
> Sent: Monday, 4 November 2013 6:03 PM
> Subject: Scanner Caching with wildly varying row widths
>
>
> We have an application where a row can contain anywhere between 1 and
> 3,600,000 cells (there's only 1 column family). In practice, most rows have
> under 100 cells.
>
> Now we want to run some mapreduce jobs that touch every cell within a range
> (e.g. count how many cells we have). With scanner caching set to something
> like 250, the job will chug along for a long time until it hits a row with
> a lot of data, and then it will die. Setting the cache size down to 1 (row)
> would presumably work, but it would take forever to run.
>
> We have addressed this by writing some jobs that use coprocessors, which
> allow us to pull back sets of cells instead of sets of rows, but this means
> we can't use any of the built-in jobs that come with hbase (e.g. CopyTable).
> Is there any way around this? Have other people had to deal with such high
> variability in their row sizes?
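
In case it helps anyone else following the thread, here's a rough, untested sketch of what I'm planning to try: combining setCaching with setBatch on a plain client-side scan to count cells. The table name "mytable" and the batch/caching numbers are just placeholders, not anything from our actual setup.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CellCountSketch {
    public static void main(String[] args) throws Exception {
        // "mytable" is a placeholder table name
        HTable table = new HTable(HBaseConfiguration.create(), "mytable");

        Scan scan = new Scan();
        scan.setCaching(250); // Results fetched per RPC
        scan.setBatch(100);   // at most 100 cells per Result, so a very wide row
                              // comes back as several Results instead of one

        long cells = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                // count cells rather than rows, since wide rows are now split up
                cells += result.size();
            }
        } finally {
            scanner.close();
            table.close();
        }
        System.out.println("total cells: " + cells);
    }
}

For the mapreduce jobs, I'm assuming the same Scan (with setBatch/setCaching set) can be passed to TableMapReduceUtil.initTableMapperJob, as long as the mapper counts cells instead of rows; the built-in jobs like CopyTable would presumably still need their own handling.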
