Hey tsuna. I changed the algorithm significantly: I eliminated the "nested" loop and it now works lightning fast. I do the scans separately instead of nesting them.
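For what it's worth, the restructuring can be sketched roughly like this (plain Python standing in for the HBase client; the row shapes, join key, and helper names are made up for illustration, not the actual code):

```python
# Hypothetical sketch: instead of opening a fresh inner scan for every
# row of the outer scan (one scanner open/close per outer row), scan
# each table once and join the results in memory.

def nested_scan(outer_rows, scan_inner):
    # Old approach: one full inner scan per outer row.
    results = []
    for row in outer_rows:
        # scan_inner() stands in for opening, iterating, and closing
        # a scanner on the inner table -- done once per outer row.
        for inner in scan_inner(row["key"]):
            results.append((row, inner))
    return results

def separate_scans(outer_rows, inner_rows):
    # New approach: scan both tables once, then join on the key.
    by_key = {}
    for inner in inner_rows:          # single pass over the inner table
        by_key.setdefault(inner["key"], []).append(inner)
    results = []
    for row in outer_rows:            # single pass over the outer table
        for inner in by_key.get(row["key"], []):
            results.append((row, inner))
    return results
```

Both produce the same pairs, but the second version opens each scanner exactly once instead of once per outer row.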
Anyway, I have retained the old code for revisiting later, to find out why nested scans perform poorly (perhaps only on a single machine, i.e. pseudo-distributed mode).

> Does your table fit entirely in one region?  How big are the rows?
> Are you writing a lot to your table?  Are you typically inserting
> cells or overwriting stuff in existing ones?

No, it doesn't fit in one region; it has spawned several regions.
The rows are sparse: sometimes as huge as storing a web page in a column, and sometimes very small, just metadata.
Yes, I do overwrite entire rows often (after the proof of concept, this won't happen).

> Is your pseudo-distributed HBase running on a single machine?  If yes,
> why not use a non-distributed HBase setup (without HDFS)?

Yes, it is running on a single machine.
Good suggestion. I should set that up separately.

-Thanks,
Dani

On Tue, Jan 25, 2011 at 11:41 PM, tsuna <[email protected]> wrote:
> On Tue, Jan 25, 2011 at 2:14 PM, Dani Rayan <[email protected]> wrote:
> > But opening and closing the scanner inside this nested loop is taking
> > multiple seconds to complete on just 3000 rows :(
>
> Something is wrong with your cluster or the way you use it.  The
> overhead of opening / closing the scanner is normally absolutely
> negligible compared to the overhead to scan the full table, even with
> a table as small as just 3000 rows.
>
> Does your table fit entirely in one region?  How big are the rows?
> Are you writing a lot to your table?  Are you typically inserting
> cells or overwriting stuff in existing ones?
>
> Is your pseudo-distributed HBase running on a single machine?  If yes,
> why not use a non-distributed HBase setup (without HDFS)?
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
