Hi folks -- For a very sparse HBase table (2 column families, 1000s of columns), what's the expected performance difference between the following two STORE methods using HBaseStorage? Note that in our use case there are only a handful of unique rowkeys (approx 10).
1) GROUP BY rowkey across the 1000s of columns, and write only 10 very wide rows into HBase.
2) Skip the GROUP BY, and just write the raw data as is. Conceptually this looks like rewriting the same 10 rowkeys over and over, but each write touches a different column.

Originally our processing used approach #1, but I just modified it to use approach #2, and I'm seeing a decent performance increase. I think much of the difference is the overhead of launching another Hadoop job, since GROUP BY is a blocking operator. Any thoughts?

Norbert
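For concreteness, here's a rough Pig Latin sketch of the two approaches. The relation name, input path, table name, and the map-based schema are all assumptions on my part, not our actual script; the `cf:*` column spec is HBaseStorage's syntax for storing a whole map field into one column family.

```pig
-- Assumed input: one (rowkey, column, value) triple per record.
raw = LOAD 'input' AS (rowkey:chararray, col:chararray, val:chararray);

-- Approach #1: GROUP BY rowkey first (a blocking operator, so it forces
-- an extra MapReduce job), then emit one wide map per rowkey.
grouped = GROUP raw BY rowkey;
wide = FOREACH grouped GENERATE group AS rowkey,
       TOMAP(raw.col, raw.val) AS cols:map[];  -- hypothetical reshaping step
STORE wide INTO 'hbase://mytable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*');

-- Approach #2: skip the GROUP BY and store each record as its own Put;
-- HBase itself merges Puts for the same rowkey into one row.
narrow = FOREACH raw GENERATE rowkey, TOMAP(col, val) AS cols:map[];
STORE narrow INTO 'hbase://mytable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*');
```

In #2 the writes go out map-side with no shuffle or extra job, which would account for the speedup we're seeing.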
