The fastest might be to use local mode, and avoid even the first map-only job :)
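
For reference, a minimal sketch of such a local-mode run, assuming the STORE script has been saved to a file (the filename here is just a placeholder):

    pig -x local store_to_hbase.pig

In local mode Pig executes everything in a single local process, so there is no cluster job-submission overhead at all.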
You are right, for 10 keys it does not really matter. Even doing 1000s of
updates to the same row in #2 is still an in-memory update for HBase. The
actual cost of the HBase put() is probably slightly higher for #2, but it is a
negligible part of the rest of the overhead.

On Tue, Mar 6, 2012 at 10:24 AM, Norbert Burger <[email protected]> wrote:

> Hi folks --
>
> For a very sparse HBase table (2 column families, 1000s of columns), what's
> the expected performance difference in using HBaseStorage with the
> following two STORE methods? Note that in our use case, there are only a
> handful of unique rowkeys (approx 10).
>
> 1) GROUP BY the 1000s of columns by rowkey, and write only 10 very wide
> rows into HBase
> 2) Skip the GROUP BY, and just write the raw data as is. Conceptually,
> this seems like a rewrite on the 10 rowkeys, but we're writing a different
> column each time.
>
> Originally our processing was using approach #1, but I just modified it to
> use method #2, and I'm seeing a decent performance increase. I think much
> of the difference is the overhead of launching another Hadoop job, since
> GROUP BY is a blocking operator. Any thoughts?
>
> Norbert
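
For readers following along, here is a minimal Pig Latin sketch of the two approaches being compared. It is simplified to two columns (the real table is sparse with 1000s of columns), and the table name, input path, field names, and column mapping are illustrative, not taken from the original thread:

    -- Hypothetical input: many records per rowkey, each filling in
    -- different columns.
    raw = LOAD 'input_data' AS (rowkey:chararray, col_a:chararray, col_b:chararray);

    -- Approach #2: store records as-is (map-only job). HBaseStorage treats
    -- the first field as the row key, so repeated rowkeys become repeated
    -- puts against the same row, each writing different columns.
    STORE raw INTO 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:col_a cf1:col_b');

    -- Approach #1: GROUP BY rowkey first (adds a blocking reduce phase),
    -- then emit one wide tuple per rowkey. MAX is only a stand-in here for
    -- collapsing each sparse bag down to its non-null value per column.
    grouped = GROUP raw BY rowkey;
    wide = FOREACH grouped GENERATE group AS rowkey,
               MAX(raw.col_a) AS col_a,
               MAX(raw.col_b) AS col_b;
    STORE wide INTO 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:col_a cf1:col_b');

The structural difference is what the thread is discussing: approach #2 stays a single map-only job, while approach #1 pays for an extra shuffle/reduce before anything reaches HBase.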
