You could MapReduce the data while it's still in HDFS to compute a simple count per user, then insert those counts separately from the data. That would also reduce the number of increment calls (unless the number of distinct counter cells is close to the number of increments you would have to do, in which case pre-aggregation buys you little).
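A minimal sketch of that pre-aggregation step, with plain Python standing in for the MR job (the tab-separated log format and user-id field position are assumptions for illustration):

```python
from collections import Counter

def aggregate_hits(log_lines):
    """Combine-style pre-aggregation: produce one count per user instead
    of one increment per log row. This is what the MR job would emit
    before the counts are written to HBase."""
    counts = Counter()
    for line in log_lines:
        # Assumption: user id is the first tab-separated field of each hit.
        user_id = line.split("\t")[0]
        counts[user_id] += 1
    return counts

# With these counts in hand, you issue one increment (or one Put of the
# total) per user rather than one incrementColumnValue() per log row.
logs = ["u1\t/page/a", "u2\t/page/b", "u1\t/page/c"]
print(dict(aggregate_hits(logs)))  # {'u1': 2, 'u2': 1}
```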
J-D

On Thu, Jan 12, 2012 at 11:32 AM, Neil Yalowitz <[email protected]> wrote:
> Hi all,
>
> When performing a bulk import into HBase, what methods are available to
> increment a counter? To describe the problem: a large dataset comes in,
> and the most efficient way to get that data into an HBase table is to bulk
> load, as described here:
>
> http://hbase.apache.org/bulk-loads.html
>
> The stumbling block arises when a counter needs to be maintained that
> relates to the imported data. For our use case, each row of the input file
> is a user log hit, but we need to maintain a counter of how many hits we
> have accrued for each individual user so a separate job can take action if
> the "hits" exceed a certain threshold.
>
> Our current implementation does not use bulk import for this reason...
> instead, it uses an HTable.put() with batched flushes and subsequent
> incrementColumnValue() calls, which is very slow.
>
> An alternate idea was to bulk import the data and utilize the version count
> as a makeshift increment, but the followup job of "find rows where versions
> > 3" would result in a full table scan, since there is no way to filter a
> scan on "number of versions > x" (as far as I know).
>
> Any ideas? What techniques are other users utilizing to solve this problem?
