You could MapReduce the data while it's still in HDFS to compute a simple count per user, then insert those counts separately from the data. That would also reduce the number of increment calls (unless the number of distinct counter cells is close to the number of increments you would have to do, in which case pre-aggregation buys you little).
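A minimal sketch of that pre-aggregation step, with plain Python standing in for the MR job (the tab-separated log format and user-id field position are assumptions for illustration):

```python
from collections import Counter

def aggregate_hits(log_lines):
    """Combine-style pre-aggregation: produce one count per user instead
    of one increment per log row. This is what the MR job would emit
    before the counts are written to HBase."""
    counts = Counter()
    for line in log_lines:
        # Assumption: user id is the first tab-separated field of each hit.
        user_id = line.split("\t")[0]
        counts[user_id] += 1
    return counts

# With these counts in hand, you issue one increment (or one Put of the
# total) per user rather than one incrementColumnValue() per log row.
logs = ["u1\t/page/a", "u2\t/page/b", "u1\t/page/c"]
print(dict(aggregate_hits(logs)))  # {'u1': 2, 'u2': 1}
```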
J-D

On Thu, Jan 12, 2012 at 11:32 AM, Neil Yalowitz <[email protected]> wrote:
> Hi all,
>
> When performing a bulk import into HBase, what methods are available to
> increment a counter? To describe the problem: a large dataset comes in,
> and the most efficient way to get that data into an HBase table is to bulk
> load, as described here:
>
> http://hbase.apache.org/bulk-loads.html
>
> The stumbling block arises when a counter needs to be maintained that
> relates to the imported data. For our use case, each row of the input file
> is a user log hit, but we need to maintain a counter of how many hits we
> have accrued for each individual user so a separate job can take action if
> the "hits" exceed a certain threshold.
>
> Our current implementation does not use bulk import for this reason...
> instead, it uses an HTable.put() with batched flushes and subsequent
> incrementColumnValue() calls, which is very slow.
>
> An alternate idea was to bulk import the data and utilize the version count
> as a makeshift increment, but the followup job of "find rows where versions
> > 3" would result in a full table scan, since there is no way to filter a
> scan on "number of versions > x" (as far as I know).
>
> Any ideas? What techniques are other users utilizing to solve this problem?
