Hi hongbin,

It depends on how much data you ingest. If the volume is big enough, I'd look at creating HFiles directly without MapReduce (for example using HFileOutputFormat's writer outside of a MapReduce job, or using HFile.Writer directly). The created files can then be imported straight into HBase with LoadIncrementalHFiles#doBulkLoad. You also need to make sure that your regions don't split too fast: a bulk load can only load HFiles whose KeyValues fall into one or two adjacent regions (better if automatic splits are disabled and splitting is done externally).
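To make that concrete, here is a rough sketch of the no-MapReduce path, assuming 0.98-era APIs. The table name ("mytable"), column family ("f"), qualifier ("q"), and paths are all made up, and note that a raw HFile.Writer skips some of the file metadata that HFileOutputFormat's writer would add, so test the load on your version first:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectHFileLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // doBulkLoad expects <dir>/<family>/<hfile>; all names here are made up.
    Path bulkDir = new Path("/tmp/bulk");
    Path hfilePath = new Path(new Path(bulkDir, "f"), "batch-0001");

    HFileContext ctx = new HFileContextBuilder().withBlockSize(64 * 1024).build();
    HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
        .withPath(fs, hfilePath)
        .withFileContext(ctx)
        .create();
    try {
      // KeyValues must be appended in sorted order; your workload already
      // guarantees every new key is larger than all existing ones.
      for (int i = 0; i < 1000; i++) {
        byte[] row = Bytes.toBytes(String.format("row-%010d", i));
        writer.append(new KeyValue(row, Bytes.toBytes("f"),
            Bytes.toBytes("q"), Bytes.toBytes("value-" + i)));
      }
    } finally {
      writer.close();
    }

    // Hand the whole directory to the region servers.
    HTable table = new HTable(conf, "mytable");
    try {
      new LoadIncrementalHFiles(conf).doBulkLoad(bulkDir, table);
    } finally {
      table.close();
    }
  }
}

doBulkLoad moves (not copies) the files into the region directories, so the regular write path and the memstores are bypassed entirely.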
But you need to be sure that you actually need such micromanagement and can't just stick with regular Puts. HBase can sustain quite a high rate of incoming data before you need to start worrying; a sketch of the batched-Put variant is below the quoted message.

Cheers.

On Fri, Feb 13, 2015 at 6:20 AM, hongbin ma <[email protected]> wrote:
> hi,
>
> I'm trying to use an HTable to store data that comes in a streaming fashion.
> The streaming data is guaranteed to have a larger KEY than ANY existing
> keys in the table, and the data will be READONLY.
>
> The data is streaming in at a very high rate, and I don't want to issue a PUT
> operation for each data entry, because that is obviously poor in performance.
> I'm thinking about pooling the data entries and flushing them to HBase every
> five minutes. AFAIK there are a few options:
>
> 1. Pool the data entries, and every 5 minutes run an MR job to convert the
> data to HFile format. This approach avoids the overhead of single PUTs,
> but I'm afraid the MR job might be too costly (waiting in the job queue) to
> keep pace.
>
> 2. Use HTableInterface.put(List<Put>). The batched version should be faster,
> but I'm not quite sure by how much.
>
> 3. ?
>
> Can anyone give me some advice on this?
> Thanks!
>
> hongbin

--
Andrey.
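Here is the batched-Put sketch mentioned above, again with made-up table, family, and qualifier names. HTableInterface.put(List<Put>) groups the Puts by region server, so one call costs roughly one RPC per server touched rather than one RPC per entry:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");  // table name is made up
    try {
      List<Put> buffer = new ArrayList<Put>();
      for (int i = 0; i < 10000; i++) {
        Put p = new Put(Bytes.toBytes(String.format("row-%010d", i)));
        p.add(Bytes.toBytes("f"), Bytes.toBytes("q"),
            Bytes.toBytes("value-" + i));
        buffer.add(p);
        // Flush every N entries; in your case this check could instead be
        // driven by the five-minute timer you described.
        if (buffer.size() >= 1000) {
          table.put(buffer);  // one batched call instead of 1000 single RPCs
          buffer.clear();
        }
      }
      if (!buffer.isEmpty()) {
        table.put(buffer);  // flush whatever is left
      }
    } finally {
      table.close();
    }
  }
}

You can get similar batching transparently by turning off autoflush and enlarging the client write buffer (HTable#setAutoFlush(false) plus HTable#setWriteBufferSize), at the cost of less control over when the flush happens.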
