Is this map creation happening on the client side? And how does it know which region server will hold that row key for a put operation without asking the .META. table? Does the HBase client first fetch the key ranges of each region server and then group the Put objects by region server?
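To make the question above concrete, here is a minimal, self-contained sketch of the idea being asked about: the client caches region boundaries (learned from the meta table) and routes each row key locally by finding the region whose start key is the greatest one not exceeding the row. This is illustrative only, not real HBase client code; the class name, method names, and server names are all made up.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of client-side row-key routing. After a one-time
// meta lookup, the client can resolve keys from its local cache.
public class RegionLocatorSketch {
    // Each region's start key -> the server hosting that region.
    private final TreeMap<String, String> cache = new TreeMap<>();

    public void cacheRegion(String startKey, String server) {
        cache.put(startKey, server);
    }

    // A row belongs to the region whose start key is the greatest
    // key that is <= the row key (TreeMap.floorEntry).
    public String locate(String rowKey) {
        Map.Entry<String, String> e = cache.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        RegionLocatorSketch loc = new RegionLocatorSketch();
        loc.cacheRegion("", "rs1");   // region ["", "8")
        loc.cacheRegion("8", "rs2");  // region ["8", end)
        System.out.println(loc.locate("3f2a")); // rs1
        System.out.println(loc.locate("9bcd")); // rs2
    }
}
```

In the real client the cached locations can go stale (e.g. after a region split or move), in which case a put fails, the cache entry is invalidated, and meta is consulted again.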
On Fri, Jul 17, 2015 at 7:48 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Internally AsyncProcess uses a Map which is keyed by server name:
>
>     Map<ServerName, MultiAction<Row>> actionsByServer =
>         new HashMap<ServerName, MultiAction<Row>>();
>
> Here MultiAction would group the Puts in your example which are destined
> for the same server.
>
> Cheers
>
> On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
>
>> Thanks!
>>
>> My key is random (hexadecimal), so a hot spot should not be created.
>>
>> Is there any concept of a bulk put? Say I want to raise one put request
>> for a batch of 1000, which will hit a region server once instead of
>> issuing an individual put for each key.
>>
>> Does HTable.put(List<Put>) handle batching of puts based on the region
>> server they will finally land on? Say in my batch there are 10 puts:
>> 5 for RS1, 3 for RS2 and 2 for RS3. Does it handle that?
>>
>> On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>
>>> You ask an interesting question…
>>>
>>> Let's set aside Spark and look at the overall ingestion pattern.
>>>
>>> It's really an ingestion pattern where your input into the system is
>>> from a queue.
>>>
>>> Are the events discrete or continuous? (This is kind of important.)
>>>
>>> If the events are continuous, then more than likely you're going to be
>>> ingesting data where the key is somewhat sequential. If you use put(),
>>> you end up with hot spotting, and you'll end up with regions half full.
>>> So you would be better off batching up the data and doing bulk imports.
>>>
>>> If the events are discrete, then you'll want to use put(), because the
>>> odds are you will not be using a sequential key. (You could, but I'd
>>> suggest that you rethink your primary key.)
>>>
>>> Depending on the rate of ingestion, you may want to do a manual flush.
>>> (It depends on the velocity of the data to be ingested and your use case.)
>>> (Remember what caching occurs, and where, when dealing with HBase.)
>>>
>>> A third option… Depending on how you use the data, you may want to
>>> avoid storing the data in HBase, and only use HBase as an index to
>>> where you store the data files for quick access. Again, it depends on
>>> your data ingestion flow and how you intend to use the data.
>>>
>>> So really this is less a Spark issue than an HBase issue when it comes
>>> to design.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>> >
>>> > Hi
>>> >
>>> > I have a requirement of writing to an HBase table from a Spark
>>> > streaming app after some processing.
>>> > Is the HBase put operation the only way of writing to HBase, or is
>>> > there any specialised connector or RDD of Spark for HBase writes?
>>> >
>>> > Should bulk load to HBase from a streaming app be avoided if the
>>> > output of each batch interval is just a few MBs?
>>> >
>>> > Thanks
>>>
>>> The opinions expressed here are mine; while they may reflect a cognitive
>>> thought, that is purely accidental.
>>> Use at your own risk.
>>> Michael Segel
>>> michael_segel (AT) hotmail.com