I guess this thread is not about Kafka Streams, but what Josh suggested is basically my last-resort plan compared to building it in Kafka Streams, as you'll be constrained by the HBase/Phoenix upsert rate (you'll be doing 5x the number of upserts).

In my experience, Kafka Streams is not bad at all at this kind of join, either windowed or KTable-based. As long as you're under ~100M rows per stream and have a few GB of disk space available per processing node, it should be doable.
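For what it's worth, here's a minimal sketch of the KTable variant. The topic names, String values, and broker address are made up; it assumes all five topics are keyed on the common id and co-partitioned:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class DenormalizeJob {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // Each topic becomes a KTable: the latest value per key is kept in a
    // local state store, so late-arriving keys still find their matches.
    KTable<String, String> t1 = builder.table("events1");
    KTable<String, String> t2 = builder.table("events2");
    KTable<String, String> t3 = builder.table("events3");
    KTable<String, String> t4 = builder.table("events4");
    KTable<String, String> t5 = builder.table("events5");

    // KTable-KTable joins require co-partitioned inputs (same key, same
    // partition count), so there is no cross-node shuffle of the records.
    KTable<String, String> denorm = t1
        .join(t2, (a, b) -> a + "|" + b)
        .join(t3, (ab, c) -> ab + "|" + c)
        .join(t4, (abc, d) -> abc + "|" + d)
        .join(t5, (abcd, e) -> abcd + "|" + e);

    denorm.toStream().to("denormalized");

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "denormalize-job");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    new KafkaStreams(builder.build(), props).start();
  }
}

Each KTable is materialized into a local RocksDB store by default, so a key that arrived in an earlier window still joins against later arrivals from the other streams.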
On Mon, 16 Apr 2018, 18:49 Rabin Banerjee, <dev.rabin.baner...@gmail.com> wrote:

> Thanks Josh!
>
> On Mon, Apr 16, 2018 at 11:16 PM, Josh Elser <els...@apache.org> wrote:
>
>> Please keep communication on the mailing list.
>>
>> Remember that you can execute partial-row upserts with Phoenix. As long
>> as you can generate the primary key from each stream, you don't need to
>> do anything special in Kafka Streams. You can just submit 5 UPSERTs (one
>> for each stream), and the Phoenix table will eventually have the
>> aggregated row when you are finished.
>>
>> On 4/16/18 1:30 PM, Rabin Banerjee wrote:
>>
>>> Actually, I haven't finalised anything; I'm just looking at different
>>> options.
>>>
>>> Basically, I want to join 5 streams to create a denormalized stream.
>>> The problem is that if Stream 1's output for the current window is keys
>>> 1, 2, 3, 4, 5, it may happen that the other streams have already
>>> emitted those keys in an earlier window, so I cannot join them with
>>> windowed Kafka Streams; I need to maintain the whole state of all the
>>> streams. So I need to look up keys 1, 2, 3, 4, 5 across all the streams
>>> and generate a combined record as close to real time as possible.
>>>
>>> On Mon, Apr 16, 2018 at 9:04 PM, Josh Elser <els...@apache.org> wrote:
>>>
>>> Short answer: no.
>>>
>>> You're going to be much better off de-normalizing your five tables
>>> into one table and eliminating the need for this JOIN.
>>>
>>> What made you decide to use Phoenix in the first place?
>>>
>>> On 4/16/18 6:04 AM, Rabin Banerjee wrote:
>>>
>>> Hi all,
>>>
>>> I am new to Phoenix. I wanted to know: if I have to join 5 huge
>>> tables that are all keyed on the same id (i.e. one id column is
>>> common between all of them), is there any optimization I can add to
>>> make the join faster, given that all the data for a particular key
>>> across all 5 tables will reside on the same region server?
>>>
>>> To explain it a bit more: suppose we have 5 streams, all sharing a
>>> common id we can join on, being stored in 5 different HBase tables.
>>> We want to join them with Phoenix, but we don't want a cross-region
>>> shuffle, since we already know the key is common to all 5 tables.
>>>
>>> Thanks //
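PS: for concreteness, a rough sketch of the partial-row upsert pattern Josh describes above, over the Phoenix JDBC driver. The DENORM table, its columns, and the ZooKeeper quorum are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PartialUpsert {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
      // Each stream writes only its own column(s) plus the shared key;
      // Phoenix merges the five partial rows into one full row.
      try (PreparedStatement ps = conn.prepareStatement(
          "UPSERT INTO DENORM (ID, C1) VALUES (?, ?)")) {
        ps.setString(1, "key-1");                // the common id
        ps.setString(2, "value-from-stream-1");  // this stream's column
        ps.executeUpdate();
      }
      // ... the other four streams do the same with (ID, C2) .. (ID, C5).
      conn.commit();  // Phoenix buffers mutations until commit
    }
  }
}

Since each stream touches only its own non-key columns, the row for a given id simply fills in as the streams emit.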