I guess this thread is not about Kafka Streams, but what Josh suggested is basically my last-resort plan compared to building it in Kafka Streams, as you'll be constrained by the HBase/Phoenix upsert rate (you'll be doing 5x the number of upserts).

In my experience, Kafka Streams is not bad at all at this kind of join, either windowed or KTable-based. As long as you're under ~100M rows per stream and have a few GB of disk space available per processing node, it should be doable.
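For what it's worth, here's a minimal sketch of the KTable variant. The topic names, String values, and broker address are made up; it assumes all five topics are keyed on the common id and co-partitioned:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class DenormalizeJob {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // Each topic becomes a KTable: the latest value per key is kept in a
    // local state store, so late-arriving keys still find their matches.
    KTable<String, String> t1 = builder.table("events1");
    KTable<String, String> t2 = builder.table("events2");
    KTable<String, String> t3 = builder.table("events3");
    KTable<String, String> t4 = builder.table("events4");
    KTable<String, String> t5 = builder.table("events5");

    // KTable-KTable joins require co-partitioned inputs (same key, same
    // partition count), so there is no cross-node shuffle of the records.
    KTable<String, String> denorm = t1
        .join(t2, (a, b) -> a + "|" + b)
        .join(t3, (ab, c) -> ab + "|" + c)
        .join(t4, (abc, d) -> abc + "|" + d)
        .join(t5, (abcd, e) -> abcd + "|" + e);

    denorm.toStream().to("denormalized");

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "denormalize-job");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    new KafkaStreams(builder.build(), props).start();
  }
}

Each KTable is materialized into a local RocksDB store by default, so a key that arrived in an earlier window still joins against later arrivals from the other streams.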
On Mon, 16 Apr 2018, 18:49 Rabin Banerjee, <dev.rabin.baner...@gmail.com> wrote:

> Thanks Josh!
>
> On Mon, Apr 16, 2018 at 11:16 PM, Josh Elser <els...@apache.org> wrote:
>
>> Please keep communication on the mailing list.
>>
>> Remember that you can execute partial-row upserts with Phoenix. As long
>> as you can generate the primary key from each stream, you don't need to
>> do anything special in Kafka Streams. You can just submit 5 UPSERTs (one
>> for each stream), and the Phoenix table will eventually have the
>> aggregated row when you are finished.
>>
>> On 4/16/18 1:30 PM, Rabin Banerjee wrote:
>>
>>> Actually, I haven't finalised anything; I'm just looking at different
>>> options.
>>>
>>> Basically, I want to join 5 streams to create a denormalized stream.
>>> The problem is that if Stream 1's output for the current window is keys
>>> 1, 2, 3, 4, 5, it may happen that the other streams have already
>>> emitted those keys in an earlier window, so I cannot join them with
>>> windowed Kafka Streams; I need to maintain the whole state of all the
>>> streams. So I need to look up keys 1, 2, 3, 4, 5 across all the streams
>>> and generate a combined record as close to real time as possible.
>>>
>>> On Mon, Apr 16, 2018 at 9:04 PM, Josh Elser <els...@apache.org> wrote:
>>>
>>> Short answer: no.
>>>
>>> You're going to be much better off de-normalizing your five tables
>>> into one table and eliminating the need for this JOIN.
>>>
>>> What made you decide to use Phoenix in the first place?
>>>
>>> On 4/16/18 6:04 AM, Rabin Banerjee wrote:
>>>
>>> Hi all,
>>>
>>> I am new to Phoenix. I wanted to know: if I have to join 5 huge
>>> tables that are all keyed on the same id (i.e. one id column is
>>> common between all of them), is there any optimization I can add to
>>> make the join faster, given that all the data for a particular key
>>> across all 5 tables will reside on the same region server?
>>>
>>> To explain it a bit more: suppose we have 5 streams, all sharing a
>>> common id we can join on, being stored in 5 different HBase tables.
>>> We want to join them with Phoenix, but we don't want a cross-region
>>> shuffle, since we already know the key is common to all 5 tables.
>>>
>>> Thanks //
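PS: for concreteness, a rough sketch of the partial-row upsert pattern Josh describes above, over the Phoenix JDBC driver. The DENORM table, its columns, and the ZooKeeper quorum are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PartialUpsert {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
      // Each stream writes only its own column(s) plus the shared key;
      // Phoenix merges the five partial rows into one full row.
      try (PreparedStatement ps = conn.prepareStatement(
          "UPSERT INTO DENORM (ID, C1) VALUES (?, ?)")) {
        ps.setString(1, "key-1");                // the common id
        ps.setString(2, "value-from-stream-1");  // this stream's column
        ps.executeUpdate();
      }
      // ... the other four streams do the same with (ID, C2) .. (ID, C5).
      conn.commit();  // Phoenix buffers mutations until commit
    }
  }
}

Since each stream touches only its own non-key columns, the row for a given id simply fills in as the streams emit.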