Re: Kafka Setup for Daily counts on wide array of keys

2018-03-05 Thread Thakrar, Jayesh
Sorry Matt, I don’t have much idea about Kafka streaming (or any streaming, for that matter). As for saving counts from your application servers to Aerospike directly, that is certainly simpler, requiring less hardware, fewer resources, and less development effort. One reason some people use Kafka as part of…

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-05 Thread Matt Daum
And not to overthink this, but as I'm new to Kafka and streams I want to make sure it makes the most sense for my use case. With the streams and grouping, it looks like I'd be getting 1 internal topic created per grouped stream, which would then be written and reread, then totaled in the…
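
For context, a minimal sketch of what that grouping looks like in the Kafka Streams DSL (the "requests" topic, the attribute extraction, and the "attr-x" name are invented for illustration). Re-keying via groupBy is what forces the internal repartition topic: Streams writes the re-keyed records out to "<application.id>-attr-x-repartition" and reads them back before counting.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();
// Hypothetical input topic of raw request lines.
KStream<String, String> requests = builder.stream("requests");

KTable<String, Long> counts = requests
    // Hypothetical: the first CSV field is the attribute to count by.
    .groupBy((key, value) -> value.split(",")[0],
             Grouped.with("attr-x", Serdes.String(), Serdes.String()))
    .count();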

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-05 Thread Thakrar, Jayesh
Yep, exactly. So there is some buffering that you need to do in your client, and you also need to deal with edge cases. E.g., how long should you hold on to a batch before you send a smaller batch to the producer, since you want a balance between batch optimization and expedience? You may need to do some…
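
A rough sketch of that client-side buffering, just to make the trade-off concrete (the class name, topic, and thresholds are all made up). Note one of those edge cases: a real version also needs a background timer so a small, old batch still gets flushed when traffic goes quiet.

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class BatchBuffer {
    private final KafkaProducer<String, byte[]> producer;
    private final List<byte[]> pending = new ArrayList<>();
    private long firstAddMs;
    private static final int MAX_RECORDS = 500;   // flush when "full enough"...
    private static final long MAX_AGE_MS = 1000;  // ...or when the batch gets stale

    BatchBuffer(KafkaProducer<String, byte[]> producer) { this.producer = producer; }

    synchronized void add(byte[] record) {
        if (pending.isEmpty()) firstAddMs = System.currentTimeMillis();
        pending.add(record);
        if (pending.size() >= MAX_RECORDS
                || System.currentTimeMillis() - firstAddMs >= MAX_AGE_MS) {
            flush();
        }
    }

    synchronized void flush() {
        if (pending.isEmpty()) return;
        // encodeBatch is a placeholder, e.g. the Avro array wrapper discussed below.
        producer.send(new ProducerRecord<>("requests", encodeBatch(pending)));
        pending.clear();
    }

    private byte[] encodeBatch(List<byte[]> records) {
        return new byte[0];  // stub for whatever serialization you choose
    }
}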

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-05 Thread Matt Daum
Ah, good call, so you really have an Avro wrapper around your single class, right? I.e., an array of records, correct? Then when you hit a size you are happy with, you send it to the producer?
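
Such a wrapper might look as follows with Avro's generic API (a sketch; the "RequestBatch"/"Request" schema and the attrX field are invented for illustration):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

class RequestBatcher {
    // A batch record whose single field is an array of request records.
    static final Schema BATCH = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"RequestBatch\",\"fields\":[" +
        " {\"name\":\"requests\",\"type\":{\"type\":\"array\",\"items\":" +
        "  {\"type\":\"record\",\"name\":\"Request\",\"fields\":[" +
        "   {\"name\":\"attrX\",\"type\":\"string\"}]}}}]}");

    // Serialize the accumulated records as one message-sized payload.
    static byte[] encodeBatch(List<GenericRecord> items) throws IOException {
        GenericRecord batch = new GenericData.Record(BATCH);
        batch.put("requests", items);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(BATCH).write(batch, enc);
        enc.flush();
        return out.toByteArray();
    }
}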

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-05 Thread Thakrar, Jayesh
Good luck on your test! As for the batching within Avro and by the Kafka producer, here are my thoughts, without any empirical proof. There is a certain amount of overhead in terms of execution AND bytes in converting a request record into Avro and producing (generating) a Kafka message out of it.

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-05 Thread Matt Daum
Thanks for the suggestions! It does look like it's using local RocksDB stores for the state info by default. Will look into using an external one. As for the "millions of different values per grouped attribute," an example would be: assume on each request there is a parameter "X" which at the…
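
For reference, that default RocksDB-backed store can at least be named and configured through Materialized before swapping in anything external (a sketch; "requests" is the stream from the earlier example, and the store name is a placeholder):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

// Count into an explicitly named local state store ("attr-x-counts").
KTable<String, Long> counts = requests
    .groupBy((key, value) -> value.split(",")[0])
    .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("attr-x-counts")
               .withValueSerde(Serdes.Long()));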

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-04 Thread Thakrar, Jayesh
BTW, I did not mean to rule out Aerospike as a possible datastore. It's just that I am not familiar with it, but it surely looks like a good candidate to store the raw and/or aggregated data, given that it also has a Kafka Connect module.

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-04 Thread Thakrar, Jayesh
I don’t have any experience/knowledge of the Kafka inbuilt datastore, but believe that for some portions of streaming, Kafka uses (used?) RocksDB to locally store some state info on the stream-processing instances. Personally I would use an external datastore. There's a wide choice out there: regular key-value…

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-04 Thread Matt Daum
Thanks! For the counts, I'd need to use a global table to make sure it's across all the data, right? Also, will having millions of different values per grouped attribute scale OK?

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-04 Thread Thakrar, Jayesh
Yes, that's the general design pattern. Another thing to look into is compressing the data. The Kafka consumer/producer can already do it for you, but we chose to compress in the applications due to a historic issue that degraded performance, although it has been resolved now. Also, just…
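
For what it's worth, letting the producer do the compression is a one-line config these days (a sketch; the broker address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");            // or gzip/snappy
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.ByteArraySerializer");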

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-04 Thread Matt Daum
We actually don't have a Kafka cluster set up yet at all. Right now we just have our 8 application servers. We currently sample some impressions and then dedupe/count outside at a different DC, but are looking to try to analyze all impressions for some overall analytics. Our requests are around…

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-03 Thread Thakrar, Jayesh
Matt, if I understand correctly, you have an 8-node Kafka cluster and need to support about 1 million requests/sec into the cluster from source servers, and expect to consume that for aggregation. How big are your msgs? I would suggest looking into batching multiple requests per single Kafka message…
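
Besides batching in the application, the producer itself batches records per partition; two standard knobs trade latency for batch size (the values below are arbitrary examples):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.LINGER_MS_CONFIG, 50);          // wait up to 50 ms to fill a batch
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 256 * 1024); // up to 256 KB per partition batch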

Re: Kafka Setup for Daily counts on wide array of keys

2018-03-02 Thread Matt Daum
Actually it looks like the better way would be to output the counts to a new topic and then ingest that topic into the DB itself. Is that the correct way?
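
In Streams terms, that pattern might look like the sketch below (hedged; "counts" is the KTable from an aggregation and "daily-counts" is a made-up topic name). A Kafka Connect sink or a plain consumer can then load the output topic into the DB.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.Produced;

// Emit every count update as a changelog stream into an output topic.
counts.toStream()
      .to("daily-counts", Produced.with(Serdes.String(), Serdes.Long()));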

Kafka Setup for Daily counts on wide array of keys

2018-03-02 Thread Matt Daum
I am new to Kafka but I think I have a good use case for it. I am trying to build daily counts of requests based on a number of different attributes in a high-throughput system (~1 million requests/sec across all 8 servers). The different attributes are unbounded in terms of values, and some…
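
A hedged sketch of that use case with Kafka Streams' windowed aggregation (the topic name, attribute extraction, and serdes are placeholders):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> requests = builder.stream("requests");

// One count per attribute value per one-day window.
KTable<Windowed<String>, Long> dailyCounts = requests
    .groupBy((key, value) -> value.split(",")[0],  // hypothetical attribute extraction
             Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofDays(1)))
    .count();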