Ah, good call - so you really have an Avro wrapper around your single class, right? I.e., an array of records? Then when you hit a size you're happy with, you send it to the producer?
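For illustration, a minimal sketch of that wrapper pattern - the one-field "Impression" schema, class name, and field name here are placeholders, not from this thread:

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.List;

    public class ImpressionBatcher {

        // Hypothetical stand-in for the real "Impression" schema.
        static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Impression\","
          + "\"fields\":[{\"name\":\"attrX\",\"type\":\"string\"}]}");

        // Serializes N records into one Avro container payload, so the
        // schema is written once per batch rather than once per request.
        static byte[] toBatchPayload(List<GenericRecord> batch) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
                writer.create(SCHEMA, out);
                for (GenericRecord record : batch) {
                    writer.append(record);
                }
            }
            return out.toByteArray();  // value of a single Kafka message
        }
    }

On the consumer side, Avro's DataFileStream can iterate the records back out of each message value.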
On Mon, Mar 5, 2018 at 12:07 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:

> Good luck on your test!
>
> As for the batching within Avro and by the Kafka producer, here are my thoughts, without any empirical proof.
> There is a certain amount of overhead, in both execution time AND bytes, in converting a request record into Avro and producing (generating) a Kafka message out of it.
> For requests of 100-200 bytes, that overhead can be substantial - especially since you will be bundling the Avro schema with each request's Kafka message.
>
> By batching the requests, you significantly amortize that overhead across many rows.
>
> From: Matt Daum <m...@setfive.com>
> Date: Monday, March 5, 2018 at 5:54 AM
> To: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
> Cc: "users@kafka.apache.org" <users@kafka.apache.org>
> Subject: Re: Kafka Setup for Daily counts on wide array of keys
>
> Thanks for the suggestions! It does look like it's using local RocksDB stores for the state info by default. Will look into using an external one.
>
> As for the "millions of different values per grouped attribute": as an example, assume each request has a parameter "X", and at the end of each day I want to know the counts per unique value; it could have hundreds of millions of possible values.
>
> I'll hopefully start this week on an initial test of everything and will report back. A few last questions if you have the time:
> - For the batching of the Avro files, would this be different from the producer batching?
> - Are there any other gotchas to look out for, or configurations that would probably be good to tweak further?
>
> Thanks!
> Matt
>
> On Sun, Mar 4, 2018 at 11:23 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
>
> BTW - I did not mean to rule out Aerospike as a possible datastore.
> It's just that I am not familiar with it, but it surely looks like a good candidate to store the raw and/or aggregated data, given that it also has a Kafka Connect module.
>
> From: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
> Date: Sunday, March 4, 2018 at 9:25 PM
> To: Matt Daum <m...@setfive.com>
> Cc: "users@kafka.apache.org" <users@kafka.apache.org>
> Subject: Re: Kafka Setup for Daily counts on wide array of keys
>
> I don't have any experience/knowledge of the Kafka inbuilt datastore, but believe that for some portions of streaming, Kafka uses (used?) RocksDB to locally store some state info in the brokers.
>
> Personally, I would use an external datastore.
> There's a wide choice out there - from regular key-value stores like Cassandra, ScyllaDB, and RocksDB, and timeseries key-value stores like InfluxDB, to regular RDBMSes.
> If you have Hadoop in the picture, it's even possible to bypass a datastore completely (if appropriate) and store the raw data on HDFS, organized by (say) date+hour, by using periodic (minute to hourly) extract jobs that store the data in a Hive-compatible directory structure using ORC or Parquet.
>
> The reason for shying away from NoSQL datastores is their tendency to do compaction on the data, which leads to unnecessary reads and writes (referred to as write-amplification).
> With periodic jobs in Hadoop, you (usually) write your data once only.
> Of course, with that approach you lose "random/keyed access" to the data, but if you are only interested in the aggregations across various dimensions, those can be stored in a SQL/NoSQL datastore.
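To make the daily-count idea discussed above concrete, here is a minimal Kafka Streams sketch. It is a sketch only: the topic name "impressions", the store name "daily-counts", keying each message by the attribute value, and default serdes in the streams config are all assumptions, not from this thread.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Serialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.state.WindowStore;

    import java.util.concurrent.TimeUnit;

    public class DailyCountTopology {

        // Counts messages per unique key per day. State lands in a local
        // RocksDB-backed window store (the Streams default) named "daily-counts".
        static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("impressions")  // key = attribute value, e.g. parameter "X"
                .groupByKey(Serialized.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.of(TimeUnit.DAYS.toMillis(1)))
                .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("daily-counts"));
            return builder.build();
        }
    }

Whether hundreds of millions of distinct keys scale acceptably comes down to the size of that local store and its changelog topic, which is why the external-datastore discussion above matters.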
> As for "having millions of different values per grouped attribute" - I'm not sure what you mean by that.
> Is it that each record has some fields that represent different kinds of attributes, and that their domain can have millions to hundreds of millions of values?
> I don't think that should matter.
>
> From: Matt Daum <m...@setfive.com>
> Date: Sunday, March 4, 2018 at 2:39 PM
> To: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
> Cc: "users@kafka.apache.org" <users@kafka.apache.org>
> Subject: Re: Kafka Setup for Daily counts on wide array of keys
>
> Thanks! For the counts, I'd need to use a global table to make sure it's across all the data, right? Also, will having millions of different values per grouped attribute scale OK?
>
> On Mar 4, 2018 8:45 AM, "Thakrar, Jayesh" <jthak...@conversantmedia.com> wrote:
>
> Yes, that's the general design pattern. Another thing to look into is compressing the data. The Kafka consumer/producer can already do it for you, but we chose to compress in the applications due to a historic issue that degraded performance, although it has been resolved now.
> Also, just keep in mind that while you do your batching, the Kafka producer also tries to batch msgs to Kafka, and you will need to ensure you have enough buffer memory. That's all configurable, however.
> Finally, ensure you have the latest Java updates and have Kafka 0.10.2 or higher.
>
> Jayesh
>
> From: Matt Daum <m...@setfive.com>
> Sent: Sunday, March 4, 2018 7:06:19 AM
> To: Thakrar, Jayesh
> Cc: users@kafka.apache.org
> Subject: Re: Kafka Setup for Daily counts on wide array of keys
>
> We actually don't have a Kafka cluster set up yet at all. Right now we just have 8 of our application servers. We currently sample some impressions and then dedupe/count outside at a different DC, but are looking to try to analyze all impressions for some overall analytics.
>
> Our requests are around 100-200 bytes each. If we lost some of them due to network jitter etc., it would be fine; we're just trying to get an overall rough count of each attribute. Creating batched messages definitely makes sense and will also cut down on the network IO.
>
> We're trying to determine the required setup for Kafka to do what we're looking to do, as these are physical servers, so we'll most likely need to buy new hardware. For the first run, I think we'll try it out on one of our application clusters that gets a smaller amount of traffic (300-400k req/sec) and run the Kafka cluster on the same machines as the applications.
>
> So would the best route here be something like: each application server batches requests and sends them to Kafka; a stream consumer then tallies up the totals per attribute that we want to track and outputs them to a new topic, which then goes to a sink to either a DB or something like S3, which we then read into our external DBs?
>
> Thanks!
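A sketch of the producer-side knobs referred to above - the values shown are illustrative starting points, not recommendations from this thread:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    import java.util.Properties;

    public class BatchingProducer {

        static KafkaProducer<byte[], byte[]> create(String bootstrapServers) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            // The producer batches on top of any application-level Avro batching:
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144);        // bytes per partition batch
            props.put(ProducerConfig.LINGER_MS_CONFIG, 50);             // wait up to 50 ms to fill a batch
            props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);  // the buffer memory noted above
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // compresses whole batches
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
            return new KafkaProducer<>(props);
        }
    }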
> On Sun, Mar 4, 2018 at 12:31 AM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
>
> Matt,
>
> If I understand correctly, you have an 8-node Kafka cluster and need to support about 1 million requests/sec into the cluster from source servers, and you expect to consume that for aggregation.
>
> How big are your msgs?
>
> I would suggest looking into batching multiple requests per single Kafka msg to achieve the desired throughput.
> So, e.g., on the request-receiving systems, I would suggest creating a logical Avro file (byte buffer) of say N requests and then making that into one Kafka msg payload.
> We have a similar situation (https://www.slideshare.net/JayeshThakrar/apacheconflumekafka2016) and found anything from 4x to 10x better throughput with batching as compared to one request per msg.
> We have different kinds of msgs/topics, and the individual "request" size varies from about 100 bytes to 1+ KB.
>
> On 3/2/18, 8:24 AM, "Matt Daum" <m...@setfive.com> wrote:
>
> I am new to Kafka, but I think I have a good use case for it. I am trying to build daily counts of requests based on a number of different attributes in a high-throughput system (~1 million requests/sec across all 8 servers). The different attributes are unbounded in terms of values, and some will spread across hundreds of millions of values. This is my current thought process; let me know where I could be more efficient or if there is a better way to do it.
>
> I'll create an Avro object "Impression" which has all the attributes of the inbound request. My application servers will then, on each request, create and send this to a single Kafka topic.
>
> I'll then have a consumer which creates a stream from the topic. From there I'll use the windowed timeframes and groupBy to group by the attributes on each given day. At the end of the day I'd need to read the data store out to an external system for storage. Since I won't know all the values, I'd need something similar to KVStore.all() but for windowed KV stores. It appears this will be possible in 1.1 with this commit:
> https://github.com/apache/kafka/commit/1d1c8575961bf6bce7decb049be7f10ca76bd0c5
>
> Is this the best approach? Or would I be better off using the stream to listen, and then an external DB like Aerospike to store the counts and read out of it directly at the end of the day?
>
> Thanks for the help!
> Daum
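For completeness, a sketch of the end-of-day read-out that the linked 1.1 commit makes possible, assuming a windowed count materialized as "daily-counts" like the one described above (the store name and the destination of the rows are assumptions):

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.kstream.Windowed;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyWindowStore;

    public class DailyCountReader {

        // Iterates every key/window/count in the store - the windowed
        // equivalent of KVStore.all(), available from Kafka 1.1 onward.
        static void dumpDailyCounts(KafkaStreams streams) {
            ReadOnlyWindowStore<String, Long> store =
                streams.store("daily-counts", QueryableStoreTypes.<String, Long>windowStore());
            try (KeyValueIterator<Windowed<String>, Long> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<Windowed<String>, Long> entry = it.next();
                    // key = attribute value, window = the day, value = the count;
                    // this is where you'd ship rows to the external system.
                    System.out.printf("%s @ %d -> %d%n",
                        entry.key.key(), entry.key.window().start(), entry.value);
                }
            }
        }
    }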