What would you consider to be a message that is “too large”?
In April I ran a bunch of tests which I outlined in the following thread:
http://grokbase.com/t/kafka/users/145g8k62rf/performance-testing-data-to-share
It includes a Google Doc link with all the results (it’s easiest to download
it in Excel and use filters to drill into what you want). When looking at
Snappy vs NONE I didn’t see much improvement for the 2200-byte messages we
are looking at, and for small messages NONE was the fastest.

Running Kafka 0.8 on a three-node cluster: 16 cores, 256GB RAM, 12 x 4TB
drives.

Message.size = 2200
Batch.size = 400
Partitions = 12
Replication = 3
acks = leader

I was able to get…

SNAPPY = 151K messages per second
NONE = 140K messages per second
GZIP = 86K messages per second

With small messages of 200 bytes:

SNAPPY = 660K messages per second
NONE = 740K messages per second
GZIP = 340K messages per second

So let’s assume I can compress 2200 bytes into 200 bytes. (I’m just using
these numbers because I ran tests on these sizes; my guess is I will not get
compression this good, but it’s an example.) If I run uncompressed I could
process 140K messages per second. If I compressed in my application from
2200 to 200 bytes I could then send through Kafka at 740K events per second
(a rough sketch of that application-side batching is at the end of this
message).

Bert

On Thu, Jun 26, 2014 at 5:23 PM, Neha Narkhede <neha.narkh...@gmail.com>
wrote:

> Using a single Kafka message to contain an application snapshot has the
> upside of getting atomicity for free. Either the snapshot will be written
> as a whole to Kafka or not. This is poor man's transactionality. Care
> needs to be taken to ensure that the message is not too large, since that
> might cause memory consumption problems on the server or the consumers.
>
> As far as compression overhead is concerned, have you tried running
> Snappy? Snappy's performance is good enough to offset the
> decompression-compression overhead on the server.
>
> Thanks,
> Neha
>
>
> On Thu, Jun 26, 2014 at 12:42 PM, Bert Corderman <bertc...@gmail.com>
> wrote:
>
> > We are in the process of engineering a system that will be using Kafka.
> > The legacy system is using the local file system and a database as the
> > queue. In terms of scale we process about 35 billion events per day
> > contained in 15 million files.
> >
> > I am looking for feedback on a design decision we are discussing.
> >
> > In our current system we depend heavily on compression as a performance
> > optimization. In Kafka the use of compression has some overhead, as the
> > broker needs to decompress the data to assign offsets and re-compress it
> > (explained in detail here:
> > http://geekmantra.wordpress.com/2013/03/28/compression-in-kafka-gzip-or-snappy/
> > ).
> >
> > We are thinking about NOT using Kafka compression but rather compressing
> > multiple rows in our code. For example, let’s say we wanted to send data
> > in batches of 5,000 rows. Using Kafka compression we would use a batch
> > size of 5,000 rows and use compression. The other option is using a
> > batch size of 1 in Kafka BUT in our code take 5,000 messages, compress
> > them, and then send them to Kafka using the Kafka compression setting of
> > none.
> >
> > Is this a pattern others have used?
> >
> > Regardless of compression I am curious if others are using a single
> > message in Kafka to contain multiple messages from an application
> > standpoint.
> >
> > Bert
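
For reference, here is a rough, hypothetical sketch of how producer settings
along the lines of the benchmark above would typically look with the 0.8-era
(old) producer. The broker list and class name are placeholders I made up,
and the exact keys differ between producer versions (the newer producer uses
compression.type rather than compression.codec), so treat this as an
illustration rather than the actual test configuration:

import java.util.Properties;

import kafka.producer.ProducerConfig;

public class BenchmarkProducerConfig {

    // Approximates the settings described above: batch of 400, acks from the
    // leader only, and Snappy compression handled by the Kafka producer/broker.
    public static ProducerConfig snappyConfig() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092,broker3:9092"); // placeholder brokers
        props.put("serializer.class", "kafka.serializer.DefaultEncoder"); // raw byte[] payloads
        props.put("producer.type", "async");         // batching applies to the async producer
        props.put("batch.num.messages", "400");      // batch size = 400, as in the tests
        props.put("request.required.acks", "1");     // acks = leader
        props.put("compression.codec", "snappy");    // swap for "none" or "gzip" for the other runs
        return new ProducerConfig(props);
    }
}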
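
And a minimal sketch, under stated assumptions, of the application-side
alternative being discussed: pack a batch of rows into one payload,
Snappy-compress it in the application, and send it as a single Kafka message
with the Kafka codec set to none. The topic name ("events"), broker list,
5,000-row batch size, and the loadNextRows helper with its dummy 2200-byte
rows are all hypothetical placeholders, not anything from the original
system:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.xerial.snappy.Snappy;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class AppSideCompressionSketch {

    private static final int ROWS_PER_KAFKA_MESSAGE = 5000; // assumed batch size

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder brokers
        props.put("serializer.class", "kafka.serializer.DefaultEncoder"); // raw byte[] payloads
        props.put("request.required.acks", "1");   // acks from the leader only
        props.put("compression.codec", "none");    // no producer/broker-side recompression

        Producer<byte[], byte[]> producer =
                new Producer<byte[], byte[]>(new ProducerConfig(props));

        List<byte[]> rows = loadNextRows(ROWS_PER_KAFKA_MESSAGE);

        // Concatenate the rows with length prefixes so a consumer can split them
        // back out after decompressing.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(rows.size());
        for (byte[] row : rows) {
            out.writeInt(row.length);
            out.write(row);
        }
        out.flush();

        // Compress the whole batch once in application code, then send it as a
        // single uncompressed (from Kafka's point of view) message.
        byte[] compressed = Snappy.compress(buffer.toByteArray());
        producer.send(new KeyedMessage<byte[], byte[]>("events", compressed)); // "events" is a placeholder topic

        producer.close();
    }

    // Stand-in for however the application produces its serialized rows; here it
    // just fabricates dummy 2200-byte rows for illustration.
    private static List<byte[]> loadNextRows(int count) {
        List<byte[]> rows = new ArrayList<byte[]>(count);
        for (int i = 0; i < count; i++) {
            rows.add(new byte[2200]); // placeholder payload
        }
        return rows;
    }
}

On the consumer side the process would reverse: Snappy-uncompress the message
value and use the length prefixes to split it back into rows. One trade-off
is that Kafka only tracks an offset for the wrapper message, so per-row
offsets (and per-row keys for partitioning) are given up.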