Hello,

During my internship at KLA Inc., we used Kafka as a streaming platform to transfer text and image data from a Linux machine to a Windows machine.
The data is in the form of discrete, non-serialized records containing text and image data as blobs. We serialize each record with FlatBuffers or Protocol Buffers and hand it to a Kafka producer, which pushes the data to the broker. On the Windows side, the data is consumed by 10 consumers per topic. There were two topics with 10 partitions each: one for the text data and one for the blob data. To simulate large volumes of data, we read a single file once and loop over its contents until we reach the desired volume. We tested it for 10K (722 MB), 100K (7.01 GB), 500K (35 GB), and 1M (69.7 GB) records, and it worked flawlessly. The issue occurred when we tried to simulate 5M records, which amounts to 348 GB. The text data still worked perfectly, but when it came to consuming the blob/image data, the consumer would not poll more than 1 or 2 records, whereas in the earlier runs the image consumer would poll >1000 records. I believe the producer also slowed down massively. We were unable to solve this issue, and I don't know whether it's a configuration issue, a problem with the systems, or a limit of Kafka itself.
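For context, the run sizes quoted above imply a roughly constant ~73 KB per record, and with 10 partitions per topic the 5M-record run leaves on the order of 35 GB in each blob partition's log. A quick back-of-the-envelope check (all record counts and totals are taken from the runs above; the per-partition figure assumes an even spread across partitions):

```python
# Sanity-check the quoted volumes: each run should imply a roughly
# constant per-record size (counts and totals from the runs above).
runs = {
    10_000: 722 * 1024**2,       # 722 MB
    100_000: 7.01 * 1024**3,     # 7.01 GB
    500_000: 35 * 1024**3,       # 35 GB
    1_000_000: 69.7 * 1024**3,   # 69.7 GB
    5_000_000: 348 * 1024**3,    # 348 GB
}
for count, total in runs.items():
    print(f"{count:>9,} records -> {total / count / 1024:.1f} KB/record")

# With 10 partitions per topic and an even spread, the 5M-record run
# puts roughly 348 GB / 10 = ~35 GB in each partition's log.
print(f"5M run: ~{348 / 10:.1f} GB per partition")
```

So the record size itself does not change between the runs; only the accumulated log volume per partition grows.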
We were monitoring the run through Prometheus and Grafana, and the only insight we got was that the slowdown started at around 2-3M records (140-210 GB).

These are the specifications of the systems used, along with the configuration of the producer, broker, and consumer:

*Linux Machine:*
Processor: Intel® Xeon® Processor E5-2658 @ 2.40 GHz
Storage: 2 TB SSD + 1 TB HDD
RAM: 128 GB
Operating System: SUSE Linux Enterprise Server 11

*Windows Machine:*
Processor: Intel® Xeon® Processor E5-2620 @ 2.00 GHz
RAM: 32 GB
Operating System: Windows Server 2008 R2 Standard

*Consumer Config:*
max.partition.fetch.bytes=1000000000
max.poll.records=1000000000
fetch.min.bytes=100000000
fetch.max.wait.ms=1000
fetch.max.bytes=1000000000
poll.ms=1000

*Producer Config:*
compression.type=snappy
batch.size=30000
linger.ms=10
buffer.memory=33554432

*Additional Points:*
- Only one broker was used.
- The transfer of records took place over InfiniBand, and the Kafka logs were on an SSD.
- The same issue occurred when we used a Spark consumer.

Any insight into why this occurred and how to solve it would be highly appreciated.

Best regards,
Sivaranjan M