[ https://issues.apache.org/jira/browse/KAFKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Kreps updated KAFKA-595:
----------------------------

Labels: feature (was: feature features)

Description:
In 0.7 Kafka always appended messages to the log using whatever compression codec the client used. In 0.8, after the KAFKA-506 patch, the master always recompresses the message before appending to the log to assign ids. Currently the server uses a funky heuristic to choose a compression codec based on the codecs the producer used. This doesn't actually make that much sense. It would be better for the server to have its own compression setting (a global default and per-topic override) that specifies the compression codec, and have the server always recompress with this codec regardless of the original codec.

Compression currently happens in kafka.log.Log.assignOffsets (perhaps should be renamed if it takes on compression as an official responsibility instead of a side-effect).

was:
Compression can be used to store something in less space (less IO) and/or transfer it less expensively (better use of network bandwidth). Often the two go hand in hand, such as when compressed data is written to a disk: the disk I/O takes less time, since fewer bits are being transferred, and the storage occupied on the disk after the transfer is less. Unfortunately, the time to compress the data can exceed the savings gained from transferring less data, resulting in overall degradation.

After KAFKA-506, the network usage gains we used to get by compressing data at the producers are exceeded by the cost of decompressing and re-compressing data at the server side. Compression to save on network costs must be done either to reduce the contention in a wide-area network due to multiple point to point connections OR to efficiently transfer data over low-bandwidth networks (cross DC). In the case of producer-server connections, neither is typically true, which means we might not benefit from producer side compression at all in most production deployments of Kafka. On the contrary, it might actually hurt performance since most production deployments turn on compression for all topics.

The main benefit of compressing data in Kafka is to efficiently transfer data cross DC for setting up mirrored Kafka clusters. The performance benefit also holds for real time consumers, especially when there are multiple groups of consumers consuming the same topic. If data is compressed on the server side instead, which we do anyway, we can get the I/O savings as well as efficient network transfer on the server-consumer links.

I don't have numbers to quantify the performance impact of re-compression now, since there are other changes that need to be done to test this correctly. Thoughts?

Summary: Decouple producer side compression from server-side compression. (was: Producer side compression is unnecessary)

> Decouple producer side compression from server-side compression.
> ----------------------------------------------------------------
>
> Key: KAFKA-595
> URL: https://issues.apache.org/jira/browse/KAFKA-595
> Project: Kafka
> Issue Type: Improvement
> Affects Versions: 0.8
> Reporter: Neha Narkhede
> Labels: feature
>
> In 0.7 Kafka always appended messages to the log using whatever compression
> codec the client used. In 0.8, after the KAFKA-506 patch, the master always
> recompresses the message before appending to the log to assign ids. Currently
> the server uses a funky heuristic to choose a compression codec based on the
> codecs the producer used. This doesn't actually make that much sense. It
> would be better for the server to have its own compression (a global default
> and per-topic override) that specified the compression codec, and have the
> server always recompress with this codec regardless of the original codec.
> Compression currently happens in kafka.log.Log.assignOffsets (perhaps should
> be renamed if it takes on compression as an official responsibility instead
> of a side-effect).
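
The proposal in the description (a server-side codec with a global default and per-topic overrides, applied regardless of what the producer sent) could look roughly like the Python sketch below. It is an illustration only, not Kafka code: ServerCompressionConfig, codec_for, and recompress are hypothetical names, and only "none" and "gzip" are modelled because those need nothing beyond the standard library.

```python
import gzip

# Hypothetical codec table: name -> (compress, decompress).
# Only stdlib-backed codecs are modelled in this sketch.
CODECS = {
    "none": (lambda data: data, lambda data: data),
    "gzip": (gzip.compress, gzip.decompress),
}


class ServerCompressionConfig:
    """Hypothetical broker-side setting: a global default plus per-topic overrides."""

    def __init__(self, default_codec="gzip", topic_overrides=None):
        self.default_codec = default_codec
        self.topic_overrides = dict(topic_overrides or {})
        # Reject unknown codec names up front rather than at append time.
        for codec in [self.default_codec, *self.topic_overrides.values()]:
            if codec not in CODECS:
                raise ValueError(f"unknown codec: {codec}")

    def codec_for(self, topic):
        # A per-topic override wins; otherwise fall back to the global default.
        return self.topic_overrides.get(topic, self.default_codec)


def recompress(payload, producer_codec, topic, config):
    """Undo the producer's codec, then apply the server's codec for this topic.

    The producer's codec is used only to decode the payload; it has no
    influence on the codec chosen for storage.
    """
    _, decompress = CODECS[producer_codec]
    target_codec = config.codec_for(topic)
    compress, _ = CODECS[target_codec]
    return target_codec, compress(decompress(payload))
```

For example, with ServerCompressionConfig("gzip", {"raw-metrics": "none"}), a gzip-compressed payload sent to "raw-metrics" would be stored uncompressed, while an uncompressed payload sent to any other topic would be stored as gzip.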