Hi, messages are stored on disk in the Kafka (network) protocol format, so if you have a look at the protocol guide you'll see the pieces start coming together: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets
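To make that concrete, here is a rough Python sketch that decodes the bytes from your hexdump using the message set layout in the protocol guide for magic 0: an 8-byte offset, a 4-byte message size, then the message itself (crc, magic, attributes, key, value). The function names `read_bytes` and `parse_entry` are made up for illustration, and the CRC check assumes a standard CRC-32 over everything after the crc field, so treat it as a sketch and not a reference implementation.

```python
import struct
import zlib

# Bytes copied from the hexdump in the quoted message (34 bytes in total).
LOG_BYTES = bytes.fromhex(
    "00 00 00 00 00 00 00 00 00 00 00 16 e9 75 38 bc"
    "00 00 ff ff ff ff 00 00 00 08 6d 65 73 73 61 67"
    "65 31"
)


def read_bytes(buf, pos):
    """Read a 4-byte length-prefixed field; a length of -1 means null."""
    (length,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    if length < 0:
        return None, pos
    return buf[pos:pos + length], pos + length


def parse_entry(buf, pos=0):
    """Decode one message set entry (magic 0 layout) starting at pos."""
    # Message set header: int64 offset, int32 message size (big-endian).
    offset, size = struct.unpack_from(">qi", buf, pos)
    start = pos + 12  # first byte of the message itself
    # Message header: uint32 crc, int8 magic, int8 attributes.
    crc, magic, attributes = struct.unpack_from(">IBb", buf, start)
    key, p = read_bytes(buf, start + 6)
    value, p = read_bytes(buf, p)
    # Assumption: the crc covers the bytes from magic to the end of the message.
    computed = zlib.crc32(buf[start + 4:start + size]) & 0xFFFFFFFF
    return {
        "offset": offset, "size": size, "crc": crc, "magic": magic,
        "attributes": attributes, "key": key, "value": value,
        "crc_matches": computed == crc, "end": start + size,
    }


if __name__ == "__main__":
    for name, field in parse_entry(LOG_BYTES).items():
        print(name, "=", field)
```

Read this way, bytes 0-7 are the message offset, bytes 8-11 are the message size, bytes 16-17 are magic and attributes (one byte each), and bytes 18-21 are the key length, where ff ff ff ff (-1) means a null key. The message size is therefore 4 (crc) + 1 (magic) + 1 (attributes) + 4 (key length) + 4 (value length) + 8 (value) = 22.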
Regards,
Magnus

2015-12-03 18:18 GMT+01:00 Steve Graham <sggraha...@gmail.com>:

> I am attempting to understand the details of the content of the log
> segment file in Kafka.
>
> The documentation (http://kafka.apache.org/081/documentation.html#log)
> suggests:
> The exact binary format for messages is versioned and maintained as a
> standard interface so message sets can be transfered between producer,
> broker, and client without recopying or conversion when desirable. This
> format is as follows:
>
> On-disk format of a message
>
> message length : 4 bytes (value: 1+4+n)
> "magic" value : 1 byte
> crc : 4 bytes
> payload : n bytes
>
> But I am struggling to map the documentation to what I see on the disk.
>
> I created a topic, named simple-topic, and added one message to it (via
> the console producer). The message payload was “message1”.
>
> The DumpLogSegments tool shows:
> Dumping /tmp/kafka-logs/sample-topic-0/00000000000000000000.log
> Starting offset: 0
> offset: 0 position: 0 isvalid: true payloadsize: 8 magic: 0 compresscodec: NoCompressionCodec crc: 3916773564
>
> Taking a hex dump of the (only) log file:
> sample-topic-0 sgg$ hexdump -C 00000000000000000000.log | more
> 00000000  00 00 00 00 00 00 00 00  00 00 00 16 e9 75 38 bc  |.............u8.|
> 00000010  00 00 ff ff ff ff 00 00  00 08 6d 65 73 73 61 67  |..........messag|
> 00000020  65 31                                             |e1|
> 00000022
>
> I tried to “reverse engineer” the contents, to see how it corresponds to
> the documentation:
>
> Bytes 0-7 (00 00 00 00 00 00 00 00). I am not sure what this is, some
> sort of filler?
> Bytes 8-11 (00 00 00 16) seems to be some length field? Decimal 22, which
> seems to correspond to the length of the entire message, but more than
> 1+4+n than suggested by the documentation
> Bytes 12-15 (e9 75 38 bc) this corresponds to the CRC (decimal
> 3916773564). No problem here.
> Bytes 16-17 (00 00) not sure what this is.
> Bytes 18-21 (ff ff ff ff) not sure what this is. A “magic number”? But
> that should be just one byte. Must be something else?
> Bytes 22-25 (00 00 00 08) is the message payload size (8), this is the
> value of “n” in the formula for message length, exactly the length of the
> “message1” string. No problem here.
> Bytes 26-33 (6d 65 73 73 61 67 65 31) is the payload (ascii: message1).
> No problem here.
>
> Can anyone on the list help me reconcile the documentation to what I see
> on the disk? Specifically:
> a) what are the first 8 bytes supposed to represent?
> b) the message length field as described as 1+4+n doesn’t correspond with
> what I see on disk. It looks like 4 (crc) + 2 (??) + 4 (?magic number?) +
> 4 (payload length) + 8 (n). What is the correct formula?
> c) why does the CRC appear so early in the message (bytes 8-11), shouldn’t
> the magic value appear before the CRC?
> d) what is the way to interpret bytes 16-21? is the magic number in here
> somewhere? What else is in this set of bytes?
>
> Thanks
> sgg