Very helpful, thanks Magnus. Should the documentation found in http://kafka.apache.org/documentation.html#messages be updated to reflect this format for messages?
So, to close this one out, this is the section in the protocol guide (https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets) I used:

  Variable Length Primitives

  bytes, string - These types consist of a signed integer giving a length N followed by N bytes of content. A length of -1 indicates null. string uses an int16 for its size, and bytes uses an int32.

  …

  MessageSet => [Offset MessageSize Message]
    Offset => int64
    MessageSize => int32

  Message => Crc MagicByte Attributes Key Value
    Crc => int32
    MagicByte => int8
    Attributes => int8
    Key => bytes
    Value => bytes

And therefore I can interpret the bytes from the hex dump as follows:

Bytes 0-7 (00 00 00 00 00 00 00 00) Offset
Bytes 8-11 (00 00 00 16) MessageSize: crc (4) + magic byte (1) + attributes (1) + key (4+0) + value (4+8) = 22 decimal, 16 hex
Bytes 12-15 (e9 75 38 bc) CRC
Byte 16 (00) magic byte
Byte 17 (00) attributes
Bytes 18-21 (ff ff ff ff) length field of the key; -1 means the key is null, so no key bytes follow
Bytes 22-25 (00 00 00 08) length field of the value
Bytes 26-33 (6d 65 73 73 61 67 65 31) the value bytes

> On Dec 3, 2015, at 4:14 PM, Magnus Edenhill <mag...@edenhill.se> wrote:
>
> Hi,
>
> messages are stored on disk in the Kafka (network) protocol format, so if
> you have a look at the protocol guide you'll see the pieces start coming
> together:
> https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets
>
> Regards,
> Magnus
>
>
>
> 2015-12-03 18:18 GMT+01:00 Steve Graham <sggraha...@gmail.com>:
>
>> I am attempting to understand the details of the content of the log
>> segment file in Kafka.
>>
>> The documentation (http://kafka.apache.org/081/documentation.html#log)
>> suggests:
>> The exact binary format for messages is versioned and maintained as a
>> standard interface so message sets can be transferred between producer,
>> broker, and client without recopying or conversion when desirable. This
>> format is as follows:
>>
>> On-disk format of a message
>>
>> message length : 4 bytes (value: 1+4+n)
>> "magic" value : 1 byte
>> crc : 4 bytes
>> payload : n bytes
>>
>>
>> But I am struggling to map the documentation to what I see on the disk.
>>
>> I created a topic, named sample-topic, and added one message to it (via
>> the console producer). The message payload was “message1”.
>>
>> The DumpLogSegments tool shows:
>> Dumping /tmp/kafka-logs/sample-topic-0/00000000000000000000.log
>> Starting offset: 0
>> offset: 0 position: 0 isvalid: true payloadsize: 8 magic: 0 compresscodec: NoCompressionCodec crc: 3916773564
>>
>> Taking a hex dump of the (only) log file:
>> sample-topic-0 sgg$ hexdump -C 00000000000000000000.log | more
>> 00000000  00 00 00 00 00 00 00 00  00 00 00 16 e9 75 38 bc  |.............u8.|
>> 00000010  00 00 ff ff ff ff 00 00  00 08 6d 65 73 73 61 67  |..........messag|
>> 00000020  65 31                                             |e1|
>> 00000022
>>
>> I tried to “reverse engineer” the contents, to see how it corresponds to
>> the documentation:
>>
>> Bytes 0-7 (00 00 00 00 00 00 00 00). I am not sure what this is, some
>> sort of filler?
>> Bytes 8-11 (00 00 00 16) seems to be some length field? Decimal 22, which
>> seems to correspond to the length of the entire message, but that is more
>> than the 1+4+n suggested by the documentation.
>> Bytes 12-15 (e9 75 38 bc) this corresponds to the CRC (decimal
>> 3916773564). No problem here.
>> Bytes 16-17 (00 00) not sure what this is.
>> Bytes 18-21 (ff ff ff ff) not sure what this is. A “magic number”? But
>> that should be just one byte. Must be something else?
>> Bytes 22-25 (00 00 00 08) is the message payload size (8), this is the
>> value of “n” in the formula for message length, exactly the length of the
>> “message1” string. No problem here.
>> Bytes 26-33 (6d 65 73 73 61 67 65 31) is the payload (ascii: message1).
>> No problem here.
>>
>> Can anyone on the list help me reconcile the documentation to what I see
>> on the disk? Specifically:
>> a) what are the first 8 bytes supposed to represent?
>> b) the message length field, described as 1+4+n, doesn’t correspond with
>> what I see on disk. It looks like 4 (crc) + 2 (??) + 4 (?magic number?) +
>> 4 (payload length) + 8 (n). What is the correct formula?
>> c) why does the CRC appear so early in the message (bytes 12-15), shouldn’t
>> the magic value appear before the CRC?
>> d) what is the way to interpret bytes 16-21? is the magic number in here
>> somewhere? What else is in this set of bytes?
>>
>> Thanks
>> sgg
>>
>>
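
P.S. For anyone who wants to check the byte-by-byte interpretation at the top of this message, here is a minimal Python sketch that decodes the 34 bytes from the hex dump using the MessageSet layout quoted from the protocol guide. The names LOG_BYTES, read_bytes and parse_message are my own, not part of any Kafka tool. The CRC is read as an unsigned value so it can be compared with the DumpLogSegments output, and the sketch assumes the CRC covers everything after the Crc field (magic byte through the end of the value), which is how I read the protocol guide; treat that part as an assumption.

```python
import struct
import zlib

# The 34 bytes shown by hexdump for 00000000000000000000.log.
LOG_BYTES = bytes.fromhex(
    "0000000000000000"   # Offset (int64)
    "00000016"           # MessageSize (int32) = 22
    "e97538bc"           # Crc
    "00"                 # MagicByte (int8)
    "00"                 # Attributes (int8)
    "ffffffff"           # Key length; -1 means the key is null
    "00000008"           # Value length
    "6d65737361676531"   # Value bytes ("message1")
)


def read_bytes(buf, pos):
    """Read a 'bytes' field: int32 length N, then N bytes; N == -1 is null."""
    (length,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    if length == -1:
        return None, pos
    return buf[pos:pos + length], pos + length


def parse_message(buf, pos=0):
    """Decode one [Offset MessageSize Message] entry (magic byte 0 layout)."""
    offset, size = struct.unpack_from(">qi", buf, pos)   # big-endian int64 + int32
    start = pos + 12                                      # first byte of Message
    crc, magic, attributes = struct.unpack_from(">Ibb", buf, start)
    key, p = read_bytes(buf, start + 6)
    value, p = read_bytes(buf, p)
    # Assumption: the CRC is computed over the bytes following the Crc field.
    computed = zlib.crc32(buf[start + 4:start + size]) & 0xFFFFFFFF
    return {
        "offset": offset, "size": size, "crc": crc, "crc_computed": computed,
        "magic": magic, "attributes": attributes,
        "key": key, "value": value, "end": p,
    }


if __name__ == "__main__":
    for name, field in parse_message(LOG_BYTES).items():
        print(name, field)
```

Comparing the printed crc and crc_computed values is a quick way to confirm whether that CRC assumption holds for this log segment.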