[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2016-01-27 Thread Dana Powers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119703#comment-15119703
 ] 

Dana Powers commented on KAFKA-1493:


Hi all - it appears that the header checksum (HC) byte is incorrect. The Kafka 
implementation hashes the magic bytes + header, but the spec says to hash only 
the header (not the magic bytes).

We are having some trouble encoding/decoding from non-Java clients because the 
framing must be munged before reading from / writing to Kafka. Is this known? I 
don't see another JIRA for it. Should I file separately or should this be 
reopened?
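
A minimal sketch of the mismatch described above, assuming the LZ4 frame layout of magic number, FLG byte, BD byte, then HC, where HC is the second byte of an xxHash32 digest. The FLG and BD values are hypothetical, and xxh32_short is a hand-rolled helper limited to inputs shorter than 16 bytes, not a library call.

```python
import struct

# xxHash32 primes, as given in the public xxHash specification.
P1, P2, P3, P4, P5 = 2654435761, 2246822519, 3266489917, 668265263, 374761393
MASK = 0xFFFFFFFF


def rotl(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK


def xxh32_short(data, seed=0):
    """xxHash32 for inputs shorter than 16 bytes (enough for a frame header)."""
    if len(data) >= 16:
        raise ValueError("this sketch only covers inputs shorter than 16 bytes")
    h = (seed + P5 + len(data)) & MASK
    i = 0
    while i + 4 <= len(data):          # consume 4-byte little-endian words
        (word,) = struct.unpack_from("<I", data, i)
        h = (rotl((h + word * P3) & MASK, 17) * P4) & MASK
        i += 4
    while i < len(data):               # consume the remaining single bytes
        h = (rotl((h + data[i] * P5) & MASK, 11) * P1) & MASK
        i += 1
    h ^= h >> 15                       # final avalanche
    h = (h * P2) & MASK
    h ^= h >> 13
    h = (h * P3) & MASK
    h ^= h >> 16
    return h


def header_checksum(data):
    """HC byte: (xxHash32(data) >> 8) & 0xFF."""
    return (xxh32_short(data) >> 8) & 0xFF


MAGIC = struct.pack("<I", 0x184D2204)  # LZ4 frame magic number, little-endian
FLG, BD = 0x60, 0x40                   # hypothetical: version 01, 64 KB blocks
descriptor = bytes([FLG, BD])

hc_spec = header_checksum(descriptor)           # hash only the descriptor
hc_kafka = header_checksum(MAGIC + descriptor)  # hash magic + descriptor
print("spec HC: 0x%02x, magic-included HC: 0x%02x" % (hc_spec, hc_kafka))
```

A client that follows the spec would have to rewrite the HC byte in both directions to interoperate, which is the munging mentioned above.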

> Use a well-documented LZ4 compression format and remove redundant LZ4HC option
> --
>
> Key: KAFKA-1493
> URL: https://issues.apache.org/jira/browse/KAFKA-1493
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.8.2.0
>Reporter: James Oliver
>Assignee: James Oliver
>Priority: Blocker
> Fix For: 0.8.2.0
>
> Attachments: KAFKA-1493.patch, KAFKA-1493.patch, 
> KAFKA-1493_2014-10-16_13:49:34.patch, KAFKA-1493_2014-10-16_21:25:23.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2016-01-27 Thread Magnus Edenhill (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119920#comment-15119920
 ] 

Magnus Edenhill commented on KAFKA-1493:


[~dana.powers] I can confirm this is the case, as you describe it. I suggest 
creating a new issue for this.
I have a patch that adds a new compression.type=lz4f with proper framing.



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2016-01-27 Thread Dana Powers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120072#comment-15120072
 ] 

Dana Powers commented on KAFKA-1493:


Filed KAFKA-3160.



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-16 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174217#comment-14174217
 ] 

James Oliver commented on KAFKA-1493:
-

Updated reviewboard https://reviews.apache.org/r/26658/diff/
 against branch origin/trunk

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: James Oliver
Assignee: Ivan Lyutov
Priority: Blocker
 Fix For: 0.8.2

 Attachments: KAFKA-1493.patch, KAFKA-1493.patch, 
 KAFKA-1493_2014-10-16_13:49:34.patch








[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-16 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174305#comment-14174305
 ] 

Jun Rao commented on KAFKA-1493:


James,

Thanks for the patch. There are a few things marked as todo in the patch. Are 
those required? Do you think you have time to finish the patch for 0.8.2?

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: James Oliver
Assignee: James Oliver
Priority: Blocker
 Fix For: 0.8.2

 Attachments: KAFKA-1493.patch, KAFKA-1493.patch, 
 KAFKA-1493_2014-10-16_13:49:34.patch








[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-16 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174356#comment-14174356
 ] 

James Oliver commented on KAFKA-1493:
-

Jun,

My pleasure. The TODOs are parts of the specification that are unimplemented, 
but are not required. I left them in there as hints if/when the spec is 
contributed back to lz4-java. The validation routines will disallow the use of 
any portion of the spec that is unimplemented, but it's totally usable.

What this implementation can do - compress & decompress messages using a 
64 KB/256 KB/1 MB/4 MB blockSize (64 KB by default) with optional block 
checksums (disabled by default).
What this implementation cannot do - decompress messages compressed by an 
implementation supporting some of the missing features. If this were to occur, 
a RuntimeException with detailed information will be thrown.
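
A rough sketch of that kind of validation, assuming the FLG/BD bit layout of the LZ4 frame descriptor (version in FLG bits 7-6, block checksum flag in bit 4, block size id in BD bits 6-4). Which features the patch actually rejects is not stated in this thread, so the "unsupported" set below is hypothetical, and parse_descriptor is an illustrative name rather than anything from the patch.

```python
# Block size ids from the LZ4 frame descriptor: size = 1 << (2 * id + 8).
BLOCK_SIZES = {4: 64 * 1024, 5: 256 * 1024, 6: 1024 * 1024, 7: 4 * 1024 * 1024}

# Hypothetical set of flags this sketch treats as unimplemented.
UNSUPPORTED_FLG_BITS = {
    0x08: "content size",
    0x04: "content checksum",
    0x01: "preset dictionary",
}


def parse_descriptor(flg, bd):
    """Validate FLG/BD and return the settings a decoder would need."""
    version = (flg >> 6) & 0x03
    if version != 1:
        raise RuntimeError("unsupported frame version: %d" % version)
    for bit, name in UNSUPPORTED_FLG_BITS.items():
        if flg & bit:
            raise RuntimeError("unsupported feature in FLG: %s" % name)
    size_id = (bd >> 4) & 0x07
    if size_id not in BLOCK_SIZES:
        raise RuntimeError("unsupported block size id: %d" % size_id)
    return {
        "block_size": BLOCK_SIZES[size_id],
        "block_checksum": bool(flg & 0x10),
    }


print(parse_descriptor(0x60, 0x40))  # 64 KB blocks, no block checksums
```

Failing loudly on an unknown flag, rather than ignoring it, is what keeps a partial implementation safe to use against a fuller one.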



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-16 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174473#comment-14174473
 ] 

Jun Rao commented on KAFKA-1493:


James,

Thanks for the answer. We can leave the TODOs there. The patch looks good to 
me. Could you look at the comments in the RB?



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-16 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174716#comment-14174716
 ] 

James Oliver commented on KAFKA-1493:
-

Updated reviewboard https://reviews.apache.org/r/26658/diff/
 against branch origin/trunk

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: James Oliver
Assignee: James Oliver
Priority: Blocker
 Fix For: 0.8.2

 Attachments: KAFKA-1493.patch, KAFKA-1493.patch, 
 KAFKA-1493_2014-10-16_13:49:34.patch, KAFKA-1493_2014-10-16_21:25:23.patch








[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-14 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171082#comment-14171082
 ] 

James Oliver commented on KAFKA-1493:
-

Sorry to not be more clear - I fixed a few spots related to the removal of the 
LZ4HC option, but left the I/O streams in Ivan's patch alone. Since I didn't 
have permissions to update Ivan's reviewboard, I created a new review.

1. This looks like Ivan's interpretation of the lz4-java block stream format.
2. We should use neither - the lz4-java impl was used previously (KAFKA-1456). 
Review by the community produced this issue. We need a real implementation of 
http://fastcompression.blogspot.com/2013/04/lz4-streaming-format-final.html

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: James Oliver
Assignee: Ivan Lyutov
Priority: Blocker
 Fix For: 0.8.2

 Attachments: KAFKA-1493.patch, KAFKA-1493.patch








[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-14 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171732#comment-14171732
 ] 

Jun Rao commented on KAFKA-1493:


James,

Thanks, got it now. Not sure how long it will take to get a real implementation 
of http://fastcompression.blogspot.com/2013/04/lz4-streaming-format-final.html. 
Should we just take out LZ4 in CompressionType and CompressionCodec in 0.8.2 so 
that people don't use it until it's fixed?



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-14 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171748#comment-14171748
 ] 

James Oliver commented on KAFKA-1493:
-

I implemented the OutputStream today. If I can't get the InputStream done and 
tested before I leave for vacation Thursday, IMO we should take it out.



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-13 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169616#comment-14169616
 ] 

James Oliver commented on KAFKA-1493:
-

Sure, I'll take a look at it now.

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: James Oliver
Assignee: Ivan Lyutov
Priority: Blocker
 Fix For: 0.8.2

 Attachments: KAFKA-1493.patch








[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170071#comment-14170071
 ] 

Jun Rao commented on KAFKA-1493:


James,

Thanks for the patch. A couple more questions.

1. The following header frame used in the patch doesn't seem to match exactly 
what's described in 
http://fastcompression.blogspot.com/2013/04/lz4-streaming-format-final.html. 
So, we are inventing our own header? Is that ok?
/*
* Message format:
* HEADER which consists of:
* 1) magic byte sequence (8 bytes)
* 2) compression method token (1 byte)
* 3) compressed length (4 bytes)
* 4) original message length (4 bytes)
* and compressed message itself
* Block size: 64 Kb
* */

2. If the io stream code in this patch is identical to that in lz4-java, could 
we just use lz4-java instead?
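
To make the header in point 1 concrete, here is a small sketch that packs and unpacks the four listed fields. The byte order (little-endian), the signedness of the length fields, and the sample magic and token values are assumptions for illustration; the comment quoted from the patch only gives the field widths.

```python
import struct

# 8-byte magic, 1-byte token, 4-byte compressed length, 4-byte original length.
# "<" means little-endian with no padding (an assumption, see above).
HEADER = struct.Struct("<8sBii")


def pack_header(magic, token, compressed_len, original_len):
    """Serialize the 17-byte per-block header."""
    return HEADER.pack(magic, token, compressed_len, original_len)


def unpack_header(buf):
    """Parse the header at the start of buf into a dict."""
    magic, token, compressed_len, original_len = HEADER.unpack_from(buf)
    return {
        "magic": magic,
        "token": token,
        "compressed_len": compressed_len,
        "original_len": original_len,
    }


# Hypothetical sample values.
raw = pack_header(b"LZ4Block", 0x20, 1200, 65536)
print(len(raw), unpack_header(raw))
```

At 17 bytes per 64 KB block the overhead is small; the concern raised in the question is that the layout is not the one in the linked streaming format post, so readers in other languages would have to be written specifically for it.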

Thanks,



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-10 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167153#comment-14167153
 ] 

Jun Rao commented on KAFKA-1493:


James,

Could you help review the format in Ivan's patch? Is the format used in 
KafkaLZ4BlockInputStream standard? I am wondering if there are libraries in 
other languages that support this format too. Thanks,



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-09 Thread Ivan Lyutov (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165275#comment-14165275
 ] 

Ivan Lyutov commented on KAFKA-1493:


Created reviewboard https://reviews.apache.org/r/26503/diff/
 against branch apache/trunk

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: James Oliver
Assignee: James Oliver
Priority: Blocker
 Fix For: 0.8.2

 Attachments: KAFKA-1493.patch








[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-03 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158291#comment-14158291
 ] 

Jun Rao commented on KAFKA-1493:


The easiest thing is probably to just take out LZ4 in CompressionType and 
CompressionCodec in 0.8.2. 

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: James Oliver
Assignee: James Oliver
Priority: Blocker
 Fix For: 0.8.2








[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-10-03 Thread Theo Hultberg (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158408#comment-14158408
 ] 

Theo Hultberg commented on KAFKA-1493:
--

If you're looking for a standard way to handle LZ4 there doesn't seem to be 
any, but Cassandra uses a 4 byte field for the uncompressed length and no 
checksum 
(https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/compress/LZ4Compressor.java).

I've seen varint used too in other projects, but in my opinion it's a pain to 
implement compared to just using an int, and for very little benefit. The 
drawbacks of a fixed 4-byte int are that small messages will use one or two 
bytes more, and that you can't handle compressed chunks of over a couple of 
gigabytes.
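
For comparison, a short sketch of the varint alternative, assuming the common base-128 encoding (7 data bits per byte, high bit set on every byte except the last). The function names are illustrative and not taken from any project mentioned here.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("varint sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, high bit clear
            return bytes(out)


def decode_varint(buf):
    """Decode a varint from the start of buf; return (value, bytes_consumed)."""
    result = 0
    shift = 0
    for i, b in enumerate(buf):
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i + 1
        shift += 7
    raise ValueError("truncated varint")


for length in (100, 300, 65536, 2**31 - 1):
    print(length, "->", len(encode_varint(length)), "byte(s) as varint, 4 as int")
```

Lengths below 128 take one byte and lengths below 16384 take two, which is where the "one or two bytes" saving for small messages comes from; a length near 2 GB needs five bytes, one more than the fixed int.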

Sorry for jumping into the discussion out of the blue, I just stumbled upon 
this while looking through the issues for 0.8.2. I've got very little 
experience with the Kafka codebase, but I'm the author of the Ruby driver for 
Cassandra and I recognized the issue. Hope this was helpful and I didn't 
completely miss the point.



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-09-20 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142132#comment-14142132
 ] 

Jun Rao commented on KAFKA-1493:


Since this is a blocker for 0.8.2, if we can't get this fixed in the next few 
days, I suggest that we just remove the documentation in producerConfig about 
LZ4 and leave LZ4 as an unsupported compression codec for now.



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-09-12 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131758#comment-14131758
 ] 

James Oliver commented on KAFKA-1493:
-

I have today to work on this; I will see how far I can get.

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: James Oliver
Priority: Blocker
 Fix For: 0.8.2








[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-09-04 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122037#comment-14122037
 ] 

Guozhang Wang commented on KAFKA-1493:
--

Could we still have that for 0.8.2?



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-06-16 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032832#comment-14032832
 ] 

James Oliver commented on KAFKA-1493:
-

Snappy's block (default size 32kb) compression format is this:
snappy codec header: 8-byte magic header, version [4-byte integer], min 
compatible version [4-byte integer]
compressed block 1: compressed data size [4-byte integer], compressed data
compressed block 2
...
Notable limitations: no checksum

If I understand the proposed format correctly, this is what you're suggesting:
uncompressed data size [n-byte varint], compressed data

While I would expect compressing an entire message as a single block would 
provide a better compression ratio than compressing smaller chunks, doing so 
for larger messages is going to cause serious performance problems.
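
A minimal reader for the Snappy block layout described above, as a sketch only. The magic bytes and the big-endian integer encoding are assumptions (the comment gives only the field sizes), and block payloads are treated as opaque bytes rather than being decompressed.

```python
import struct

MAGIC = b"\x82SNAPPY\x00"  # assumed 8-byte magic header


def iter_blocks(buf):
    """Yield each compressed block payload from a framed stream."""
    if buf[:8] != MAGIC:
        raise ValueError("bad magic header")
    # version and min compatible version, 4 bytes each (assumed big-endian)
    version, min_compatible = struct.unpack_from(">ii", buf, 8)
    pos = 16
    while pos < len(buf):
        (size,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        if size < 0 or pos + size > len(buf):
            raise ValueError("truncated or corrupt block at offset %d" % pos)
        yield buf[pos:pos + size]
        pos += size


# Two fake "compressed" payloads, just to exercise the framing.
stream = MAGIC + struct.pack(">ii", 1, 1)
for payload in (b"abc", b"de"):
    stream += struct.pack(">i", len(payload)) + payload

print(list(iter_blocks(stream)))
```

Because each block carries its own size, a reader can decompress one block at a time with a bounded buffer, which is the property a single whole-message block gives up.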

 Use a well-documented LZ4 compression format and remove redundant LZ4HC option
 --

 Key: KAFKA-1493
 URL: https://issues.apache.org/jira/browse/KAFKA-1493
 Project: Kafka
  Issue Type: Improvement
Reporter: James Oliver
 Fix For: 0.8.2






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-06-16 Thread Stephan Lachowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033211#comment-14033211
 ] 

Stephan Lachowsky commented on KAFKA-1493:
--

Given the way that the decoder works I think that storing the uncompressed size 
would be the appropriate thing to do. The compressed length can be inferred.  
This allows the reader of the stream to allocate the minimum required memory 
for a single-shot decode.

I've been looking at how the default blocksize is passed down to the various 
compression backends, the java and scala code paths look like they do different 
things.

The current java code passes the blocksize into the decoder from the Compressor 
constructor (Compressor.java:59 and 214).  It appears that MemoryRecords is the 
only user of the java code and it uses the constructor which doesn't explicitly 
pass a blocksize resulting in fallback to the (tiny) default of 1024.

The scala code path in CompressionFactory.scala appears to use just the default 
constructors for the existing stream wrapper, which means that the compressors 
will use their own internal default blocksizes.  It looks like the scala code 
has all the messages on heap already.



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-06-16 Thread Stephan Lachowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033215#comment-14033215
 ] 

Stephan Lachowsky commented on KAFKA-1493:
--

The lack of a checksum in the compressed data is not much of a drawback, IMHO: 
there is already a CRC32 over the entire message, including the compressed data.



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-06-16 Thread James Oliver (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033274#comment-14033274
 ] 

James Oliver commented on KAFKA-1493:
-

I agree that storing the uncompressed length as a varint makes logical sense 
for allocating the required heap space IFF the entire uncompressed message is 
destined for the heap. Otherwise, this strategy introduces unnecessary heap 
requirements. I also agree that the checksum doesn't buy us much... IMO LZ4 is 
mature enough to not worry about distortion, and as you mentioned we already 
checksum the compressed message to verify accurate transmission.

Looks like the LZ4 Java path doesn't pass that default blockSize to the 
underlying stream, which should be changed (if we go with the LZ4Block 
streams). That being said, the ultra-small block size is robbing 
performance...we should consider bumping it up to something in the 32-64kb 
range to improve our compression ratio and reduce block overhead.
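
The effect of block size on ratio and overhead can be illustrated with a quick experiment. This sketch uses zlib from the Python standard library purely as a stand-in for LZ4, a made-up repetitive payload, and a hypothetical 4-byte length prefix per block, so the absolute numbers mean nothing for Kafka; only the direction of the difference is the point.

```python
import struct
import zlib


def compress_in_blocks(data, block_size):
    """Compress data in independent blocks, each prefixed with its length."""
    out = bytearray()
    for start in range(0, len(data), block_size):
        block = zlib.compress(data[start:start + block_size])
        out += struct.pack("<i", len(block)) + block
    return bytes(out)


# Made-up, highly repetitive payload of about 320 KB.
payload = b"kafka log line: user clicked the button\n" * 8000

small = len(compress_in_blocks(payload, 1024))       # 1 KB blocks
large = len(compress_in_blocks(payload, 64 * 1024))  # 64 KB blocks

print("1 KB blocks:  %d bytes" % small)
print("64 KB blocks: %d bytes" % large)
```

Each independent block pays its own framing cost and starts with no history to match against, so many tiny blocks end up larger in total than a few big ones.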

We could just compress the entire message as [~alb...@stonethree.com] mentioned 
and document the heap requirements, but it doesn't look like any of the other 
compression codecs do so and I'm hesitant to change the way LZ4 would work... 
partially implementing 
https://docs.google.com/document/d/1gZbUoLw5hRzJ5Q71oPRN6TO4cRMTZur60qip-TE7BhQ/edit?pli=1
 might still be our best option.



[jira] [Commented] (KAFKA-1493) Use a well-documented LZ4 compression format and remove redundant LZ4HC option

2014-06-13 Thread Albert Strasheim (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030908#comment-14030908
 ] 

Albert Strasheim commented on KAFKA-1493:
-

What does the format look like for a Snappy compressed message?

One might simply need a varint-encoded field for the uncompressed length 
followed by a compressed block.

The LZ4 streaming format and the xxhash, etc. in there might be overkill.
