[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-09-22 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r492506544



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Compress a buffer.
+   * @param allocator the allocator for allocating memory for compressed 
buffer.
+   * @param unCompressedBuffer the buffer to compress.
+   *   Implementation of this method should take care 
of releasing this buffer.
+   * @return the compressed buffer.
+   */
+  ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer);

Review comment:
   @emkornfield Do you think we can merge this now?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-09-21 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r492506544



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Compress a buffer.
+   * @param allocator the allocator for allocating memory for compressed 
buffer.
+   * @param unCompressedBuffer the buffer to compress.
+   *   Implementation of this method should take care 
of releasing this buffer.
+   * @return the compressed buffer.
+   */
+  ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer);

Review comment:
   @emkornfield Do you think we can merge this now?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-31 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r480613932



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Compress a buffer.
+   * @param allocator the allocator for allocating memory for compressed 
buffer.
+   * @param unCompressedBuffer the buffer to compress.
+   *   Implementation of this method should take care 
of releasing this buffer.
+   * @return the compressed buffer.
+   */
+  ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer);

Review comment:
   @emkornfield Thank you for starting this discussion and sharing your 
good ideas. 
   Your reasoning makes sense to me. 
   
   I guess I was looking at the problem from a different perspective. 
   
   IMO, the bottleneck of a compressing codec is the CPU resource, and the main 
purpose of compressing is to reduce memory/network bandwidth consumption.
   
   Given the above assumptions, we should try to do the compression as early as 
possible. The earliest possible place should be in the `getFieldBuffers` 
method. In this PR, we do it in `VectorUnLoader`, which is not the best, but 
close enough to the best. Similarly, we should try to do the decompression as 
late as possible. In this PR, we do it in `VectorLoader`, which is close to the 
optimal.
   
   Admittedly, we have additional copies after introducing the compression 
framework. However, both additional copies are based on the compressed data, 
with reduced data size, so the overhead should be small.
   
   The above reasoning is based on the assumption that the compression codec 
could effectively reduce the data size, which is not always true in practice. 
So I think we can make the decision based on the specific compression codec, 
and real benchmark data?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-28 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r479017596



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/ArrowRecordBatch.java
##
@@ -194,12 +190,17 @@ public int writeTo(FlatBufferBuilder builder) {
 int nodesOffset = FBSerializables.writeAllStructsToVector(builder, nodes);
 RecordBatch.startBuffersVector(builder, buffers.size());
 int buffersOffset = FBSerializables.writeAllStructsToVector(builder, 
buffersLayout);
-int compressOffset = bodyCompression.writeTo(builder);
+int compressOffset = 0;
+if (bodyCompression != null) {

Review comment:
   @emkornfield Thanks for your good idea. I have updated the code 
accordingly. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-11 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r468987274



##
File path: java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
##
@@ -76,6 +97,10 @@ private void appendNodes(FieldVector vector, 
List nodes, List

[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-11 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r468987159



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionUtil.java
##
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.flatbuf.BodyCompressionMethod;
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.vector.ipc.message.ArrowBodyCompression;
+
+/**
+ * Utilities for data compression/decompression.
+ */
+public class CompressionUtil {
+
+  private CompressionUtil() {
+  }
+
+  /**
+   * Creates the {@link ArrowBodyCompression} object, given the {@link 
CompressionCodec}.
+   * The implementation of this method should depend on the values of {@link 
CompressionType#names}.
+   */
+  public static ArrowBodyCompression createBodyCompression(CompressionCodec 
codec) {
+switch (codec.getCodecName()) {
+  case "default":
+return DefaultCompressionCodec.DEFAULT_BODY_COMPRESSION;
+  case "LZ4_FRAME":
+return new ArrowBodyCompression((byte) 0, 
BodyCompressionMethod.BUFFER);

Review comment:
   Changed to CompressionType.LZ4_FRAME and CompressionType.ZSTD, 
respectively. Thanks for your kind reminder. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-11 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r468986947



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java
##
@@ -408,11 +408,15 @@ public static ArrowRecordBatch 
deserializeRecordBatch(RecordBatch recordBatchFB,
   ArrowBuf vectorBuffer = body.slice(bufferFB.offset(), bufferFB.length());
   buffers.add(vectorBuffer);
 }
+
+ArrowBodyCompression bodyCompression =
+new ArrowBodyCompression(recordBatchFB.compression().codec(), 
recordBatchFB.compression().method());

Review comment:
   According to our current implementation. The compression object cannot 
be null.
   For the sake of safety, we added check here.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-11 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r468986440



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/DefaultCompressionCodec.java
##
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.flatbuf.BodyCompressionMethod;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.ipc.message.ArrowBodyCompression;
+
+/**
+ * The default compression codec that does no compression.
+ */
+public class DefaultCompressionCodec implements CompressionCodec {
+
+  public static final DefaultCompressionCodec INSTANCE = new 
DefaultCompressionCodec();
+
+  public static final byte COMPRESSION_TYPE = -1;
+
+  public static final ArrowBodyCompression DEFAULT_BODY_COMPRESSION =
+  new ArrowBodyCompression(COMPRESSION_TYPE, BodyCompressionMethod.BUFFER);
+
+  private DefaultCompressionCodec() {
+  }
+
+  @Override
+  public ArrowBuf compress(BufferAllocator allocator, ArrowBuf 
unCompressedBuffer) {

Review comment:
   I agree with your point. This makes it easier to incorporate the codec 
into the framework. 
   Added comments in the JavaDoc. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-11 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r468986032



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/DefaultCompressionCodec.java
##
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.flatbuf.BodyCompressionMethod;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.ipc.message.ArrowBodyCompression;
+
+/**
+ * The default compression codec that does no compression.
+ */
+public class DefaultCompressionCodec implements CompressionCodec {

Review comment:
   Agreed and accepted. NoCompressionCodec is a better name. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465621351



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java
##
@@ -403,16 +421,29 @@ public static ArrowRecordBatch 
deserializeRecordBatch(RecordBatch recordBatchFB,
   nodes.add(new ArrowFieldNode(node.length(), node.nullCount()));
 }
 List buffers = new ArrayList<>();
+long curOffset = 0L;
 for (int i = 0; i < recordBatchFB.buffersLength(); ++i) {
   Buffer bufferFB = recordBatchFB.buffers(i);
   ArrowBuf vectorBuffer = body.slice(bufferFB.offset(), bufferFB.length());
+  curOffset = bufferFB.offset() + bufferFB.length();
   buffers.add(vectorBuffer);
 }
+
+if (curOffset % 8 != 0) {
+  curOffset += 8 - curOffset % 8;
+}
+ArrowBodyCompression bodyCompression = null;

Review comment:
   In the revised PR, we no longer parse the buffer stream. Thanks for the 
suggestion. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465622563



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/ArrowBodyCompression.java
##
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.ipc.message;
+
+import org.apache.arrow.flatbuf.BodyCompression;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+
+/**
+ * Compression information about data written to a channel.
+ */
+public class ArrowBodyCompression implements FBSerializable {
+
+  /**
+   * Length of the serialized object.
+   */
+  public static final long BODY_COMPRESSION_LENGTH = 2L;

Review comment:
   Sounds reasonable. We have removed it in the revised PR. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465622358



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/ArrowBodyCompression.java
##
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.ipc.message;
+
+import org.apache.arrow.flatbuf.BodyCompression;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+
+/**
+ * Compression information about data written to a channel.
+ */
+public class ArrowBodyCompression implements FBSerializable {
+
+  /**
+   * Length of the serialized object.
+   */
+  public static final long BODY_COMPRESSION_LENGTH = 2L;
+
+  final byte[] data = new byte[(int) BODY_COMPRESSION_LENGTH];

Review comment:
   We have restored to separate named variables in the revised PR. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465621861



##
File path: 
java/flight/flight-core/src/main/java/org/apache/arrow/flight/ArrowMessage.java
##
@@ -353,6 +361,23 @@ private InputStream asInputStream(BufferAllocator 
allocator) {
 // the reference count
 b.getReferenceManager().retain();
   }
+
+  // add compression info, if any
+  if (bodyCompression != null) {
+ArrowBuf compBuf = 
allocator.buffer(ArrowBodyCompression.BODY_COMPRESSION_LENGTH);
+compBuf.setByte(0, bodyCompression.getCodec());

Review comment:
   In the reivsed PR, we no longer manually serilize/deserialize the table. 
Thanks. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465621532



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Compress a buffer.
+   * @param input the buffer to compress.
+   * @return the compressed buffer.
+   */
+  ArrowBuf compress(ArrowBuf input);

Review comment:
   Sure. The suggested name is better. Thank you.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465621351



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java
##
@@ -403,16 +421,29 @@ public static ArrowRecordBatch 
deserializeRecordBatch(RecordBatch recordBatchFB,
   nodes.add(new ArrowFieldNode(node.length(), node.nullCount()));
 }
 List buffers = new ArrayList<>();
+long curOffset = 0L;
 for (int i = 0; i < recordBatchFB.buffersLength(); ++i) {
   Buffer bufferFB = recordBatchFB.buffers(i);
   ArrowBuf vectorBuffer = body.slice(bufferFB.offset(), bufferFB.length());
+  curOffset = bufferFB.offset() + bufferFB.length();
   buffers.add(vectorBuffer);
 }
+
+if (curOffset % 8 != 0) {
+  curOffset += 8 - curOffset % 8;
+}
+ArrowBodyCompression bodyCompression = null;

Review comment:
   In the revised PR, we no longer parse the buffer stream.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465619410



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java
##
@@ -266,8 +266,15 @@ public static ArrowBlock serialize(WriteChannel out, 
ArrowRecordBatch batch, Ipc
 long bufferLength = writeBatchBuffers(out, batch);
 Preconditions.checkArgument(bufferLength % 8 == 0, "out is not aligned");
 
+long compLength = 0L;
+if (batch.getBodyCompression() != null) {
+  compLength = writeCompressionBody(out, batch.getBodyCompression());
+  Preconditions.checkArgument(compLength == 
ArrowBodyCompression.BODY_COMPRESSION_LENGTH,
+  "deserialized compression body length not equal to 
ArrowBodyCompression#BODY_COMPRESSION_LENGTH");

Review comment:
   This check has been removed too. Thanks. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465619212



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java
##
@@ -266,8 +266,15 @@ public static ArrowBlock serialize(WriteChannel out, 
ArrowRecordBatch batch, Ipc
 long bufferLength = writeBatchBuffers(out, batch);
 Preconditions.checkArgument(bufferLength % 8 == 0, "out is not aligned");
 
+long compLength = 0L;
+if (batch.getBodyCompression() != null) {
+  compLength = writeCompressionBody(out, batch.getBodyCompression());
+  Preconditions.checkArgument(compLength == 
ArrowBodyCompression.BODY_COMPRESSION_LENGTH,

Review comment:
   You are right. This check is removed. Thanks. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465618946



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/ArrowRecordBatch.java
##
@@ -232,6 +252,12 @@ public long computeBodyLength() {
   // round up size to the next multiple of 8
   size = DataSizeRoundingUtil.roundUpTo8Multiple(size);
 }
+
+if (bodyCompression != null) {

Review comment:
   Sounds reasonable. @rymurr had a similar suggestion.
   
   To make the bodyCompression non-null, we need a default compression codec 
that makes no compression/decompression. However, our specification does not 
support such an option now (Please see 
https://github.com/apache/arrow/blob/master/format/Message.fbs#L45-L53). 
Providing one would make the implementation not aligning with the specification.
   
   However, we have revised the PR to provide one, because it seems the C++ 
implementation is not aligning with the specification either (pls see 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.h#L33)
   
   After that, we can guarantee that the bodyCompression is never null, and 
some other problems also go away. 
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465613604



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Compress a buffer.
+   * @param input the buffer to compress.
+   * @return the compressed buffer.
+   */
+  ArrowBuf compress(ArrowBuf input);

Review comment:
   Sounds reasonable. I have revised the code to add allocator to the 
interface. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-08-05 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r465613822



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionUtility.java
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.flatbuf.BodyCompressionMethod;
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.vector.ipc.message.ArrowBodyCompression;
+
+/**
+ * Utilities for data compression/decompression.
+ */
+public class CompressionUtility {

Review comment:
   Agreed. I have changed the name to CompressionUtil.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-06-29 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r447442441



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Given a buffer, estimate the compressed size.
+   * Please note this operation is optional, and some compression methods may 
not support it.
+   *
+   * @param input the input buffer to be estimated.
+   * @return the estimated size of the compressed data.
+   */
+  long estimateCompressedSize(ArrowBuf input);

Review comment:
   Sounds reasonable. Thank you.
   I have revised the code to simplify the interface. Maybe we can add the 
optimization in the future, if we belive it is necessary. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-06-29 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r447441667



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionUtility.java
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.flatbuf.BodyCompressionMethod;
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.vector.ipc.message.ArrowBodyCompression;
+
+/**
+ * Utilities for data compression/decompression.
+ */
+public class CompressionUtility {
+
+  private CompressionUtility() {
+  }
+
+  /**
+   * Creates the {@link ArrowBodyCompression} object, given the {@link 
CompressionCodec}.
+   */
+  public static ArrowBodyCompression createBodyCompression(CompressionCodec 
codec) {
+if (codec == null) {
+  return null;
+}
+for (int i = 0; i < CompressionType.names.length; i++) {
+  if (CompressionType.names[i].equals(codec.getCodecName())) {

Review comment:
   Thanks for the suggestion. I have revised the code to implement it 
through a switch statement. Please check if it looks good.
   
   It looks clearer. However, we may need to change the code whenever we 
support a new compression method. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-06-09 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r436654884



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionUtility.java
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.flatbuf.BodyCompressionMethod;
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.vector.ipc.message.ArrowBodyCompression;
+
+/**
+ * Utilities for data compression/decompression.
+ */
+public class CompressionUtility {
+
+  private CompressionUtility() {
+  }
+
+  /**
+   * Creates the {@link ArrowBodyCompression} object, given the {@link 
CompressionCodec}.
+   */
+  public static ArrowBodyCompression createBodyCompression(CompressionCodec 
codec) {
+if (codec == null) {
+  return null;
+}
+for (int i = 0; i < CompressionType.names.length; i++) {
+  if (CompressionType.names[i].equals(codec.getCodecName())) {

Review comment:
   Thanks for your comment. I agree with you. 
   The problem is that class `org.apache.arrow.flatbuf.CompressionType` is 
automatically generated by flatbuf, and it is not implemented by an enum. 
Instead, it has separate fields for ordinals and names. In addition, it only 
provides function to convert ordinals to names, but not vice versa. 

##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Given a buffer, estimate the compressed size.
+   * Please note this operation is optional, and some compression methods may 
not support it.
+   *
+   * @param input the input buffer to be estimated.
+   * @return the estimated size of the compressed data.
+   */
+  long estimateCompressedSize(ArrowBuf input);

Review comment:
   Simply compress/decompress would be OK, of course. 
   
   The estimateCompressedSize/estimateDecompressedSize APIs are mainly provided 
for performance benefits. In particular, if we can estimate the sizes with 
accuracy, we will
   1. Avoid allocating memory spaces larger than necessary.
   2. Avoid reallocating memory space during compression/decompression
   3. Avoid data copy caused by reallocations. 
   
   Due to the above benefits, I would prefer to add the APIs to the interface, 
but I am open to others' opinons and willing to change my mind if necessary. 

##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java
##
@@ -300,6 +307,20 @@ public static long writeBatchBuffers(WriteChannel out, 
ArrowRecordBatch batch) t
 return out.getCurrentPosition() - bufferStart;
   }
 
+  /**
+   * Serialize the compression body.
+   */
+  public static long writeCompressionBody(
+  WriteChannel out, ArrowBodyCompression bodyCompression) throws 
IOException {
+long bufferStart = out.getCurrentPosition();
+ByteBuffer buf = ByteBuffer.allocate(2);

Review comment:
   I have revised the implementation of `ArrowBodyCompression ` so that it 
is based on byte[]. Thanks a lot for the good suggestion. 

##
File path: 
java/vecto