[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/orc/pull/126


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118057983
  
--- Diff: c++/test/TestByteRLEEncoder.cc ---
@@ -0,0 +1,231 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "ByteRLE.hh"
+#include "MemoryOutputStream.hh"
+
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#include 
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateNotNull(uint64_t numValues,
+   uint64_t numNulls,
+   char ** notNull) {
+if (numNulls != 0 && notNull != nullptr) {
+  *notNull = new char[numValues];
+  memset(*notNull, 1, numValues);
+  while (numNulls > 0) {
+uint64_t pos = static_cast(std::rand()) % numValues;
+if ((*notNull)[pos]) {
+  (*notNull)[pos] = static_cast(0);
+  --numNulls;
+}
+  }
+}
+  }
+
+  char * generateData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
+for (uint64_t i = 0; i < numValues; ++i) {
+res[i] = static_cast(std::rand() % 256);
+}
+return res;
+  }
+
+  char * generateBoolData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
+for (uint64_t i = 0; i < numValues; ++i) {
+res[i] = static_cast(std::rand() % 2);
+  }
+return res;
+  }
+
+  void decodeAndVerify(
+   const MemoryOutputStream& memStream,
+   char * data,
+   uint64_t numValues,
+   char* notNull) {
+
+std::unique_ptr inStream(
+  new SeekableArrayInputStream(memStream.getData(), 
memStream.getLength()));
+
+std::unique_ptr decoder =
+  createByteRleDecoder(std::move(inStream));
+
+char* decodedData = new char[numValues];
--- End diff --

I don't see a huge difference here - we are manipulating raw buffer anyway. 
I would just leave them there unless you feel strongly about it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118057495
  
--- Diff: c++/test/TestByteRLEEncoder.cc ---
@@ -0,0 +1,231 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "ByteRLE.hh"
+#include "MemoryOutputStream.hh"
+
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#include 
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateNotNull(uint64_t numValues,
+   uint64_t numNulls,
+   char ** notNull) {
+if (numNulls != 0 && notNull != nullptr) {
+  *notNull = new char[numValues];
+  memset(*notNull, 1, numValues);
+  while (numNulls > 0) {
+uint64_t pos = static_cast(std::rand()) % numValues;
+if ((*notNull)[pos]) {
+  (*notNull)[pos] = static_cast(0);
+  --numNulls;
+}
+  }
+}
+  }
+
+  char * generateData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
--- End diff --

I have fixed these as well as the same issues in TestRLEv1Encoder.cc. See 
my next iteration.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048655
  
--- Diff: c++/src/RLEv1.hh ---
@@ -26,6 +26,59 @@
 
 namespace orc {
 
+class RleEncoderV1 : public RleEncoder {
+public:
+RleEncoderV1(std::unique_ptr outStream,
+ bool hasSigned);
+~RleEncoderV1();
+
+/**
+ * Encode the next batch of values.
+ * @param data the array to be written
+ * @param numValues the number of values to write
+ * @param notNull If the pointer is null, all values are writen. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+void add(const int64_t* data, uint64_t numValues,
+ const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+uint64_t getBufferSize() const override {
+return outputStream->getSize();
+}
+
+/**
+ * Flushing underlying BufferedOutputStream
+ */
+uint64_t flush() override;
+
+/**
+ * record current position
+ * @param recorder use the recorder to record current positions
+ */
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+private:
+std::unique_ptr outputStream;
+bool isSigned;
+int64_t* literals;
+int numLiterals;
+int64_t delta;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048627
  
--- Diff: c++/src/RLEv1.cc ---
@@ -26,8 +26,173 @@
 namespace orc {
 
 const uint64_t MINIMUM_REPEAT = 3;
+const uint64_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+
 const uint64_t BASE_128_MASK = 0x7f;
 
+const int MAX_DELTA = 127;
+const int MIN_DELTA = -128;
+const int MAX_LITERAL_SIZE = 128;
+
+RleEncoderV1::RleEncoderV1(
+  std::unique_ptr outStream,
+  bool hasSigned):
+  outputStream(std::move(outStream)) {
+  isSigned = hasSigned;
+  literals = new int64_t[MAX_LITERAL_SIZE];
--- End diff --

Same as above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048556
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
+
+void writeByte(char c);
+void writeValues();
+void write(char c);
+  };
+
+  ByteRleEncoderImpl::ByteRleEncoderImpl(
+std::unique_ptr 
output)
+  : outputStream(std::move(output)) {
+literals = new char[MAX_LITERAL_SIZE];
--- End diff --

thanks for catching that. I add a free() call. Regarding to RAII, I think 
it's a bit overkilled here. The array is guaranteed to be allocated in 
constructor and freed in destructor. The logic is quite clear. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048038
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
--- End diff --

done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048019
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
--- End diff --

done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118046740
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
+
+void writeByte(char c);
+void writeValues();
+void write(char c);
+  };
+
+  ByteRleEncoderImpl::ByteRleEncoderImpl(
+std::unique_ptr 
output)
+  : outputStream(std::move(output)) {
+literals = new char[MAX_LITERAL_SIZE];
+numLiterals = 0;
+tailRunLength = 0;
+repeat = false;
+bufferPosition = 0;
+bufferLength = 0;
+buffer = nullptr;
+  }
+
+  ByteRleEncoderImpl::~ByteRleEncoderImpl() {
+// PASS
+  }
+
+  void ByteRleEncoderImpl::writeByte(char c) {
+if (bufferPosition == bufferLength) {
+  int addedSize = 0;
+  if (!outputStream->Next(reinterpret_cast(), 
)) {
+throw std::bad_alloc();
--- End diff --

Next() is a method we inherit from protobuf ZeroCopyOutputStream. Current 
implementation of BufferedOutputStream doesn't return false, but we still want 
this check to comply with the function signature.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118044090
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
+
+void writeByte(char c);
+void writeValues();
+void write(char c);
+  };
+
+  ByteRleEncoderImpl::ByteRleEncoderImpl(
+std::unique_ptr 
output)
+  : outputStream(std::move(output)) {
+literals = new char[MAX_LITERAL_SIZE];
--- End diff --

I use unique_ptr here to indicate ByteRleEncoderImpl has full ownership of 
this buffer. Thanks for catching that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117873593
  
--- Diff: c++/src/RLEv1.cc ---
@@ -26,8 +26,173 @@
 namespace orc {
 
 const uint64_t MINIMUM_REPEAT = 3;
+const uint64_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+
 const uint64_t BASE_128_MASK = 0x7f;
 
+const int MAX_DELTA = 127;
+const int MIN_DELTA = -128;
+const int MAX_LITERAL_SIZE = 128;
+
+RleEncoderV1::RleEncoderV1(
+  std::unique_ptr outStream,
+  bool hasSigned):
+  outputStream(std::move(outStream)) {
+  isSigned = hasSigned;
+  literals = new int64_t[MAX_LITERAL_SIZE];
--- End diff --

use `std::vector` literals_buf as a buffer instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117874336
  
--- Diff: c++/test/TestByteRLEEncoder.cc ---
@@ -0,0 +1,231 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "ByteRLE.hh"
+#include "MemoryOutputStream.hh"
+
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#include 
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateNotNull(uint64_t numValues,
+   uint64_t numNulls,
+   char ** notNull) {
+if (numNulls != 0 && notNull != nullptr) {
+  *notNull = new char[numValues];
+  memset(*notNull, 1, numValues);
+  while (numNulls > 0) {
+uint64_t pos = static_cast(std::rand()) % numValues;
+if ((*notNull)[pos]) {
+  (*notNull)[pos] = static_cast(0);
+  --numNulls;
+}
+  }
+}
+  }
+
+  char * generateData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
--- End diff --

pass `res` as an argument. Let's allocate and deallocate at the same place.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117873801
  
--- Diff: c++/test/TestByteRLEEncoder.cc ---
@@ -0,0 +1,231 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "ByteRLE.hh"
+#include "MemoryOutputStream.hh"
+
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#include 
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateNotNull(uint64_t numValues,
+   uint64_t numNulls,
+   char ** notNull) {
+if (numNulls != 0 && notNull != nullptr) {
+  *notNull = new char[numValues];
+  memset(*notNull, 1, numValues);
+  while (numNulls > 0) {
+uint64_t pos = static_cast(std::rand()) % numValues;
+if ((*notNull)[pos]) {
+  (*notNull)[pos] = static_cast(0);
+  --numNulls;
+}
+  }
+}
+  }
+
+  char * generateData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
+for (uint64_t i = 0; i < numValues; ++i) {
+res[i] = static_cast(std::rand() % 256);
+}
+return res;
+  }
+
+  char * generateBoolData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
+for (uint64_t i = 0; i < numValues; ++i) {
+res[i] = static_cast(std::rand() % 2);
+  }
+return res;
+  }
+
+  void decodeAndVerify(
+   const MemoryOutputStream& memStream,
+   char * data,
+   uint64_t numValues,
+   char* notNull) {
+
+std::unique_ptr inStream(
+  new SeekableArrayInputStream(memStream.getData(), 
memStream.getLength()));
+
+std::unique_ptr decoder =
+  createByteRleDecoder(std::move(inStream));
+
+char* decodedData = new char[numValues];
--- End diff --

use `std::vector data_buf` as a buffer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117869507
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
+
+void writeByte(char c);
+void writeValues();
+void write(char c);
+  };
+
+  ByteRleEncoderImpl::ByteRleEncoderImpl(
+std::unique_ptr 
output)
+  : outputStream(std::move(output)) {
+literals = new char[MAX_LITERAL_SIZE];
--- End diff --

are you freeing this anywhere? RAII is a preferred technique for these 
instances.
Reference: http://en.cppreference.com/w/cpp/language/raii
add a `std::vector literals_buf` to the fields above. 
`literals_buf.resize(MAX_LITERAL_SIZE)` and point `literals = 
literals_buf.data()` instead to avoid using a smart pointer.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117874551
  
--- Diff: c++/test/TestByteRLEEncoder.cc ---
@@ -0,0 +1,231 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "ByteRLE.hh"
+#include "MemoryOutputStream.hh"
+
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#include 
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateNotNull(uint64_t numValues,
+   uint64_t numNulls,
+   char ** notNull) {
+if (numNulls != 0 && notNull != nullptr) {
+  *notNull = new char[numValues];
+  memset(*notNull, 1, numValues);
+  while (numNulls > 0) {
+uint64_t pos = static_cast(std::rand()) % numValues;
+if ((*notNull)[pos]) {
+  (*notNull)[pos] = static_cast(0);
+  --numNulls;
+}
+  }
+}
+  }
+
+  char * generateData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
+for (uint64_t i = 0; i < numValues; ++i) {
+res[i] = static_cast(std::rand() % 256);
+}
+return res;
+  }
+
+  char * generateBoolData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
+for (uint64_t i = 0; i < numValues; ++i) {
+res[i] = static_cast(std::rand() % 2);
+  }
+return res;
+  }
+
+  void decodeAndVerify(
+   const MemoryOutputStream& memStream,
+   char * data,
+   uint64_t numValues,
+   char* notNull) {
+
+std::unique_ptr inStream(
+  new SeekableArrayInputStream(memStream.getData(), 
memStream.getLength()));
+
+std::unique_ptr decoder =
+  createByteRleDecoder(std::move(inStream));
+
+char* decodedData = new char[numValues];
+decoder->next(decodedData, numValues, notNull);
+
+for (uint64_t i = 0; i < numValues; ++i) {
+  if (!notNull || notNull[i]) {
+EXPECT_EQ(data[i], decodedData[i]);
+  }
+}
+
+delete [] decodedData;
+  }
+
+  void decodeAndVerifyBoolean(
+   const MemoryOutputStream& memStream,
+   char * data,
+   uint64_t numValues,
+   char* notNull) {
+
+std::unique_ptr inStream(
+  new SeekableArrayInputStream(memStream.getData(), 
memStream.getLength()));
+
+std::unique_ptr decoder =
+  createBooleanRleDecoder(std::move(inStream));
+
+char* decodedData = new char[numValues];
+decoder->next(decodedData, numValues, notNull);
+
+for (uint64_t i = 0; i < numValues; ++i) {
+  if (!notNull || notNull[i]) {
+bool expect = data[i] != 0;
+bool actual = decodedData[i] != 0;
+EXPECT_EQ(expect, actual);
+  }
+}
+
+delete [] decodedData;
+  }
+
+  TEST(ByteRleEncoder, random_chars) {
+MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
+MemoryPool * pool = getDefaultPool();
+
+uint64_t capacity = 500 * 1024;
+uint64_t block = 1024;
+BufferedOutputStream bufStream(*pool, , capacity, block);
+
+std::unique_ptr outStream(
+new BufferedOutputStream(*pool, , capacity, block));
+
+std::unique_ptr encoder =
+  createByteRleEncoder(std::move(outStream));
+
+char* data = generateData(102400);
+encoder->add(data, 

[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117874400
  
--- Diff: c++/test/TestByteRLEEncoder.cc ---
@@ -0,0 +1,231 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "ByteRLE.hh"
+#include "MemoryOutputStream.hh"
+
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#include 
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateNotNull(uint64_t numValues,
+   uint64_t numNulls,
+   char ** notNull) {
+if (numNulls != 0 && notNull != nullptr) {
+  *notNull = new char[numValues];
--- End diff --

pass `notNull` as an argument. Let's allocate and deallocate at the same 
place.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117869240
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
--- End diff --

`char*` (remove space)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117873257
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
+
+void writeByte(char c);
+void writeValues();
+void write(char c);
+  };
+
+  ByteRleEncoderImpl::ByteRleEncoderImpl(
+std::unique_ptr 
output)
+  : outputStream(std::move(output)) {
+literals = new char[MAX_LITERAL_SIZE];
+numLiterals = 0;
+tailRunLength = 0;
+repeat = false;
+bufferPosition = 0;
+bufferLength = 0;
+buffer = nullptr;
+  }
+
+  ByteRleEncoderImpl::~ByteRleEncoderImpl() {
+// PASS
+  }
+
+  void ByteRleEncoderImpl::writeByte(char c) {
+if (bufferPosition == bufferLength) {
+  int addedSize = 0;
+  if (!outputStream->Next(reinterpret_cast(), 
)) {
+throw std::bad_alloc();
--- End diff --

When does `BufferedOutputStream::Next` return `false`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117869197
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr outputStream;
+char * literals;
--- End diff --

`char*` (remove space)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r117873692
  
--- Diff: c++/src/RLEv1.hh ---
@@ -26,6 +26,59 @@
 
 namespace orc {
 
+class RleEncoderV1 : public RleEncoder {
+public:
+RleEncoderV1(std::unique_ptr outStream,
+ bool hasSigned);
+~RleEncoderV1();
+
+/**
+ * Encode the next batch of values.
+ * @param data the array to be written
+ * @param numValues the number of values to write
+ * @param notNull If the pointer is null, all values are writen. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+void add(const int64_t* data, uint64_t numValues,
+ const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+uint64_t getBufferSize() const override {
+return outputStream->getSize();
+}
+
+/**
+ * Flushing underlying BufferedOutputStream
+ */
+uint64_t flush() override;
+
+/**
+ * record current position
+ * @param recorder use the recorder to record current positions
+ */
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+private:
+std::unique_ptr outputStream;
+bool isSigned;
+int64_t* literals;
+int numLiterals;
+int64_t delta;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
--- End diff --

`char*` (remove space)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-18 Thread xndai
GitHub user xndai opened a pull request:

https://github.com/apache/orc/pull/126

ORC-191 Implement RLE v1 encoder

Implement integer, byte and boolean RLE encoder, and add corresponding
UTs. Only v1 is implemented at this point.

Change-Id: Ia03b0d8bf98a843e45da8a5d349efeff9fb4fa00

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xndai/orc dev_rle

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #126






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---