[GitHub] orc pull request #301: ORC-395: Support ZSTD in C++ writer/reader

2018-09-05 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/301#discussion_r215486726
  
--- Diff: c++/src/Compression.cc ---
@@ -899,6 +907,166 @@ DIAGNOSTIC_POP
 return static_cast(result);
   }
 
+  /**
+   * Block compression base class
+   */
+  class BlockCompressionStream: public CompressionStreamBase {
--- End diff --

Stream compression is not in this change yet?


---


[GitHub] orc pull request #301: ORC-395: Support ZSTD in C++ writer/reader

2018-09-05 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/301#discussion_r215486619
  
--- Diff: c++/src/Compression.cc ---
@@ -899,6 +907,166 @@ DIAGNOSTIC_POP
 return static_cast(result);
   }
 
+  /**
+   * Block compression base class
+   */
+  class BlockCompressionStream: public CompressionStreamBase {
+  public:
+BlockCompressionStream(OutputStream * outStream,
+   int compressionLevel,
+   uint64_t capacity,
+   uint64_t blockSize,
+   MemoryPool& pool)
+   : CompressionStreamBase(outStream,
+   compressionLevel,
+   capacity,
+   blockSize,
+   pool)
+   , compressorBuffer(pool) {
+  // PASS
+}
+
+virtual bool Next(void** data, int*size) override;
+virtual std::string getName() const override = 0;
+
+  protected:
+// compresses a block and returns the compressed size
+virtual uint64_t doBlockCompression() = 0;
+
+// return maximum possible compression size for allocating space for
+// compressorBuffer below
+virtual uint64_t estimateMaxCompressionSize() = 0;
+
+// should allocate max possible compressed size
+DataBuffer<unsigned char> compressorBuffer;
+  };
+
+  bool BlockCompressionStream::Next(void** data, int*size) {
+if (bufferSize != 0) {
+  ensureHeader();
+
+  // perform compression
+  size_t totalCompressedSize = doBlockCompression();
+
+  const unsigned char * dataToWrite = nullptr;
+  int totalSizeToWrite = 0;
+  char * header = outputBuffer + outputPosition - 3;
+
+  if (totalCompressedSize >= static_cast<size_t>(bufferSize)) {
+writeHeader(header, static_cast<size_t>(bufferSize), true);
+dataToWrite = rawInputBuffer.data();
+totalSizeToWrite = bufferSize;
+  } else {
+writeHeader(header, totalCompressedSize, false);
+dataToWrite = compressorBuffer.data();
+totalSizeToWrite = static_cast<int>(totalCompressedSize);
+  }
+
+  char * dst = header + 3;
+  while (totalSizeToWrite > 0) {
+if (outputPosition >= outputSize) {
--- End diff --

Assert that outputPosition is not larger than outputSize.
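
For illustration, a minimal self-contained sketch of the suggested invariant (names mirror the quoted diff, but this is not the actual writer code):

    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <cstring>

    // Copy `size` bytes into a fixed output buffer in chunks, asserting that
    // the write position never runs past the end of the buffer.
    void copyWithBoundCheck(const char* src, uint64_t size,
                            char* outputBuffer, uint64_t outputSize,
                            uint64_t& outputPosition) {
      while (size > 0) {
        assert(outputPosition <= outputSize);   // the suggested check
        if (outputPosition == outputSize) {
          // the real code would flush outputBuffer here before continuing
          outputPosition = 0;
        }
        uint64_t chunk = std::min(outputSize - outputPosition, size);
        std::memcpy(outputBuffer + outputPosition, src, chunk);
        outputPosition += chunk;
        src += chunk;
        size -= chunk;
      }
    }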


---


[GitHub] orc issue #268: ORC-363 Enable zstd decompression in ORC Java reader

2018-08-10 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/268
  
Hi all, I'd like to bring this up again. It's been almost another three months, 
and we still haven't seen the zstd Java library. I would suggest we move forward 
with adding zstd support in the C++ reader/writer and, at the same time, enable 
the Java reader via this change.


---


[GitHub] orc issue #300: [ORC-394][C++] Add addUserMetadata() function to C++ write

2018-08-10 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/300
  
Looks good. Thanks for adding this.


---


[GitHub] orc issue #293: ORC-388: Fix isSafeSubtract to use logic operator instead of...

2018-07-25 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/293
  
LGTM


---


[GitHub] orc issue #273: ORC-343 Enable C++ writer to support RleV2

2018-05-25 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/273
  
@majetideepak this is the RLEv2 change that was promised.

@yuruiz Could you also include some perf data obtained from offline testing?


---


[GitHub] orc issue #268: ORC-363 Enable zstd decompression in ORC Java reader

2018-05-17 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/268
  
The current solution is not perfect, but at least it gives us the ability 
to read zstd ORC files, which I believe is important from a compatibility 
perspective - our in-house system produces zstd ORC files that need to be consumed 
by Hive, Spark, etc. I am not sure when the zstd compressor will be available; 
it's probably another six months or a year away.

If we enable zstd in the C++ reader/writer first, and then enable the Java reader 
to consume zstd ORC files produced by the C++ writer, would you consider that an 
end-to-end test?
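
For reference, a hedged sketch of what that end-to-end check could look like on the write side, assuming ORC-395 adds a CompressionKind_ZSTD value to the existing public C++ writer API (the remaining calls are the current writer interface):

    #include "orc/OrcFile.hh"
    #include <memory>

    void writeZstdSample(const std::string& path) {
      std::unique_ptr<orc::OutputStream> out = orc::writeLocalFile(path);
      std::unique_ptr<orc::Type> schema(
          orc::Type::buildTypeFromString("struct<x:bigint>"));

      orc::WriterOptions options;
      options.setCompression(orc::CompressionKind_ZSTD);  // assumed new enum value

      std::unique_ptr<orc::Writer> writer =
          orc::createWriter(*schema, out.get(), options);
      std::unique_ptr<orc::ColumnVectorBatch> batch = writer->createRowBatch(1024);
      orc::StructVectorBatch& root =
          dynamic_cast<orc::StructVectorBatch&>(*batch);
      orc::LongVectorBatch& col =
          dynamic_cast<orc::LongVectorBatch&>(*root.fields[0]);
      for (uint64_t i = 0; i < 1024; ++i) {
        col.data[i] = static_cast<int64_t>(i);
      }
      root.numElements = col.numElements = 1024;
      writer->add(*batch);
      writer->close();
    }

The Java reader enabled by this change would then be pointed at the resulting file to verify decompression.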


---


[GitHub] orc pull request #268: ORC-363 Enable zstd decompression in ORC Java reader

2018-05-14 Thread xndai
GitHub user xndai opened a pull request:

https://github.com/apache/orc/pull/268

ORC-363 Enable zstd decompression in ORC Java reader

1. Upgrade aircompressor lib to 0.11
2. Enable Zstd decompression in Java reader
3. Zstd compression is still not available. An illegal argument exception will 
be thrown if the writer uses zstd compression.

Change-Id: Ic5492b09af4e1e51215e62ed29233729857c45ac

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xndai/orc ORC-363

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/268.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #268


commit 3401dd556ccce1492b12f0cc9f3e2f2cdfd9b848
Author: Xiening.Dai <xiening.dai@...>
Date:   2018-05-14T21:42:46Z

ORC-363 Enable zstd decompression in ORC Java reader

1. Upgrade aircompressor lib to 0.11
2. Enable Zstd decompression in Java reader
3. Zstd compression is still not available. An illegal argument exception will 
be thrown if the writer uses zstd compression.

Change-Id: Ic5492b09af4e1e51215e62ed29233729857c45ac




---


[GitHub] orc issue #245: ORC-161: Proposal for new decimal encodings and statistics.

2018-04-18 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/245
  
+1 for Gang's proposal.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-13 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r181530787
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
--- End diff --

Is this the final list for v2, or are we still working on it? I have one 
proposal to add to ORC v2, which is what I call a "clustered index". Basically, 
the writer can specify a sort property on one or more columns, and we then 
create an index section in the ORC file whose keys are the column value(s) and 
whose values are row numbers. To keep the index small, each row group gets one 
entry in the clustered index. This enables a new range-scan pattern where the 
reader provides lower and upper bounds on the column value(s). 

I can write up a detailed proposal for this.
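
To make the idea concrete, a purely illustrative sketch of the layout; none of these names exist in ORC today, and the real proposal would likely be expressed as protobuf messages in the file metadata:

    #include <cstdint>
    #include <string>
    #include <vector>

    // One entry per row group: the encoded value(s) of the sort column(s)
    // and the first row number of that row group.
    struct ClusteredIndexEntry {
      std::vector<std::string> keyValues;
      uint64_t firstRowNumber;
    };

    // The index section: which columns the writer sorted by, plus the
    // per-row-group entries. A reader holding lower/upper bounds on the sort
    // column(s) can binary-search the entries to find the row groups to scan.
    struct ClusteredIndex {
      std::vector<uint32_t> sortColumnIds;
      std::vector<ClusteredIndexEntry> entries;
    };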


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181149073
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two different cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

I think we should keep RLE v1 as an option. The C++ writer currently does 
not support RLE v2 (we are working on it), and we don't want the new decimal 
writer to have a dependency on that.


---


[GitHub] orc issue #243: Update the site with more information about developing.

2018-04-10 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/243
  
LGTM. Thanks for putting together a document.


---


[GitHub] orc issue #241: ORC-332: Add syntax version to orc_proto.proto

2018-04-09 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/241
  
LGTM


---


[GitHub] orc issue #240: ORC-331: [C++] Initial support of Windows/MSVC

2018-04-08 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/240
  
We should also add an MSVC build in Travis. Otherwise there's no way to 
verify it. 


---


[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

2018-03-24 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/233#discussion_r176909659
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -1194,9 +1194,8 @@ namespace orc {
 bool hasNull = false;
 for (uint64_t i = 0; i < numValues; ++i) {
   if (notNull == nullptr || notNull[i]) {
-// TimestampVectorBatch stores data in local timezone
-int64_t millsUTC =
-  timezone.convertToUTC(secs[i]) * 1000 + nanos[i] / 100;
+// TimestampVectorBatch stores data in UTC timezone
--- End diff --

Can you add a comment for TimestampVectorBatch regarding this requirement?
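
One possible wording for that comment, sketched against a simplified stand-in for the batch (the real definition lives in c++/include/orc/Vector.hh and uses DataBuffer members):

    #include <cstdint>
    #include <vector>

    struct TimestampVectorBatchSketch {
      // Seconds since the UTC epoch (1970-01-01 00:00:00 UTC). Callers must
      // fill this with UTC values; the column writer converts to the writer
      // timezone itself, and the reader converts back to UTC when reading.
      std::vector<int64_t> data;
      // Nanoseconds within each second, in the range 0..999999999.
      std::vector<int64_t> nanoseconds;
    };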


---


[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

2018-03-24 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/233#discussion_r176910052
  
--- Diff: c++/src/ColumnReader.cc ---
@@ -336,8 +336,7 @@ namespace orc {
   }
 }
 int64_t writerTime = secsBuffer[i] + epochOffset;
-secsBuffer[i] = writerTime +
-  writerTimezone.getVariant(writerTime).gmtOffset;
+secsBuffer[i] = writerTimezone.convertToUTC(writerTime);
--- End diff --

If it reads a file created by the Java writer with a writer timezone other than 
GMT, you are not going to get the same clock reading from ColumnPrinter, are 
you?


---


[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

2018-03-23 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/233
  
IMO the ideal approach to achieving the timestamp semantics is to have the 
caller pass in a session timezone and have the reader return the timestamp batch 
based on it. That would be similar to what we do with the Java reader today, 
except that the Java reader assumes the current thread's timezone is the session 
timezone, which in some cases can be wrong. The other option is to return 
timestamps in a fixed timezone (such as GMT, as the C++ reader does today) and 
document that behavior as part of the interface; the caller can always convert 
to the session timezone accordingly. But currently the Java and C++ readers 
behave differently, which can be misleading.
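
A rough, hypothetical sketch of the first option; neither setSessionTimezone() nor this exact shape exists in the C++ reader today:

    #include <string>

    class RowReaderOptionsSketch {
    public:
      // The timezone the caller's session runs in, e.g. "America/Los_Angeles".
      // The reader would convert timestamp values into this timezone instead
      // of a fixed GMT.
      RowReaderOptionsSketch& setSessionTimezone(const std::string& tzName) {
        sessionTimezone = tzName;
        return *this;
      }
      const std::string& getSessionTimezone() const { return sessionTimezone; }

    private:
      std::string sessionTimezone = "GMT";  // default matches current behavior
    };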


---


[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

2018-03-22 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/233
  
Sorry, I am confused after reading the discussions above. The key question 
I have is: do we implement ORC TIMESTAMP as SQL "TIMESTAMP with Timezone" or 
"TIMESTAMP without Timezone"? It seems to me that we implement it as the latter. 
That's why we went for a rather complicated design that involves a local 
epoch and logic to handle DST when moving between time zones. But 
@majetideepak's comment above stated that we wanted to implement TIMESTAMP as 
TIMESTAMP with Tz in ORC-10. This contradicts what I saw in the Java reader 
implementation, in which the timestamp value is adjusted per the reader time zone 
(TreeReaderFactory.java lines 987 to 992). 

So if my understanding is correct, meaning TIMESTAMP should be implemented 
as TIMESTAMP w/o Tz, then the current C++ reader has a bug: it always adjusts 
to GMT rather than the reader timezone (ColumnReader.cc lines 339-340). 

@omalley is probably the best person to answer this question...


---


[GitHub] orc issue #214: ORC-273: [C++] add range check to prevent bad memory access.

2018-02-06 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/214
  
LGTM.


---


[GitHub] orc issue #212: ORC-293: [C++] Fix RleEncoderV1 for case when sizeof(long) <...

2018-01-24 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/212
  
+1


---


[GitHub] orc pull request #212: ORC-293: [C++] Fix RleEncoderV1 for case when sizeof(...

2018-01-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/212#discussion_r163463896
  
--- Diff: c++/include/orc/Common.hh ---
@@ -69,7 +69,7 @@ namespace orc {
 UNKNOWN_WRITER = INT32_MAX
   };
 
-  enum CompressionKind {
+  enum CompressionKind : std::int64_t {
--- End diff --

We can potentially get rid of this enum and use proto::CompressionKind 
instead. It can be done in a separate change.


---


[GitHub] orc pull request #211: ORC-290 [C++] Update Readme to include C++ writer inf...

2018-01-18 Thread xndai
GitHub user xndai opened a pull request:

https://github.com/apache/orc/pull/211

ORC-290 [C++] Update Readme to include C++ writer info

Change-Id: Ic419427b7441c96f63221ba650f9965e5342680e

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xndai/orc orc_290

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/211.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #211


commit eb0effbf2a298adbd1a7d2b6545522d6e3a8cecd
Author: Xiening.Dai <xndai.git@...>
Date:   2018-01-18T22:08:43Z

ORC-290 [C++] Update Readme to include C++ writer info

Change-Id: Ic419427b7441c96f63221ba650f9965e5342680e




---


[GitHub] orc issue #199: ORC-276: [C++] Create a simple tool to import CSV files

2018-01-01 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/199
  
Hi @majetideepak , Gang is on vacation and will look into your feedback 
after he's back next week. Thx.


---


[GitHub] orc pull request #184: Orc 256 unmask range option

2017-12-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/184#discussion_r155623157
  
--- Diff: java/core/src/test/org/apache/orc/impl/mask/TestUnmaskRange.java 
---
@@ -0,0 +1,165 @@
+package org.apache.orc.impl.mask;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with 
this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the 
License.
+ * You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, 
WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations 
under
+ * the License.
+ */
+
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
+import org.junit.Test;
+
+import java.nio.charset.StandardCharsets;
+
+import static org.junit.Assert.assertEquals;
+
+/**
+ * Test Unmask option
+ */
+public class TestUnmaskRange {
+
+  public TestUnmaskRange() {
+super();
+  }
+
+  /* Test for Long */
+  @Test
+  public void testSimpleLongRangeMask() {
+RedactMaskFactory mask = new RedactMaskFactory("9", "", "0:2");
+long result = mask.maskLong(123456);
+assertEquals(123_999, result);
+
+// negative index
+mask = new RedactMaskFactory("9", "", "-3:-1");
+result = mask.maskLong(123456);
+assertEquals(999_456, result);
+
+// out of range mask, return the original mask
+mask = new RedactMaskFactory("9", "", "7:10");
+result = mask.maskLong(123456);
+assertEquals(99, result);
+
+  }
+
+  @Test
+  public void testDefaultRangeMask() {
+RedactMaskFactory mask = new RedactMaskFactory("9", "", "");
+long result = mask.maskLong(123456);
+assertEquals(99, result);
+
+mask = new RedactMaskFactory("9");
+result = mask.maskLong(123456);
+assertEquals(99, result);
+
+  }
+
+  @Test
+  public void testCCRangeMask() {
+long cc = 4716885592186382L;
+long maskedCC = 4716_7777_7777_6382L;
+// Range unmask for first 4 and last 4 of credit card number
+final RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", 
"0:3,-4:-1");
+long result = mask.maskLong(cc);
+
+assertEquals(String.valueOf(cc).length(), 
String.valueOf(result).length());
+assertEquals(4716_7777_7777_6382L, result);
+  }
+
+  /* Tests for Double */
+  @Test
+  public void testSimpleDoubleRangeMask() {
+RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:2");
+assertEquals(1237.77, mask.maskDouble(1234.99), 0.01);
+assertEquals(12377.7, mask.maskDouble(12345.9), 0.01);
+
+mask = new RedactMaskFactory("Xx7", "", "-3:-1");
+assertEquals(7774.9, mask.maskDouble(1234.9), 0.01);
+
+  }
+
+  /* test for String */
+  @Test
+  public void testStringRangeMask() {
+
+BytesColumnVector source = new BytesColumnVector();
+BytesColumnVector target = new BytesColumnVector();
+target.reset();
+
+byte[] input = "Mary had 1 little 
lamb!!".getBytes(StandardCharsets.UTF_8);
+source.setRef(0, input, 0, input.length);
+
+// Set a 4 byte chinese character (U+2070E), which is letter other
+input = "\uD841\uDF0E".getBytes(StandardCharsets.UTF_8);
+source.setRef(1, input, 0, input.length);
+
+RedactMaskFactory mask = new RedactMaskFactory("", "", "0:3, -5:-1");
+for(int r=0; r < 2; ++r) {
+  mask.maskString(source, r, target);
+}
+
+assertEquals("Mary xxx 9 xx xamb!!", new String(target.vector[0],
+target.start[0], target.length[0], StandardCharsets.UTF_8));
+assertEquals("\uD841\uDF0E", new String(target.vector[1],
+target.start[1], target.length[1], StandardCharsets.UTF_8));
+
+// test defaults, no-unmask range
+mask = new

[GitHub] orc pull request #184: Orc 256 unmask range option

2017-12-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/184#discussion_r154761512
  
--- Diff: 
java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
@@ -245,8 +271,8 @@ public void maskData(ColumnVector original, 
ColumnVector masked, int start,
 target.isNull[0] = source.isNull[0];
   } else {
 for(int r = start; r < start + length; ++r) {
-  target.vector[r] = maskLong(source.vector[r]) & mask;
-  target.isNull[r] = source.isNull[r];
+target.vector[r] = maskLong(source.vector[r]) & mask;
--- End diff --

Remove leading space. Same as below.


---


[GitHub] orc pull request #184: Orc 256 unmask range option

2017-12-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/184#discussion_r154762146
  
--- Diff: 
java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
@@ -619,7 +646,7 @@ public double maskDouble(double value) {
 } else if (posn < 0) {
   posn = -posn -2;
 }
-return DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn];
+return unmaskRangeDoubleValue(value,DIGIT_REPLACEMENT * base * 
DOUBLE_POWER_10[posn]);
--- End diff --

Add space after comma.


---


[GitHub] orc issue #198: Orc-272: Minor porting changes.

2017-12-07 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/198
  
LGTM


---


[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...

2017-12-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/199#discussion_r155596136
  
--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,411 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DELIMITER ','
+
+std::string extractColumn(std::string s, uint64_t colIndex) {
+  uint64_t col = 0;
+  size_t start = 0;
+  size_t end = s.find(DELIMITER);
+  while (col < colIndex && end != std::string::npos) {
+start = end + 1;
+end = s.find(DELIMITER, start);
+++col;
+  }
+  return s.substr(start, end - start);
--- End diff --

You will subtract from string::npos when the last column doesn't end with a 
delimiter, or when the number of columns in the CSV is less than what the 
schema specifies. We need to handle these cases better.
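
As a hedged sketch of safer handling (illustrative only; it reports a missing column instead of silently returning the tail of the line, and takes the delimiter as a parameter rather than the #define):

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    std::string extractColumnChecked(const std::string& s, uint64_t colIndex,
                                     char delimiter = ',') {
      size_t start = 0;
      // Walk past colIndex delimiters; fail if the row runs out of columns.
      for (uint64_t col = 0; col < colIndex; ++col) {
        size_t next = s.find(delimiter, start);
        if (next == std::string::npos) {
          throw std::runtime_error("CSV row has no column " +
                                   std::to_string(colIndex));
        }
        start = next + 1;
      }
      size_t end = s.find(delimiter, start);
      return end == std::string::npos ? s.substr(start)
                                      : s.substr(start, end - start);
    }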


---


[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...

2017-12-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/199#discussion_r155593796
  
--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,411 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DELIMITER ','
+
+std::string extractColumn(std::string s, uint64_t colIndex) {
+  uint64_t col = 0;
+  size_t start = 0;
+  size_t end = s.find(DELIMITER);
+  while (col < colIndex && end != std::string::npos) {
+start = end + 1;
+end = s.find(DELIMITER, start);
+++col;
+  }
+  return s.substr(start, end - start);
+}
+
+static const char* GetDate(void)
+{
+  static char buf[200];
+  time_t t = time(NULL);
+  struct tm* p = localtime(&t);
+  strftime(buf, sizeof(buf), "[%Y-%m-%d %H:%M:%S]", p);
+  return buf;
+}
+
+void fillLongValues(const std::vector<std::string>& data,
+orc::ColumnVectorBatch* batch,
+uint64_t numValues,
+uint64_t colIndex) {
+  orc::LongVectorBatch* longBatch =
+dynamic_cast<orc::LongVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  longBatch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  longBatch->data[i] = atoll(col.c_str());
+}
+  }
+  longBatch->hasNulls = hasNull;
+  longBatch->numElements = numValues;
+}
+
+void fillStringValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex,
+  orc::DataBuffer<char>& buffer,
+  uint64_t& offset) {
+  orc::StringVectorBatch* stringBatch =
+dynamic_cast<orc::StringVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  stringBatch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  memcpy(buffer.data() + offset,
+ col.c_str(),
+ col.size());
+  stringBatch->data[i] = buffer.data() + offset;
+  stringBatch->length[i] = static_cast<int64_t>(col.size());
+  offset += col.size();
+}
+  }
+  stringBatch->hasNulls = hasNull;
+  stringBatch->numElements = numValues;
+}
+
+void fillDoubleValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex) {
+  orc::DoubleVectorBatch* dblBatch =
+dynamic_cast<orc::DoubleVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  dblBatch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  dblBatch->data[i] = atof(col.c_str());
+}
+  }
+  dblBatch->hasNulls = hasNull;
+  dblBatch->numElements = numValues;
+}
+
+// parse fixed point decimal numbers
+void fillDecimalValues(const std::vector<std::string>& data,
+   orc::ColumnVectorBatch* batch,
+   uint64_t numValues,
+   uint64_t colIndex,
+   size_t scale,
+   size_t precision) {
+
+
+  orc::Decimal128VectorBatch* d128Batch = ORC_NULLPTR;
+  orc::Decimal64VectorBatch* d64Batch = ORC_NULLPTR;
+  i

[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...

2017-12-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/199#discussion_r155594426
  
--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,411 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DELIMITER ','
--- End diff --

I think it will be very helpful to parameterize the delimiter.
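
For instance, the delimiter could come from a command-line option rather than a compile-time #define; the flag name below is illustrative, not part of the actual tool:

    #include <cstring>

    // Scan argv for "--delimiter X" and fall back to ',' (the current default).
    char parseDelimiter(int argc, char* argv[]) {
      char delimiter = ',';
      for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], "--delimiter") == 0 &&
            std::strlen(argv[i + 1]) == 1) {
          delimiter = argv[i + 1][0];
        }
      }
      return delimiter;
    }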


---


[GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...

2017-09-22 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/169
  
@dain if the concern is performance, should we also limit the string 
length when generating stats in the writer path, which in my opinion is more 
costly? I think if we cap the string comparison at 1024 bytes, we won't need to 
trim at the end.
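
A sketch of that idea, assuming a bound of 1024 bytes (the function shape is illustrative, not the actual StringStatistics code):

    #include <string>

    static const size_t kStatsPrefixLimit = 1024;

    // Maintain min/max over bounded prefixes so no trimming pass is needed later.
    void updateMinMax(const std::string& value, bool& hasStats,
                      std::string& minimum, std::string& maximum) {
      std::string prefix = value.substr(0, kStatsPrefixLimit);
      if (!hasStats) {
        minimum = maximum = prefix;
        hasStats = true;
        return;
      }
      if (prefix < minimum) minimum = prefix;
      // Note: if the true maximum gets truncated, the stored value must still
      // be a valid upper bound (e.g. by bumping its last byte), as ORC-203 does.
      if (prefix > maximum) maximum = prefix;
    }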


---


[GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...

2017-09-21 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/169
  
I don't understand why you need to trim the strings. Protobuf doesn't 
support strings over 1024 characters?


---


[GitHub] orc issue #151: ORC-226 Support getWriterId in c++ reader interface

2017-09-20 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/151
  
Squashed commits. Thanks @ajayyadava @majetideepak 


---


[GitHub] orc issue #151: ORC-226 Support getWriterId in c++ reader interface

2017-09-15 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/151
  
@majetideepak how do I do that? :)


---


[GitHub] orc issue #151: ORC-226 Support getWriterId in c++ reader interface

2017-09-13 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/151
  
@majetideepak pls take another look. Thx!


---


[GitHub] orc issue #134: Orc 17

2017-09-13 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/134
  
LGTM


---


[GitHub] orc pull request #134: Orc 17

2017-09-13 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r138531251
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -0,0 +1,172 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/OrcFile.hh"
+
+#include "Adaptor.hh"
+#include "Exceptions.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hdfspp/hdfspp.h"
+
+namespace orc {
+
+  class HdfsFileInputStream : public InputStream {
+  private:
+std::string filename;
+std::unique_ptr file;
+std::unique_ptr file_system;
+uint64_t totalLength;
+
+  public:
+HdfsFileInputStream(std::string _filename) {
+  filename = _filename ;
+
+  //Building a URI object from the given uri_path
+  hdfs::URI uri;
+  try {
+uri = hdfs::URI::parse_from_string(filename);
+  } catch (const hdfs::uri_parse_error&) {
+throw ParseError("Malformed URI: " + filename);
+  }
+
+  //This sets conf path to default "$HADOOP_CONF_DIR" or 
"/etc/hadoop/conf"
+  //and loads configs core-site.xml and hdfs-site.xml from the conf 
path
+  hdfs::ConfigParser parser;
+  if(!parser.LoadDefaultResources()){
+throw ParseError("Could not load default resources. ");
+  }
+  auto stats = parser.ValidateResources();
+  //validating core-site.xml
+  if(!stats[0].second.ok()){
+throw ParseError(stats[0].first + " is invalid: " + 
stats[0].second.ToString());
+  }
+  //validating hdfs-site.xml
+  if(!stats[1].second.ok()){
+throw ParseError(stats[1].first + " is invalid: " + 
stats[1].second.ToString());
+  }
+  hdfs::Options options;
+  if(!parser.get_options(options)){
+throw ParseError("Could not load Options object. ");
+  }
+  hdfs::IoService * io_service = hdfs::IoService::New();
+  //Wrapping file_system into a unique pointer to guarantee deletion
+  file_system = std::unique_ptr(
+  hdfs::FileSystem::New(io_service, "", options));
+  if (file_system.get() == nullptr) {
+throw ParseError("Can't create FileSystem object. ");
+  }
+  hdfs::Status status;
+  //Checking if the user supplied the host
+  if(!uri.get_host().empty()){
+//Using port if supplied, otherwise using "" to look up port in 
configs
+std::string port = uri.has_port() ?
+std::to_string(uri.get_port()) : "";
+status = file_system->Connect(uri.get_host(), port);
+if (!status.ok()) {
+  throw ParseError("Can't connect to " + uri.get_host()
+  + ":" + port + ". " + status.ToString());
+}
+  } else {
+status = file_system->ConnectToDefaultFs();
+if (!status.ok()) {
+  if(!options.defaultFS.get_host().empty()){
+throw ParseError("Error connecting to " +
+options.defaultFS.str() + ". " + status.ToString());
+  } else {
+throw ParseError(
+"Error connecting to the cluster: defaultFS is empty. "
++ status.ToString());
+  }
+}
+  }
+
+  if (file_system.get() == nullptr) {
+throw ParseError("Can't connect the file system. ");
+  }
+
+  hdfs::FileHandle *file_raw = nullptr;
+  status = file_system->Open(uri.get_path(), _raw);
--- End diff --

My feeling is that you may want to return other meta info in the future. 
But it's your call.


---


[GitHub] orc pull request #151: ORC-226 Support getWriterId in c++ reader interface

2017-09-11 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/151#discussion_r138235902
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -288,6 +288,18 @@ namespace orc {
 virtual uint64_t getCompressionSize() const = 0;
 
 /**
+ * Get ID of writer that generated the file.
+ * @return UNKNOWN_WRITER if the writer ID is undefined
+ */
+virtual WriterId getWriterId() const = 0;
+
+/**
+ * Get the writer id value when getWriterId() returns an unknown 
writer.
+ * @return the integer value of the writer ID.
+ */
+virtual int getUnknownWriterIdValue() const = 0;
--- End diff --

Sounds good. I will update it. Thanks.
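
For reference, a small usage sketch of the two accessors quoted above (illustrative; it reports the known writer when possible and falls back to the raw value):

    #include <iostream>
    #include "orc/Reader.hh"

    void printWriter(const orc::Reader& reader) {
      orc::WriterId id = reader.getWriterId();
      if (id == orc::UNKNOWN_WRITER) {
        std::cout << "unknown writer, raw id = "
                  << reader.getUnknownWriterIdValue() << "\n";
      } else {
        std::cout << "writer id = " << static_cast<int>(id) << "\n";
      }
    }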


---


[GitHub] orc pull request #134: Orc 17

2017-09-11 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r138018152
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -0,0 +1,172 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/OrcFile.hh"
+
+#include "Adaptor.hh"
+#include "Exceptions.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hdfspp/hdfspp.h"
+
+namespace orc {
+
+  class HdfsFileInputStream : public InputStream {
+  private:
+std::string filename;
+std::unique_ptr file;
+std::unique_ptr file_system;
+uint64_t totalLength;
+
+  public:
+HdfsFileInputStream(std::string _filename) {
+  filename = _filename ;
+
+  //Building a URI object from the given uri_path
+  hdfs::URI uri;
+  try {
+uri = hdfs::URI::parse_from_string(filename);
+  } catch (const hdfs::uri_parse_error&) {
+throw ParseError("Malformed URI: " + filename);
+  }
+
+  //This sets conf path to default "$HADOOP_CONF_DIR" or 
"/etc/hadoop/conf"
+  //and loads configs core-site.xml and hdfs-site.xml from the conf 
path
+  hdfs::ConfigParser parser;
+  if(!parser.LoadDefaultResources()){
+throw ParseError("Could not load default resources. ");
+  }
+  auto stats = parser.ValidateResources();
+  //validating core-site.xml
+  if(!stats[0].second.ok()){
+throw ParseError(stats[0].first + " is invalid: " + 
stats[0].second.ToString());
+  }
+  //validating hdfs-site.xml
+  if(!stats[1].second.ok()){
+throw ParseError(stats[1].first + " is invalid: " + 
stats[1].second.ToString());
+  }
+  hdfs::Options options;
+  if(!parser.get_options(options)){
+throw ParseError("Could not load Options object. ");
+  }
+  hdfs::IoService * io_service = hdfs::IoService::New();
+  //Wrapping file_system into a unique pointer to guarantee deletion
+  file_system = std::unique_ptr(
+  hdfs::FileSystem::New(io_service, "", options));
+  if (file_system.get() == nullptr) {
+throw ParseError("Can't create FileSystem object. ");
+  }
+  hdfs::Status status;
+  //Checking if the user supplied the host
+  if(!uri.get_host().empty()){
+//Using port if supplied, otherwise using "" to look up port in 
configs
+std::string port = uri.has_port() ?
+std::to_string(uri.get_port()) : "";
+status = file_system->Connect(uri.get_host(), port);
+if (!status.ok()) {
+  throw ParseError("Can't connect to " + uri.get_host()
+  + ":" + port + ". " + status.ToString());
+}
+  } else {
+status = file_system->ConnectToDefaultFs();
+if (!status.ok()) {
+  if(!options.defaultFS.get_host().empty()){
+throw ParseError("Error connecting to " +
+options.defaultFS.str() + ". " + status.ToString());
+  } else {
+throw ParseError(
+"Error connecting to the cluster: defaultFS is empty. "
++ status.ToString());
+  }
+}
+  }
+
+  if (file_system.get() == nullptr) {
+throw ParseError("Can't connect the file system. ");
+  }
+
+  hdfs::FileHandle *file_raw = nullptr;
+  status = file_system->Open(uri.get_path(), _raw);
--- End diff --

nit: you may want to cache "status" within the class.


---


[GitHub] orc pull request #134: Orc 17

2017-09-11 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r138019713
  
--- Diff: c++/include/orc/OrcFile.hh ---
@@ -103,12 +103,18 @@ namespace orc {
   };
 
   /**
-   * Create a stream to a local file.
+   * Create a stream to a local file or HDFS file if path begins with 
"hdfs://"
--- End diff --

Why do you want readLocalFile() to handle HDFS files? You already have 
readHdfsFile(), which does exactly that.


---


[GitHub] orc issue #159: ORC-175: ORC-232: add jmh-generator-annprocess in pom.xml. i...

2017-09-11 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/159
  
@iamhumanbeing did you compare it with zstd? Based on my experience, zstd 
is much better than igzip, and I would expect a similar result with ISA-L. It 
doesn't seem to add a lot of value if we plan to support zstd in the near 
future.


---


[GitHub] orc issue #151: ORC-226 Support getWriterId in c++ reader interface

2017-09-04 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/151
  
@omalley please take another look. thanks.


---


[GitHub] orc pull request #151: ORC-226 Support getWriterId in c++ reader interface

2017-09-02 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/151#discussion_r136697977
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -288,6 +288,17 @@ namespace orc {
 virtual uint64_t getCompressionSize() const = 0;
 
 /**
+ * Get ID of writer that generated the file.
+ * Current availiable Orc writers:
+ * 0 = ORC Java
+ * 1 = ORC C++
+ * 2 = Presto
+ * @param id out parameter for writer id
+ * @return true if writer id is availiable, false if otherwise
+ */
+virtual bool getWriterId(uint32_t & id) const = 0;
--- End diff --

ok, that sounds good. I will update it in my next change.


---


[GitHub] orc pull request #151: ORC-226 Support getWriterId in c++ reader interface

2017-08-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/151#discussion_r134897076
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -288,6 +288,17 @@ namespace orc {
 virtual uint64_t getCompressionSize() const = 0;
 
 /**
+ * Get ID of writer that generated the file.
+ * Current availiable Orc writers:
+ * 0 = ORC Java
+ * 1 = ORC C++
+ * 2 = Presto
+ * @param id out parameter for writer id
+ * @return true if writer id is availiable, false if otherwise
+ */
+virtual bool getWriterId(uint32_t & id) const = 0;
--- End diff --

@omalley can we close on this by just returning the actual code? Or do you 
have other concerns? Thx.


---


[GitHub] orc pull request #151: ORC-226 Support getWriterId in c++ reader interface

2017-08-15 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/151#discussion_r10343
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -288,6 +288,17 @@ namespace orc {
 virtual uint64_t getCompressionSize() const = 0;
 
 /**
+ * Get ID of writer that generated the file.
+ * Current availiable Orc writers:
+ * 0 = ORC Java
+ * 1 = ORC C++
+ * 2 = Presto
+ * @param id out parameter for writer id
+ * @return true if writer id is availiable, false if otherwise
+ */
+virtual bool getWriterId(uint32_t & id) const = 0;
--- End diff --

Then we would need a logger in the ORC library, which we don't have now. What's 
the problem with returning the actual value instead of ORC_FUTURE?


---


[GitHub] orc pull request #151: ORC-226 Support getWriterId in c++ reader interface

2017-08-14 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/151#discussion_r133008869
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -288,6 +288,17 @@ namespace orc {
 virtual uint64_t getCompressionSize() const = 0;
 
 /**
+ * Get ID of writer that generated the file.
+ * Current availiable Orc writers:
+ * 0 = ORC Java
+ * 1 = ORC C++
+ * 2 = Presto
+ * @param id out parameter for writer id
+ * @return true if writer id is availiable, false if otherwise
+ */
+virtual bool getWriterId(uint32_t & id) const = 0;
--- End diff --

Ok, I see. But I am still not sure about returning ORC_FUTURE. The caller 
may want to obtain the exact writer ID for debugging purposes, and it can choose 
to ignore it if it wants. We would just be giving them one more option. What do 
you think?


---


[GitHub] orc pull request #151: ORC-226 Support getWriterId in c++ reader interface

2017-08-10 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/151#discussion_r132589551
  
--- Diff: c++/include/orc/Reader.hh ---
@@ -288,6 +288,17 @@ namespace orc {
 virtual uint64_t getCompressionSize() const = 0;
 
 /**
+ * Get ID of writer that generated the file.
+ * Current availiable Orc writers:
+ * 0 = ORC Java
+ * 1 = ORC C++
+ * 2 = Presto
+ * @param id out parameter for writer id
+ * @return true if writer id is availiable, false if otherwise
+ */
+virtual bool getWriterId(uint32_t & id) const = 0;
--- End diff --

Will there be any files generated by an old Presto writer without an ID? Also, 
if we choose to return ORC_FUTURE, there's no way to determine the actual 
writer ID. 


---


[GitHub] orc pull request #152: ORC-227: [C++] Fix docker failure due to ExternalProj...

2017-08-10 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/152#discussion_r132531200
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -26,9 +26,9 @@
 
 namespace orc {
 
-  const size_t MINIMUM_REPEAT = 3;
-  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
-  const size_t MAX_LITERAL_SIZE = 128;
+  const int MINIMUM_REPEAT = 3;
+  const int MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const int MAX_LITERAL_SIZE = 128;
--- End diff --

ok. Thanks for fixing this!


---


[GitHub] orc pull request #152: ORC-227: [C++] Fix docker failure due to ExternalProj...

2017-08-10 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/152#discussion_r132522783
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -26,9 +26,9 @@
 
 namespace orc {
 
-  const size_t MINIMUM_REPEAT = 3;
-  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
-  const size_t MAX_LITERAL_SIZE = 128;
+  const int MINIMUM_REPEAT = 3;
+  const int MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const int MAX_LITERAL_SIZE = 128;
--- End diff --

I don't see the mismatch warnings on the CentOS 7 docker image. Where did you 
see them? 


---


[GitHub] orc issue #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-08-07 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/128
  
@omalley Updated the change again, please take a look. I also reduced the initial 
stream memory capacity from 4MB to 1MB, and greatly reduced the pre-allocated 
memory for footers and the postscript. This should alleviate the concern about 
writer memory consumption. We can figure out a long-term solution later.


---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-08-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r131780372
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,228 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the data compression block size.
+ */
+WriterOptions& setCompressionBlockSize(uint64_t size);
+
+/**
+ * Get the data compression block size.
+ * @return if not set, return default value.
+ */
+uint64_t getCompressionBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
--- End diff --

Sure.


---


[GitHub] orc pull request #142: [ORC-218] Cache timezone information in the library.

2017-07-31 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/142#discussion_r130473949
  
--- Diff: c++/src/CMakeLists.txt ---
@@ -125,6 +125,83 @@ include_directories (
   ${LZ4_INCLUDE_DIRS}
   )
 
+# To avoid reading the Timezone database from disk, we load the file 
during the
+# build and inject them as binary into the library. This can increase the 
size
+# of library but avoids doing system calls at runtime. The behavior can be
+# disabled by using the CMake flag NO_EMBEDDED_TZ_DB
+if (NOT DEFINED EMBEDDED_TZ_DB)
+  set(EMBEDDED_TZ_DB ON)
--- End diff --

A side effect of this is that when the timezone info files are updated, the ORC 
lib has to be recompiled and redistributed.


---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-07-20 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r128647874
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
+
+/**
+ * Get whether or not to have block padding.
+ * @return if not set, return default value which is false.
+ */
+bool getBlockPadding() const;
+
+/**
+ * Set Run length encoding version
+ */
+WriterOptions& setRleVersion(RleVersion version);
+
+/**
+ * Get Run Length Encoding version
+ */
+RleVersion getRleVersion() const;
+
+/**
+ * Set compression kind.
+ */
+WriterOptions& setCompression(CompressionKind comp);
+
+/**
+ * Get the compression kind.
+ * @return if not set, return default value which is ZLIB.
+ */
+CompressionKind getCompression() const;
+
+/**
+ * Set the encoding strategy.
+ */
+WriterOptions& setEncodingStrategy(EncodingStrategy strategy);
+
+/**
+ * Get the encoding strategy.
+ * @return if 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-07-19 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r128392669
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
+
+/**
+ * Get whether or not to have block padding.
+ * @return if not set, return default value which is false.
+ */
+bool getBlockPadding() const;
+
+/**
+ * Set Run length encoding version
+ */
+WriterOptions& setRleVersion(RleVersion version);
+
+/**
+ * Get Run Length Encoding version
+ */
+RleVersion getRleVersion() const;
+
+/**
+ * Set compression kind.
+ */
+WriterOptions& setCompression(CompressionKind comp);
+
+/**
+ * Get the compression kind.
+ * @return if not set, return default value which is ZLIB.
+ */
+CompressionKind getCompression() const;
+
+/**
+ * Set the encoding strategy.
+ */
+WriterOptions& setEncodingStrategy(EncodingStrategy strategy);
+
+/**
+ * Get the encoding strategy.
+ * @return if 

[GitHub] orc issue #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-07-05 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/128
  
Hi @omalley, please see my replies and the new commit. Let me know if you 
have further questions.


---


[GitHub] orc pull request #134: Orc 17

2017-07-05 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r125729396
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -34,15 +34,13 @@
 #include "common/hdfs_configuration.h"
 #include "common/configuration_loader.h"
 
-#define BUF_SIZE 1048576 //1 MB
-
 namespace orc {
 
   class HdfsFileInputStream : public InputStream {
--- End diff --

Does HDFS C++ provide a mocked service interface? My concern is the 
testability of HdfsFileInputStream.
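
For reference, the reader side can be unit-tested without libhdfspp at all by
pointing it at an in-memory orc::InputStream. A minimal sketch (not part of
this PR; the class name and buffer handling are illustrative, only the
InputStream method set is taken from the existing public header):

    #include <cstring>
    #include <stdexcept>
    #include <string>
    #include <vector>

    #include "orc/OrcFile.hh"

    // In-memory InputStream that a unit test can hand to the ORC reader in
    // place of HdfsFileInputStream.
    class InMemoryInputStream : public orc::InputStream {
    private:
      std::vector<char> data;
      std::string name;

    public:
      InMemoryInputStream(const char* buf, uint64_t length)
          : data(buf, buf + length), name("in-memory") {}

      uint64_t getLength() const override { return data.size(); }

      uint64_t getNaturalReadSize() const override { return 128 * 1024; }

      void read(void* buf, uint64_t length, uint64_t offset) override {
        if (buf == nullptr) {
          throw std::runtime_error("Buffer is null");
        }
        if (offset + length > data.size()) {
          throw std::runtime_error("Read past end of stream");
        }
        memcpy(buf, data.data() + offset, length);
      }

      const std::string& getName() const override { return name; }
    };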


---


[GitHub] orc pull request #134: Orc 17

2017-07-05 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r125728966
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -123,17 +123,21 @@ namespace orc {
   uint64_t length,
   uint64_t offset) override {
 
-  size_t last_bytes_read = 0;
-
   if (!buf) {
 throw ParseError("Buffer is null");
   }
 
   hdfs::Status status;
-  status = file->PositionRead(buf, static_cast<size_t>(length), 
static_cast<off_t>(offset), &last_bytes_read);
-  if(!status.ok()) {
-throw ParseError("Error reading the file: " + status.ToString());
-  }
+  size_t total_bytes_read = 0;
+  size_t last_bytes_read = 0;
+  
+  do {
+status = file->PositionRead(buf, static_cast<size_t>(length) - 
total_bytes_read, static_cast<off_t>(offset + total_bytes_read), 
&last_bytes_read);
--- End diff --

Make sure to wrap on 80 characters.


---


[GitHub] orc pull request #134: Orc 17

2017-07-05 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r125728350
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -66,22 +64,22 @@ namespace orc {
 options = config->GetOptions();
   }
   hdfs::IoService * io_service = hdfs::IoService::New();
-  //Wrapping fs into a shared pointer to guarantee deletion
-  std::shared_ptr<hdfs::FileSystem> 
fs(hdfs::FileSystem::New(io_service, "", options));
-  if (!fs) {
+  //Wrapping file_system into a unique pointer to guarantee deletion
+  file_system = 
std::unique_ptr<hdfs::FileSystem>(hdfs::FileSystem::New(io_service, "", 
options));
+  if (!file_system) {
--- End diff --

Unfortunately unique_ptr can be redefined as auto_ptr if the platform doesn't 
support unique_ptr (see orc-config.hh). We talked about removing this redefine, 
but it hasn't been done yet. So I'd suggest we stick to file_system.get() != 
nullptr for now.
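
A minimal, self-contained illustration of why the explicit get() comparison is
the safer spelling (assuming ORC_UNIQUE_PTR may expand to std::auto_ptr on
pre-C++11 toolchains, as noted above; std::auto_ptr has no operator bool, so
the implicit "if (!ptr)" test does not compile there):

    #include <memory>
    #include <stdexcept>

    typedef std::unique_ptr<int> PtrType;  // stand-in for ORC_UNIQUE_PTR<...>

    void checkNotNull(const PtrType& p) {
      // if (!p) { ... }                   // relies on operator bool; not
      //                                   // available on auto_ptr
      if (p.get() == nullptr) {            // explicit null check via get()
        throw std::runtime_error("pointer is null");
      }
    }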


---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-29 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r124720147
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -0,0 +1,507 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Int128.hh"
+#include "orc/Writer.hh"
+
+#include "ByteRLE.hh"
+#include "ColumnWriter.hh"
+#include "RLE.hh"
+#include "Statistics.hh"
+#include "Timezone.hh"
+
+namespace orc {
+  StreamsFactory::~StreamsFactory() {
+//PASS
+  }
+
+  class StreamsFactoryImpl : public StreamsFactory {
+  public:
+StreamsFactoryImpl(
+   const WriterOptions& writerOptions,
+   OutputStream* outputStream) :
+   options(writerOptions),
+   outStream(outputStream) {
+   }
+
+virtual std::unique_ptr<BufferedOutputStream>
+createStream(proto::Stream_Kind kind) override;
+  private:
+const WriterOptions& options;
+OutputStream* outStream;
+  };
+
+  std::unique_ptr<BufferedOutputStream> StreamsFactoryImpl::createStream(
+  proto::Stream_Kind) {
+// In the future, we can decide compression strategy and modifier
+// based on stream kind. But for now we just use the setting from
+// WriterOption
+return createCompressor(
+options.getCompression(),
+outStream,
+options.getCompressionStrategy(),
+options.getBufferSize(),
+options.getBlockSize(),
+*options.getMemoryPool());
+  }
+
+  std::unique_ptr<StreamsFactory> createStreamsFactory(
+const WriterOptions& options,
+OutputStream* outStream) {
+return std::unique_ptr<StreamsFactory>(
+   new StreamsFactoryImpl(options, 
outStream));
+  }
+
+  RowIndexPositionRecorder::~RowIndexPositionRecorder() {
+// PASS
+  }
+
+  ColumnWriter::ColumnWriter(
+ const Type& type,
+ StreamsFactory& factory,
+ const WriterOptions& options) :
+columnId(type.getColumnId()),
+streamsFactory(factory),
+colIndexStatistics(),
+colStripeStatistics(),
+colFileStatistics(),
+enableIndex(options.getEnableIndex()),
+enableStats(options.getEnableStats()),
+rowIndex(),
+rowIndexEntry(),
+rowIndexPosition(),
+memPool(*options.getMemoryPool()),
+indexStream() {
+
+std::unique_ptr<BufferedOutputStream> presentStream =
+factory.createStream(proto::Stream_Kind_PRESENT);
+notNullEncoder = createBooleanRleEncoder(std::move(presentStream));
+
+if (enableIndex || enableStats) {
+  bool enableStrCmp = options.getEnableStrStatsCmp();
+  colIndexStatistics = createColumnStatistics(type, enableStrCmp);
+  if (enableStats) {
+colStripeStatistics = createColumnStatistics(type, enableStrCmp);
+colFileStatistics = createColumnStatistics(type, enableStrCmp);
+  }
+}
+
+if (enableIndex) {
+  rowIndex = std::unique_ptr<proto::RowIndex>(new proto::RowIndex());
+  rowIndexEntry =
+std::unique_ptr<proto::RowIndexEntry>(new proto::RowIndexEntry());
+  rowIndexPosition 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-29 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r124720134
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
+
+/**
+ * Get whether or not to have block padding.
+ * @return if not set, return default value which is false.
+ */
+bool getBlockPadding() const;
+
+/**
+ * Set Run length encoding version
+ */
+WriterOptions& setRleVersion(RleVersion version);
+
+/**
+ * Get Run Length Encoding version
+ */
+RleVersion getRleVersion() const;
+
+/**
+ * Set compression kind.
+ */
+WriterOptions& setCompression(CompressionKind comp);
+
+/**
+ * Get the compression kind.
+ * @return if not set, return default value which is ZLIB.
+ */
+CompressionKind getCompression() const;
+
+/**
+ * Set the encoding strategy.
+ */
+WriterOptions& setEncodingStrategy(EncodingStrategy strategy);
+
+/**
+ * Get the encoding strategy.
+ * @return if 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-29 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r124718827
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
+
+/**
+ * Get whether or not to have block padding.
+ * @return if not set, return default value which is false.
+ */
+bool getBlockPadding() const;
+
+/**
+ * Set Run length encoding version
+ */
+WriterOptions& setRleVersion(RleVersion version);
+
+/**
+ * Get Run Length Encoding version
+ */
+RleVersion getRleVersion() const;
+
+/**
+ * Set compression kind.
+ */
+WriterOptions& setCompression(CompressionKind comp);
+
+/**
+ * Get the compression kind.
+ * @return if not set, return default value which is ZLIB.
+ */
+CompressionKind getCompression() const;
+
+/**
+ * Set the encoding strategy.
+ */
+WriterOptions& setEncodingStrategy(EncodingStrategy strategy);
+
+/**
+ * Get the encoding strategy.
+ * @return if 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-29 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r124718523
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
+
+/**
+ * Get whether or not to have block padding.
+ * @return if not set, return default value which is false.
+ */
+bool getBlockPadding() const;
+
+/**
+ * Set Run length encoding version
+ */
+WriterOptions& setRleVersion(RleVersion version);
+
+/**
+ * Get Run Length Encoding version
+ */
+RleVersion getRleVersion() const;
+
+/**
+ * Set compression kind.
+ */
+WriterOptions& setCompression(CompressionKind comp);
+
+/**
+ * Get the compression kind.
+ * @return if not set, return default value which is ZLIB.
+ */
+CompressionKind getCompression() const;
+
+/**
+ * Set the encoding strategy.
+ */
+WriterOptions& setEncodingStrategy(EncodingStrategy strategy);
+
+/**
+ * Get the encoding strategy.
+ * @return if 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-29 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r124718336
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
--- End diff --

Sure. Will do.
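
For context, the options in this header are designed to be chained. A sketch of
how a caller might configure a writer with them (the createWriter() entry point
and the default values shown are assumptions based on the rest of this PR, not
verbatim from it):

    #include <memory>

    #include "orc/OrcFile.hh"
    #include "orc/Writer.hh"

    // Build a writer whose options mirror the setters reviewed above.
    std::unique_ptr<orc::Writer> makeWriter(orc::OutputStream* outStream,
                                            const orc::Type& schema) {
      orc::WriterOptions options;
      options.setStripeSize(64 * 1024 * 1024)      // 64 MB stripes
             .setBufferSize(4 * 1024 * 1024)       // 4 MB compression buffer
             .setCompression(orc::CompressionKind_ZLIB)
             .setBlockPadding(false)
             .setRowIndexStride(10000);
      return orc::createWriter(schema, outStream, options);
    }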


---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-28 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r124715175
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
--- End diff --

I will add a FileVersion enum. I will still keep the RleVersion enum for 
creating different versions of the encoder, but will remove it from the public header.
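
For reference, the FileVersion concept is essentially a (major, minor) pair
identifying the file format version. A hedged sketch of the idea (illustrative
only; the class that eventually landed in orc/Common.hh may differ in detail):

    #include <cstdint>
    #include <sstream>
    #include <string>

    namespace orc {

      // File format version, e.g. 0.12 for files written by current writers.
      class FileVersion {
      private:
        uint32_t majorVersion;
        uint32_t minorVersion;

      public:
        FileVersion(uint32_t major, uint32_t minor)
            : majorVersion(major), minorVersion(minor) {}

        uint32_t getMajor() const { return majorVersion; }
        uint32_t getMinor() const { return minorVersion; }

        bool operator==(const FileVersion& other) const {
          return majorVersion == other.majorVersion &&
                 minorVersion == other.minorVersion;
        }

        std::string toString() const {
          std::ostringstream buffer;
          buffer << majorVersion << '.' << minorVersion;
          return buffer.str();
        }
      };
    }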


---


[GitHub] orc pull request #134: Orc 17

2017-06-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r123852932
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -121,39 +116,24 @@ namespace orc {
 }
 
 uint64_t getNaturalReadSize() const override {
-  return BUF_SIZE;
+  return 1048576; //1 MB
 }
 
 void read(void* buf,
   uint64_t length,
   uint64_t offset) override {
 
-  ssize_t total_bytes_read = 0;
   size_t last_bytes_read = 0;
 
   if (!buf) {
 throw ParseError("Buffer is null");
   }
 
   hdfs::Status status;
-  char input_buffer[BUF_SIZE];
-  do{
-//Reading file chunks
-status = file->PositionRead(input_buffer, sizeof(input_buffer), 
offset, &last_bytes_read);
-if(status.ok()) {
-  //Writing file chunks to buf
-  if(total_bytes_read + last_bytes_read >= length){
-memcpy((char*) buf + total_bytes_read, input_buffer, length - 
total_bytes_read);
-break;
-  } else {
-memcpy((char*) buf + total_bytes_read, input_buffer, 
last_bytes_read);
-total_bytes_read += last_bytes_read;
-offset += last_bytes_read;
-  }
-} else {
-  throw ParseError("Error reading the file: " + status.ToString());
-}
-  } while (true);
+  status = file->PositionRead(buf, static_cast<size_t>(length), 
static_cast<off_t>(offset), &last_bytes_read);
--- End diff --

You still need to put this in a loop in case last_bytes_read is less than 
length; a single PositionRead call may legitimately return fewer bytes than 
requested.
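
Concretely, the fix is to keep calling PositionRead and advance both the offset
and the destination pointer until the requested length has been read. A sketch
of the loop (the size_t/off_t parameter types of PositionRead are assumed from
libhdfspp, and ParseError and the hdfs types come from the file under review;
this is not the exact code that was merged):

    // Read exactly `length` bytes starting at `offset`, tolerating short reads.
    void readFully(hdfs::FileHandle* file, void* buf,
                   uint64_t length, uint64_t offset) {
      size_t total_bytes_read = 0;
      size_t last_bytes_read = 0;
      hdfs::Status status;

      do {
        status = file->PositionRead(
            static_cast<char*>(buf) + total_bytes_read,
            static_cast<size_t>(length) - total_bytes_read,
            static_cast<off_t>(offset + total_bytes_read),
            &last_bytes_read);
        if (!status.ok()) {
          throw ParseError("Error reading the file: " + status.ToString());
        }
        total_bytes_read += last_bytes_read;
      } while (total_bytes_read < length);
    }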


---


[GitHub] orc pull request #134: Orc 17

2017-06-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r123853053
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -66,22 +64,22 @@ namespace orc {
 options = config->GetOptions();
   }
   hdfs::IoService * io_service = hdfs::IoService::New();
-  //Wrapping fs into a shared pointer to guarantee deletion
-  std::shared_ptr<hdfs::FileSystem> 
fs(hdfs::FileSystem::New(io_service, "", options));
-  if (!fs) {
+  //Wrapping file_system into a unique pointer to guarantee deletion
+  file_system = 
std::unique_ptr<hdfs::FileSystem>(hdfs::FileSystem::New(io_service, "", 
options));
+  if (!file_system) {
--- End diff --

Are you sure this would work? Should it be if (file_system.get() != 
nullptr)?


---


[GitHub] orc pull request #134: Orc 17

2017-06-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r123853204
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -34,15 +34,13 @@
 #include "common/hdfs_configuration.h"
 #include "common/configuration_loader.h"
 
-#define BUF_SIZE 1048576 //1 MB
-
 namespace orc {
 
   class HdfsFileInputStream : public InputStream {
--- End diff --

do you have corresponding UTs for this new class?


---


[GitHub] orc pull request #134: Orc 17

2017-06-22 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r123646956
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -0,0 +1,170 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/OrcFile.hh"
+
+#include "Adaptor.hh"
+#include "Exceptions.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hdfspp/hdfspp.h"
+#include "common/uri.h"
+#include "common/hdfs_configuration.h"
+#include "common/configuration_loader.h"
+
+#define BUF_SIZE 1048576 //1 MB
+
+namespace orc {
+
+  class HdfsFileInputStream : public InputStream {
+  private:
+std::string filename;
+std::unique_ptr<hdfs::FileHandle> file;
+std::shared_ptr<hdfs::FileSystem> file_system;
+uint64_t totalLength;
+
+  public:
+HdfsFileInputStream(std::string _filename) {
+  filename = _filename ;
+
+  //Building a URI object from the given uri_path
+  hdfs::optional<hdfs::URI> uri = 
hdfs::URI::parse_from_string(filename);
+  if (!uri) {
+throw ParseError("Malformed URI: " + filename);
+  }
+
+  hdfs::Options options;
+  //Setting the config path to the default: "$HADOOP_CONF_DIR" or 
"/etc/hadoop/conf"
+  hdfs::ConfigurationLoader loader;
+  //Loading default config files core-site.xml and hdfs-site.xml from 
the config path
+  hdfs::optional<hdfs::HdfsConfiguration> config = 
loader.LoadDefaultResources<hdfs::HdfsConfiguration>();
+  //TODO: HDFS-9539 - after this is resolved, valid config will always 
be returned.
+  if(config){
+//Loading options from the config
+options = config->GetOptions();
+  }
+  hdfs::IoService * io_service = hdfs::IoService::New();
+  //Wrapping fs into a shared pointer to guarantee deletion
+  std::shared_ptr<hdfs::FileSystem> 
fs(hdfs::FileSystem::New(io_service, "", options));
+  if (!fs) {
+throw ParseError("Can't create FileSystem object. ");
+  }
+  hdfs::Status status;
+  //Check if the user supplied the host
+  if(!uri.value().get_host().empty()){
+//If port is supplied we use it, otherwise we use the empty string 
so that it will be looked up in configs.
+std::string port = (uri.value().get_port()) ? 
std::to_string(uri.value().get_port().value()) : "";
+status = fs->Connect(uri.value().get_host(), port);
+if (!status.ok()) {
+  throw ParseError("Can't connect to " + uri.value().get_host() + 
":" + port + ". " + status.ToString());
+}
+  } else {
+status = fs->ConnectToDefaultFs();
+if (!status.ok()) {
+  if(!options.defaultFS.get_host().empty()){
+throw ParseError("Error connecting to " + 
options.defaultFS.str() + ". " + status.ToString());
+  } else {
+throw ParseError("Error connecting to the cluster: defaultFS 
is empty. " + status.ToString());
+  }
+}
+  }
+
+  if (!fs) {
+throw ParseError("Can't connect the file system. ");
+  }
+
+  hdfs::FileHandle *file_raw = nullptr;
+  status = fs->Open(uri->get_path(), &file_raw);
+  if (!status.ok()) {
+throw ParseError("Can't open " + uri->get_path() + ". " + 
status.ToString());
+  }
+  //wrapping file_raw into a unique pointer to guarantee deletion
+  std::unique_ptr<hdfs::FileHandle> file_handle(file_raw);
+  file = std::move(file_handle);
+  //transferring the ownership of FileSystem to HdfsFileInputStream to 
avoid premature deallocation
+  file_system = fs;
+
+  hdfs::StatInfo stat_info;
+  status = fs->GetFi

[GitHub] orc issue #132: ORC-202. Add writer implementation enum to file format

2017-06-21 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/132
  
That makes sense. I added the writer code to the C++ writer. We might need a 
similar change for the C++ reader; will do that in a separate PR.
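
A hedged sketch of what "adding the writer code" amounts to on the writer side:
the file footer records which implementation produced the file, so tools can
tell Java-written and C++-written files apart. The set_writer() field and the
numeric code below follow the convention introduced by ORC-202 (0 = Java
writer, 1 = C++ writer); treat both as assumptions here rather than verbatim
code from this change:

    // Stamp the footer with the C++ writer implementation id before it is
    // serialized into the file tail.
    void setWriterImplementation(proto::Footer& fileFooter) {
      fileFooter.set_writer(1);  // 1 = ORC C++ writer, per the ORC-202 enum
    }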


---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120799267
  
--- Diff: c++/src/Writer.cc ---
@@ -0,0 +1,659 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Common.hh"
+#include "orc/OrcFile.hh"
+
+#include "ColumnWriter.hh"
+#include "Timezone.hh"
+
+#include 
+
+namespace orc {
+
+  struct WriterOptionsPrivate {
+uint64_t stripeSize;
+uint64_t blockSize;
+uint64_t rowIndexStride;
+uint64_t bufferSize;
+bool blockPadding;
+CompressionKind compression;
+EncodingStrategy encodingStrategy;
+CompressionStrategy compressionStrategy;
+MemoryPool* memoryPool;
+WriterVersion version;
+double paddingTolerance;
+std::ostream* errorStream;
+RleVersion rleVersion;
+double dictionaryKeySizeThreshold;
+bool enableStats;
+bool enableStrStatsCmp;
+bool enableIndex;
+const Timezone* timezone;
+
+WriterOptionsPrivate() {
+  stripeSize = 64 * 1024 * 1024; // 64M
+  blockSize = 256 * 1024; // 256K
+  rowIndexStride = 1;
+  bufferSize = 4 * 1024 * 1024; // 4M
+  blockPadding = false;
+  compression = CompressionKind_ZLIB;
+  encodingStrategy = EncodingStrategy_SPEED;
+  compressionStrategy = CompressionStrategy_SPEED;
+  memoryPool = getDefaultPool();
+  version = WriterVersion_ORC_135;
+  paddingTolerance = 0.0;
+  errorStream = &std::cerr;
+  rleVersion = RleVersion_1;
+  dictionaryKeySizeThreshold = 0.0;
+  enableStats = true;
+  enableStrStatsCmp = false;
+  enableIndex = true;
+  timezone = ();
+}
+  };
+
+  WriterOptions::WriterOptions():
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate())) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(const WriterOptions& rhs):
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate(*(rhs.privateBits.get())))) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(WriterOptions& rhs) {
+// swap privateBits with rhs
+WriterOptionsPrivate* l = privateBits.release();
+privateBits.reset(rhs.privateBits.release());
+rhs.privateBits.reset(l);
+  }
+
+  WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) {
+if (this != &rhs) {
+  privateBits.reset(new 
WriterOptionsPrivate(*(rhs.privateBits.get())));
+}
+return *this;
+  }
+
+  WriterOptions::~WriterOptions() {
+// PASS
+  }
+
+  WriterOptions& WriterOptions::setStripeSize(uint64_t size) {
+privateBits->stripeSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getStripeSize() const {
+return privateBits->stripeSize;
+  }
+
+  WriterOptions& WriterOptions::setBlockSize(uint64_t size) {
+privateBits->blockSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBlockSize() const {
+return privateBits->blockSize;
+  }
+
+  WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) {
+privateBits->rowIndexStride = stride;
+return *this;
+  }
+
+  uint64_t WriterOptions::getRowIndexStride() const {
+return privateBits->rowIndexStride;
+  }
+
+  WriterOptions& WriterOptions::setBufferSize(uint64_t size) {
+privateBits->bufferSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBufferSize() const {
+return privateBits->bufferSize;
+  }
+
+  WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) {
+privateBits->dictionaryKeySizeThreshold = val;
+return *this;
+  }
+
 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120751420
  
--- Diff: c++/src/Writer.cc ---
@@ -0,0 +1,659 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Common.hh"
+#include "orc/OrcFile.hh"
+
+#include "ColumnWriter.hh"
+#include "Timezone.hh"
+
+#include 
+
+namespace orc {
+
+  struct WriterOptionsPrivate {
+uint64_t stripeSize;
+uint64_t blockSize;
+uint64_t rowIndexStride;
+uint64_t bufferSize;
+bool blockPadding;
+CompressionKind compression;
+EncodingStrategy encodingStrategy;
+CompressionStrategy compressionStrategy;
+MemoryPool* memoryPool;
+WriterVersion version;
+double paddingTolerance;
+std::ostream* errorStream;
+RleVersion rleVersion;
+double dictionaryKeySizeThreshold;
+bool enableStats;
+bool enableStrStatsCmp;
+bool enableIndex;
+const Timezone* timezone;
+
+WriterOptionsPrivate() {
+  stripeSize = 64 * 1024 * 1024; // 64M
+  blockSize = 256 * 1024; // 256K
+  rowIndexStride = 1;
+  bufferSize = 4 * 1024 * 1024; // 4M
+  blockPadding = false;
+  compression = CompressionKind_ZLIB;
+  encodingStrategy = EncodingStrategy_SPEED;
+  compressionStrategy = CompressionStrategy_SPEED;
+  memoryPool = getDefaultPool();
+  version = WriterVersion_ORC_135;
+  paddingTolerance = 0.0;
+  errorStream = &std::cerr;
+  rleVersion = RleVersion_1;
+  dictionaryKeySizeThreshold = 0.0;
+  enableStats = true;
+  enableStrStatsCmp = false;
+  enableIndex = true;
+  timezone = ();
+}
+  };
+
+  WriterOptions::WriterOptions():
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate())) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(const WriterOptions& rhs):
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate(*(rhs.privateBits.get())))) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(WriterOptions& rhs) {
+// swap privateBits with rhs
+WriterOptionsPrivate* l = privateBits.release();
+privateBits.reset(rhs.privateBits.release());
+rhs.privateBits.reset(l);
+  }
+
+  WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) {
+if (this != &rhs) {
+  privateBits.reset(new 
WriterOptionsPrivate(*(rhs.privateBits.get())));
+}
+return *this;
+  }
+
+  WriterOptions::~WriterOptions() {
+// PASS
+  }
+
+  WriterOptions& WriterOptions::setStripeSize(uint64_t size) {
+privateBits->stripeSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getStripeSize() const {
+return privateBits->stripeSize;
+  }
+
+  WriterOptions& WriterOptions::setBlockSize(uint64_t size) {
+privateBits->blockSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBlockSize() const {
+return privateBits->blockSize;
+  }
+
+  WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) {
+privateBits->rowIndexStride = stride;
+return *this;
+  }
+
+  uint64_t WriterOptions::getRowIndexStride() const {
+return privateBits->rowIndexStride;
+  }
+
+  WriterOptions& WriterOptions::setBufferSize(uint64_t size) {
+privateBits->bufferSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBufferSize() const {
+return privateBits->bufferSize;
+  }
+
+  WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) {
+privateBits->dictionaryKeySizeThreshold = val;
+return *this;
+  }
+
 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120739743
  
--- Diff: c++/src/Writer.cc ---
@@ -0,0 +1,659 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Common.hh"
+#include "orc/OrcFile.hh"
+
+#include "ColumnWriter.hh"
+#include "Timezone.hh"
+
+#include 
+
+namespace orc {
+
+  struct WriterOptionsPrivate {
+uint64_t stripeSize;
+uint64_t blockSize;
+uint64_t rowIndexStride;
+uint64_t bufferSize;
+bool blockPadding;
+CompressionKind compression;
+EncodingStrategy encodingStrategy;
+CompressionStrategy compressionStrategy;
+MemoryPool* memoryPool;
+WriterVersion version;
+double paddingTolerance;
+std::ostream* errorStream;
+RleVersion rleVersion;
+double dictionaryKeySizeThreshold;
+bool enableStats;
+bool enableStrStatsCmp;
+bool enableIndex;
+const Timezone* timezone;
+
+WriterOptionsPrivate() {
+  stripeSize = 64 * 1024 * 1024; // 64M
+  blockSize = 256 * 1024; // 256K
+  rowIndexStride = 1;
+  bufferSize = 4 * 1024 * 1024; // 4M
+  blockPadding = false;
+  compression = CompressionKind_ZLIB;
+  encodingStrategy = EncodingStrategy_SPEED;
+  compressionStrategy = CompressionStrategy_SPEED;
+  memoryPool = getDefaultPool();
+  version = WriterVersion_ORC_135;
+  paddingTolerance = 0.0;
+  errorStream = &std::cerr;
+  rleVersion = RleVersion_1;
+  dictionaryKeySizeThreshold = 0.0;
+  enableStats = true;
+  enableStrStatsCmp = false;
+  enableIndex = true;
+  timezone = ();
+}
+  };
+
+  WriterOptions::WriterOptions():
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate())) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(const WriterOptions& rhs):
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate(*(rhs.privateBits.get())))) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(WriterOptions& rhs) {
+// swap privateBits with rhs
+WriterOptionsPrivate* l = privateBits.release();
+privateBits.reset(rhs.privateBits.release());
+rhs.privateBits.reset(l);
+  }
+
+  WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) {
+if (this != &rhs) {
+  privateBits.reset(new 
WriterOptionsPrivate(*(rhs.privateBits.get())));
+}
+return *this;
+  }
+
+  WriterOptions::~WriterOptions() {
+// PASS
+  }
+
+  WriterOptions& WriterOptions::setStripeSize(uint64_t size) {
+privateBits->stripeSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getStripeSize() const {
+return privateBits->stripeSize;
+  }
+
+  WriterOptions& WriterOptions::setBlockSize(uint64_t size) {
+privateBits->blockSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBlockSize() const {
+return privateBits->blockSize;
+  }
+
+  WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) {
+privateBits->rowIndexStride = stride;
+return *this;
+  }
+
+  uint64_t WriterOptions::getRowIndexStride() const {
+return privateBits->rowIndexStride;
+  }
+
+  WriterOptions& WriterOptions::setBufferSize(uint64_t size) {
+privateBits->bufferSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBufferSize() const {
+return privateBits->bufferSize;
+  }
+
+  WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) {
+privateBits->dictionaryKeySizeThreshold = val;
+return *this;
+  }
+
 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120708078
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
+
+/**
+ * Get whether or not to have block padding.
+ * @return if not set, return default value which is false.
+ */
+bool getBlockPadding() const;
+
+/**
+ * Set Run length encoding version
+ */
+WriterOptions& setRleVersion(RleVersion version);
+
+/**
+ * Get Run Length Encoding version
+ */
+RleVersion getRleVersion() const;
+
+/**
+ * Set compression kind.
+ */
+WriterOptions& setCompression(CompressionKind comp);
+
+/**
+ * Get the compression kind.
+ * @return if not set, return default value which is ZLIB.
+ */
+CompressionKind getCompression() const;
+
+/**
+ * Set the encoding strategy.
+ */
+WriterOptions& setEncodingStrategy(EncodingStrategy strategy);
+
+/**
+ * Get the encoding strategy.
+ * @return if 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120706989
  
--- Diff: c++/src/Writer.cc ---
@@ -0,0 +1,659 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Common.hh"
+#include "orc/OrcFile.hh"
+
+#include "ColumnWriter.hh"
+#include "Timezone.hh"
+
+#include 
+
+namespace orc {
+
+  struct WriterOptionsPrivate {
+uint64_t stripeSize;
+uint64_t blockSize;
+uint64_t rowIndexStride;
+uint64_t bufferSize;
+bool blockPadding;
+CompressionKind compression;
+EncodingStrategy encodingStrategy;
+CompressionStrategy compressionStrategy;
+MemoryPool *memoryPool;
+WriterVersion version;
+double paddingTolerance;
+std::ostream* errorStream;
+RleVersion rleVersion;
+double dictionaryKeySizeThreshold;
+bool enableStats;
+bool enableStrStatsCmp;
+bool enableIndex;
+const Timezone* timezone;
+
+WriterOptionsPrivate() {
+  stripeSize = 64 * 1024 * 1024; // 64M
+  blockSize = 256 * 1024; // 256K
+  rowIndexStride = 1;
+  bufferSize = 4 * 1024 * 1024; // 4M
+  blockPadding = false;
+  compression = CompressionKind_ZLIB;
+  encodingStrategy = EncodingStrategy_SPEED;
+  compressionStrategy = CompressionStrategy_SPEED;
+  memoryPool = getDefaultPool();
+  version = WriterVersion_ORC_135;
+  paddingTolerance = 0.0;
+  errorStream = &std::cerr;
+  rleVersion = RleVersion_1;
+  dictionaryKeySizeThreshold = 0.0;
+  enableStats = true;
+  enableStrStatsCmp = false;
+  enableIndex = true;
+  timezone = ();
+}
+  };
+
+  WriterOptions::WriterOptions():
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate())) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(const WriterOptions& rhs):
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate(*(rhs.privateBits.get())))) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(WriterOptions& rhs) {
+// swap privateBits with rhs
+WriterOptionsPrivate* l = privateBits.release();
+privateBits.reset(rhs.privateBits.release());
+rhs.privateBits.reset(l);
+  }
+
+  WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) {
+if (this != &rhs) {
+  privateBits.reset(new 
WriterOptionsPrivate(*(rhs.privateBits.get())));
+}
+return *this;
+  }
+
+  WriterOptions::~WriterOptions() {
+// PASS
+  }
+
+  WriterOptions& WriterOptions::setStripeSize(uint64_t size) {
+privateBits->stripeSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getStripeSize() const {
+return privateBits->stripeSize;
+  }
+
+  WriterOptions& WriterOptions::setBlockSize(uint64_t size) {
+privateBits->blockSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBlockSize() const {
+return privateBits->blockSize;
+  }
+
+  WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) {
+privateBits->rowIndexStride = stride;
+return *this;
+  }
+
+  uint64_t WriterOptions::getRowIndexStride() const {
+return privateBits->rowIndexStride;
+  }
+
+  WriterOptions& WriterOptions::setBufferSize(uint64_t size) {
+privateBits->bufferSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBufferSize() const {
+return privateBits->bufferSize;
+  }
+
+  WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) {
+privateBits->dictionaryKeySizeThreshold = val;
+return *this;
+  }
+
 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120704852
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -0,0 +1,507 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Int128.hh"
+#include "orc/Writer.hh"
+
+#include "ByteRLE.hh"
+#include "ColumnWriter.hh"
+#include "RLE.hh"
+#include "Statistics.hh"
+#include "Timezone.hh"
+
+namespace orc {
+  StreamsFactory::~StreamsFactory() {
+//PASS
+  }
+
+  class StreamsFactoryImpl : public StreamsFactory {
+  public:
+StreamsFactoryImpl(
+   const WriterOptions& writerOptions,
+   OutputStream * outputStream) :
+   options(writerOptions),
+   outStream(outputStream) {
+   }
+
+virtual std::unique_ptr<BufferedOutputStream>
+createStream(proto::Stream_Kind kind) override;
+  private:
+const WriterOptions& options;
+OutputStream * outStream;
+  };
+
+  std::unique_ptr<BufferedOutputStream> StreamsFactoryImpl::createStream(
+  proto::Stream_Kind) {
+// In the future, we can decide compression strategy and modifier
+// based on stream kind. But for now we just use the setting from
+// WriterOption
+return createCompressor(
+options.getCompression(),
+outStream,
+options.getCompressionStrategy(),
+options.getBufferSize(),
+options.getBlockSize(),
+*options.getMemoryPool());
+  }
+
+  std::unique_ptr<StreamsFactory> createStreamsFactory(
+const WriterOptions& options,
+OutputStream * outStream) {
+return std::unique_ptr<StreamsFactory>(
+   new StreamsFactoryImpl(options, outStream));
+  }
+
+  RowIndexPositionRecorder::~RowIndexPositionRecorder() {
+// PASS
+  }
+
+  ColumnWriter::ColumnWriter(
+ const Type& type,
+ StreamsFactory& factory,
+ const WriterOptions& options) :
+columnId(type.getColumnId()),
+streamsFactory(factory),
+colIndexStatistics(),
+colStripeStatistics(),
+colFileStatistics(),
+enableIndex(options.getEnableIndex()),
+enableStats(options.getEnableStats()),
+rowIndex(),
+rowIndexEntry(),
+rowIndexPosition(),
+memPool(*options.getMemoryPool()),
+indexStream() {
+
+std::unique_ptr<BufferedOutputStream> presentStream =
+factory.createStream(proto::Stream_Kind_PRESENT);
+notNullEncoder = createBooleanRleEncoder(std::move(presentStream));
+
+if (enableIndex || enableStats) {
+  bool enableStrCmp = options.getEnableStrStatsCmp();
+  colIndexStatistics = createColumnStatistics(type, enableStrCmp);
+  if (enableStats) {
+colStripeStatistics = createColumnStatistics(type, enableStrCmp);
+colFileStatistics = createColumnStatistics(type, enableStrCmp);
+  }
+}
+
+if (enableIndex) {
+  rowIndex = std::unique_ptr<proto::RowIndex>(new proto::RowIndex());
+  rowIndexEntry =
+std::unique_ptr<proto::RowIndexEntry>(new proto::RowIndexEntry());
+  rowIndexPosition 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120701079
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -0,0 +1,507 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Int128.hh"
+#include "orc/Writer.hh"
+
+#include "ByteRLE.hh"
+#include "ColumnWriter.hh"
+#include "RLE.hh"
+#include "Statistics.hh"
+#include "Timezone.hh"
+
+namespace orc {
+  StreamsFactory::~StreamsFactory() {
+//PASS
+  }
+
+  class StreamsFactoryImpl : public StreamsFactory {
+  public:
+StreamsFactoryImpl(
+   const WriterOptions& writerOptions,
+   OutputStream * outputStream) :
+   options(writerOptions),
+   outStream(outputStream) {
+   }
+
+virtual std::unique_ptr<BufferedOutputStream>
+createStream(proto::Stream_Kind kind) override;
+  private:
+const WriterOptions& options;
+OutputStream * outStream;
+  };
+
+  std::unique_ptr<BufferedOutputStream> StreamsFactoryImpl::createStream(
+  proto::Stream_Kind) {
+// In the future, we can decide compression strategy and modifier
+// based on stream kind. But for now we just use the setting from
+// WriterOption
+return createCompressor(
+options.getCompression(),
+outStream,
+options.getCompressionStrategy(),
+options.getBufferSize(),
+options.getBlockSize(),
+*options.getMemoryPool());
+  }
+
+  std::unique_ptr<StreamsFactory> createStreamsFactory(
+const WriterOptions& options,
+OutputStream * outStream) {
+return std::unique_ptr<StreamsFactory>(
+   new StreamsFactoryImpl(options, outStream));
+  }
+
+  RowIndexPositionRecorder::~RowIndexPositionRecorder() {
+// PASS
+  }
+
+  ColumnWriter::ColumnWriter(
+ const Type& type,
+ StreamsFactory& factory,
+ const WriterOptions& options) :
+columnId(type.getColumnId()),
+streamsFactory(factory),
+colIndexStatistics(),
+colStripeStatistics(),
+colFileStatistics(),
+enableIndex(options.getEnableIndex()),
+enableStats(options.getEnableStats()),
+rowIndex(),
+rowIndexEntry(),
+rowIndexPosition(),
+memPool(*options.getMemoryPool()),
+indexStream() {
+
+std::unique_ptr<BufferedOutputStream> presentStream =
+factory.createStream(proto::Stream_Kind_PRESENT);
+notNullEncoder = createBooleanRleEncoder(std::move(presentStream));
+
+if (enableIndex || enableStats) {
+  bool enableStrCmp = options.getEnableStrStatsCmp();
+  colIndexStatistics = createColumnStatistics(type, enableStrCmp);
+  if (enableStats) {
+colStripeStatistics = createColumnStatistics(type, enableStrCmp);
+colFileStatistics = createColumnStatistics(type, enableStrCmp);
+  }
+}
+
+if (enableIndex) {
+  rowIndex = std::unique_ptr<proto::RowIndex>(new proto::RowIndex());
+  rowIndexEntry =
+std::unique_ptr<proto::RowIndexEntry>(new proto::RowIndexEntry());
+  rowIndexPosition 

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120699358
  
--- Diff: c++/include/orc/OrcFile.hh ---
@@ -119,6 +120,17 @@ namespace orc {
* @param path the name of the file in the local file system
*/
   ORC_UNIQUE_PTR<OutputStream> writeLocalFile(const std::string& path);
+
+  /**
+   * Create a writer to write the ORC file.
+   * @param type the type of data to be written
+   * @param stream the stream to write to
+   * @param options the options for writing the file
+   */
+  ORC_UNIQUE_PTR<Writer> createWriter(
+  const Type& type,
+  OutputStream * stream,
--- End diff --

I will fix all occurrences of the space issue.


---


[GitHub] orc issue #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-01 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/128
  
@majetideepak have you had a chance to take a look? Sorry for the big diff.


---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-05-24 Thread xndai
GitHub user xndai opened a pull request:

https://github.com/apache/orc/pull/128

ORC-178 Implement Basic C++ Writer and Writer Option

1. Add basic Writer and WriterOption
2. Add StructColumnWriter and IntegerColumnWriter. With them, we will be
able to write a complete ORC file that contains only int columns. To
limit the scope of this change, we will add more column writers later.
3. Add a base class for column statistics impl classes. This
is to be used by the base class of ColumnWriter so we don't have to
duplicate a bunch of logic everywhere.
4. Right now the UTs are pretty primitive. We will add more UTs
(especially for stats and index) as we are adding more column writers.
At this moment, it's really hard to extract more UTs from our code base
without introducing additional column writers.

Change-Id: I694dda600136e5d285e70a8124aa1cc334f4ef14

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xndai/orc dev_writer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/128.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #128


commit 013d18dc459f75763fa1951717c9d3624ab38519
Author: Xiening.Dai <xiening@alibaba-inc.com>
Date:   2017-05-24T22:01:52Z

ORC-178 Implement Basic C++ Writer and Writer Option

1. Add basic Writer and WriterOption
2. Add StructColumnWriter and IntegerColumnWriter. With them, we will be
able to write a complete ORC file that contains only int columns. To
limit the scope of this change, we will add more column writers later.
3. Add a base class for column statistics impl classes. This
is to be used by the base class of ColumnWriter so we don't have to
duplicate a bunch of logic everywhere.
4. Right now the UTs are pretty primitive. We will add more UTs
(especially for stats and index) as we are adding more column writers.
At this moment, it's really hard to extract more UTs from our code base
without introducing additional column writers.

Change-Id: I694dda600136e5d285e70a8124aa1cc334f4ef14
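
For anyone who wants to try the new API, here is a minimal end-to-end sketch of
writing a one-int-column file. It is written against the public C++ headers as
they exist today; Type::buildTypeFromString, createRowBatch and the vector
batch classes are assumptions and may not all be part of this initial PR.

`
#include "orc/OrcFile.hh"

#include <memory>

int main() {
  // Open the output file and build a one-column schema.
  std::unique_ptr<orc::OutputStream> outStream =
      orc::writeLocalFile("/tmp/ints.orc");
  std::unique_ptr<orc::Type> schema(
      orc::Type::buildTypeFromString("struct<x:int>"));

  orc::WriterOptions options;
  options.setStripeSize(64 * 1024 * 1024);

  std::unique_ptr<orc::Writer> writer =
      orc::createWriter(*schema, outStream.get(), options);

  // Fill one batch of 1024 rows and hand it to the writer.
  std::unique_ptr<orc::ColumnVectorBatch> batch = writer->createRowBatch(1024);
  orc::StructVectorBatch& root =
      dynamic_cast<orc::StructVectorBatch&>(*batch);
  orc::LongVectorBatch& col =
      dynamic_cast<orc::LongVectorBatch&>(*root.fields[0]);
  for (uint64_t i = 0; i < 1024; ++i) {
    col.data[i] = static_cast<int64_t>(i);
  }
  root.numElements = 1024;
  col.numElements = 1024;

  writer->add(*batch);
  writer->close();
  return 0;
}
`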




---


[GitHub] orc issue #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/126
  
Thanks @majetideepak  for reviewing. @prasanthj @omalley please help review 
and/or accept the PR. Thx!


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118057983
  
--- Diff: c++/test/TestByteRLEEncoder.cc ---
@@ -0,0 +1,231 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "ByteRLE.hh"
+#include "MemoryOutputStream.hh"
+
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#include 
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateNotNull(uint64_t numValues,
+   uint64_t numNulls,
+   char ** notNull) {
+if (numNulls != 0 && notNull != nullptr) {
+  *notNull = new char[numValues];
+  memset(*notNull, 1, numValues);
+  while (numNulls > 0) {
+uint64_t pos = static_cast<uint64_t>(std::rand()) % numValues;
+if ((*notNull)[pos]) {
+  (*notNull)[pos] = static_cast<char>(0);
+  --numNulls;
+}
+  }
+}
+  }
+
+  char * generateData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
+for (uint64_t i = 0; i < numValues; ++i) {
+res[i] = static_cast<char>(std::rand() % 256);
+}
+return res;
+  }
+
+  char * generateBoolData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
+for (uint64_t i = 0; i < numValues; ++i) {
+res[i] = static_cast<char>(std::rand() % 2);
+  }
+return res;
+  }
+
+  void decodeAndVerify(
+   const MemoryOutputStream& memStream,
+   char * data,
+   uint64_t numValues,
+   char* notNull) {
+
+std::unique_ptr<SeekableInputStream> inStream(
+  new SeekableArrayInputStream(memStream.getData(), memStream.getLength()));
+
+std::unique_ptr<ByteRleDecoder> decoder =
+  createByteRleDecoder(std::move(inStream));
+
+char* decodedData = new char[numValues];
--- End diff --

I don't see a huge difference here - we are manipulating a raw buffer anyway.
I would just leave them as they are unless you feel strongly about it.


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118057495
  
--- Diff: c++/test/TestByteRLEEncoder.cc ---
@@ -0,0 +1,231 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "ByteRLE.hh"
+#include "MemoryOutputStream.hh"
+
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#include 
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateNotNull(uint64_t numValues,
+   uint64_t numNulls,
+   char ** notNull) {
+if (numNulls != 0 && notNull != nullptr) {
+  *notNull = new char[numValues];
+  memset(*notNull, 1, numValues);
+  while (numNulls > 0) {
+uint64_t pos = static_cast<uint64_t>(std::rand()) % numValues;
+if ((*notNull)[pos]) {
+  (*notNull)[pos] = static_cast<char>(0);
+  --numNulls;
+}
+  }
+}
+  }
+
+  char * generateData(uint64_t numValues,
+  uint64_t numNulls = 0,
+  char ** notNull = nullptr) {
+generateNotNull(numValues, numNulls, notNull);
+char * res = new char[numValues];
--- End diff --

I have fixed these as well as the same issues in TestRLEv1Encoder.cc. See 
my next iteration.


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048655
  
--- Diff: c++/src/RLEv1.hh ---
@@ -26,6 +26,59 @@
 
 namespace orc {
 
+class RleEncoderV1 : public RleEncoder {
+public:
+RleEncoderV1(std::unique_ptr<BufferedOutputStream> outStream,
+ bool hasSigned);
+~RleEncoderV1();
+
+/**
+ * Encode the next batch of values.
+ * @param data the array to be written
+ * @param numValues the number of values to write
+ * @param notNull If the pointer is null, all values are writen. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+void add(const int64_t* data, uint64_t numValues,
+ const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+uint64_t getBufferSize() const override {
+return outputStream->getSize();
+}
+
+/**
+ * Flushing underlying BufferedOutputStream
+ */
+uint64_t flush() override;
+
+/**
+ * record current position
+ * @param recorder use the recorder to record current positions
+ */
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+private:
+std::unique_ptr<BufferedOutputStream> outputStream;
+bool isSigned;
+int64_t* literals;
+int numLiterals;
+int64_t delta;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
--- End diff --

done


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048627
  
--- Diff: c++/src/RLEv1.cc ---
@@ -26,8 +26,173 @@
 namespace orc {
 
 const uint64_t MINIMUM_REPEAT = 3;
+const uint64_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+
 const uint64_t BASE_128_MASK = 0x7f;
 
+const int MAX_DELTA = 127;
+const int MIN_DELTA = -128;
+const int MAX_LITERAL_SIZE = 128;
+
+RleEncoderV1::RleEncoderV1(
+  std::unique_ptr<BufferedOutputStream> outStream,
+  bool hasSigned):
+  outputStream(std::move(outStream)) {
+  isSigned = hasSigned;
+  literals = new int64_t[MAX_LITERAL_SIZE];
--- End diff --

Same as above.


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048556
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr<BufferedOutputStream> output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr<BufferedOutputStream> outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
+
+void writeByte(char c);
+void writeValues();
+void write(char c);
+  };
+
+  ByteRleEncoderImpl::ByteRleEncoderImpl(
+std::unique_ptr<BufferedOutputStream> output)
+  : outputStream(std::move(output)) {
+literals = new char[MAX_LITERAL_SIZE];
--- End diff --

Thanks for catching that. I added a free() call. Regarding RAII, I think it's
a bit of overkill here. The array is guaranteed to be allocated in the
constructor and freed in the destructor, so the logic is quite clear.
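
For reference, a sketch of the manual ownership pattern described above (names
are illustrative, and delete[] is shown as the matching release for new[]):

`
// The buffer is allocated in the constructor and released in the destructor;
// copying is disabled so the raw pointer has exactly one owner.
class ManualBufferSketch {
public:
  ManualBufferSketch() : literals(new char[MAX_LITERAL_SIZE]) {}
  ~ManualBufferSketch() { delete[] literals; }

  ManualBufferSketch(const ManualBufferSketch&) = delete;
  ManualBufferSketch& operator=(const ManualBufferSketch&) = delete;

private:
  static const int MAX_LITERAL_SIZE = 128;
  char* literals;
};
`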


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048038
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr<BufferedOutputStream> output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr<BufferedOutputStream> outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
--- End diff --

done.


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118048019
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr<BufferedOutputStream> output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr<BufferedOutputStream> outputStream;
+char * literals;
--- End diff --

done.


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118046740
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr<BufferedOutputStream> output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr<BufferedOutputStream> outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
+
+void writeByte(char c);
+void writeValues();
+void write(char c);
+  };
+
+  ByteRleEncoderImpl::ByteRleEncoderImpl(
+std::unique_ptr<BufferedOutputStream> output)
+  : outputStream(std::move(output)) {
+literals = new char[MAX_LITERAL_SIZE];
+numLiterals = 0;
+tailRunLength = 0;
+repeat = false;
+bufferPosition = 0;
+bufferLength = 0;
+buffer = nullptr;
+  }
+
+  ByteRleEncoderImpl::~ByteRleEncoderImpl() {
+// PASS
+  }
+
+  void ByteRleEncoderImpl::writeByte(char c) {
+if (bufferPosition == bufferLength) {
+  int addedSize = 0;
+  if (!outputStream->Next(reinterpret_cast<void **>(&buffer), &addedSize)) {
+throw std::bad_alloc();
--- End diff --

Next() is a method we inherit from protobuf's ZeroCopyOutputStream. The
current implementation of BufferedOutputStream never returns false, but we
still want this check so that callers honor the interface contract.
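
For context, this is the caller-side pattern being kept - a minimal sketch,
with a made-up helper name, of how a ZeroCopyOutputStream user is expected to
treat the boolean result of Next() even when the concrete stream never fails:

`
#include <stdexcept>

#include <google/protobuf/io/zero_copy_stream.h>

// Hypothetical helper mirroring writeByte() above: ask the stream for the
// next raw output chunk and fail loudly if none can be provided.
static char* nextChunk(google::protobuf::io::ZeroCopyOutputStream& out,
                       int& chunkSize) {
  void* data = nullptr;
  if (!out.Next(&data, &chunkSize)) {
    throw std::runtime_error("ZeroCopyOutputStream::Next() failed");
  }
  return static_cast<char*>(data);
}
`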


---


[GitHub] orc pull request #126: ORC-191 Implement RLE v1 encoder

2017-05-23 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/126#discussion_r118044090
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -27,6 +27,272 @@
 namespace orc {
 
   const size_t MINIMUM_REPEAT = 3;
+  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const size_t MAX_LITERAL_SIZE = 128;
+
+  ByteRleEncoder::~ByteRleEncoder() {
+// PASS
+  }
+
+  class ByteRleEncoderImpl : public ByteRleEncoder {
+  public:
+ByteRleEncoderImpl(std::unique_ptr<BufferedOutputStream> output);
+virtual ~ByteRleEncoderImpl();
+
+/**
+ * Encode the next batch of values
+ * @param data to be encoded
+ * @param numValues the number of values to be encoded
+ * @param notNull If the pointer is null, all values are read. If the
+ *pointer is not null, positions that are false are skipped.
+ */
+virtual void add(const char* data, uint64_t numValues,
+  const char* notNull) override;
+
+/**
+ * Get size of buffer used so far.
+ */
+virtual uint64_t getBufferSize() const override;
+
+/**
+ * Flushing underlying BufferedOutputStream
+*/
+virtual uint64_t flush() override;
+
+virtual void recordPosition(PositionRecorder* recorder) const override;
+
+  protected:
+std::unique_ptr<BufferedOutputStream> outputStream;
+char * literals;
+int numLiterals;
+bool repeat;
+int tailRunLength;
+int bufferPosition;
+int bufferLength;
+char * buffer;
+
+void writeByte(char c);
+void writeValues();
+void write(char c);
+  };
+
+  ByteRleEncoderImpl::ByteRleEncoderImpl(
+std::unique_ptr<BufferedOutputStream> output)
+  : outputStream(std::move(output)) {
+literals = new char[MAX_LITERAL_SIZE];
--- End diff --

I use unique_ptr here to indicate ByteRleEncoderImpl has full ownership of 
this buffer. Thanks for catching that.
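
Roughly what that ownership change looks like - a sketch, not the exact
committed code:

`
#include <memory>

class ByteRleEncoderSketch {
public:
  ByteRleEncoderSketch() : literals(new char[MAX_LITERAL_SIZE]) {
    // unique_ptr makes the exclusive ownership explicit and removes the
    // manual delete[] from the destructor.
  }

private:
  static const int MAX_LITERAL_SIZE = 128;
  std::unique_ptr<char[]> literals;
};
`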


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc issue #122: ORC-192 Implement zlib compresion stream

2017-05-22 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/122
  
Hi @majetideepak, what happens next now that you have approved? Are you able
to accept this pull request?


---


[GitHub] orc pull request #122: ORC-192 Implement zlib compresion stream

2017-05-22 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/122#discussion_r117822601
  
--- Diff: c++/src/Compression.cc ---
@@ -636,6 +884,33 @@ DIAGNOSTIC_POP
 return static_cast(result);
   }
 
+  std::unique_ptr<BufferedOutputStream>
+ createCompressor(
+  CompressionKind kind,
+  OutputStream * outStream,
+  CompressionStrategy strategy,
+  uint64_t bufferCapacity,
+  uint64_t blockSize,
+  MemoryPool& pool) {
+switch (static_cast<int64_t>(kind)) {
+case CompressionKind_NONE: {
+  return std::unique_ptr<BufferedOutputStream>
+(new BufferedOutputStream(pool, outStream, bufferCapacity, blockSize));
+}
+case CompressionKind_ZLIB: {
+  int level = (strategy == CompressionStrategy_SPEED) ? -1 : 9;
--- End diff --

According to https://orc.apache.org/docs/hive-config.html, there are only two
compression strategies defined: SPEED and COMPRESSION. I also checked the Java
implementation: SPEED maps to zlib level Z_BEST_SPEED + 1, and COMPRESSION
maps to Z_DEFAULT_COMPRESSION. I will do the same for C++.

Java implementation for your reference -

WriterImpl.java

`
CompressionCodec result = physicalWriter.getCompressionCodec();
if (result != null) {
  switch (kind) {
case BLOOM_FILTER:
case DATA:
case DICTIONARY_DATA:
case BLOOM_FILTER_UTF8:
  if (compressionStrategy == OrcFile.CompressionStrategy.SPEED) {
result = 
result.modify(EnumSet.of(CompressionCodec.Modifier.FAST,
CompressionCodec.Modifier.TEXT));
  } else {
result = 
result.modify(EnumSet.of(CompressionCodec.Modifier.DEFAULT,
CompressionCodec.Modifier.TEXT));
  }
  break;
case LENGTH:
case DICTIONARY_COUNT:
case PRESENT:
case ROW_INDEX:
case SECONDARY:
  // easily compressed using the fastest modes
  result = 
result.modify(EnumSet.of(CompressionCodec.Modifier.FASTEST,
  CompressionCodec.Modifier.BINARY));
  break;
default:
  LOG.info("Missing ORC compression modifiers for " + kind);
  break;
  }
}

`

ZlibCodec.java

`
public CompressionCodec modify(/* @Nullable */ EnumSet<Modifier> modifiers) {

if (modifiers == null) {
  return this;
}

int l = this.level;
int s = this.strategy;

for (Modifier m : modifiers) {
  switch (m) {
  case BINARY:
/* filtered == less LZ77, more huffman */
s = Deflater.FILTERED;
break;
  case TEXT:
s = Deflater.DEFAULT_STRATEGY;
break;
  case FASTEST:
// deflate_fast looking for 8 byte patterns
l = Deflater.BEST_SPEED;
break;
  case FAST:
// deflate_fast looking for 16 byte patterns
l = Deflater.BEST_SPEED + 1;
break;
  case DEFAULT:
// deflate_slow looking for 128 byte patterns
l = Deflater.DEFAULT_COMPRESSION;
break;
  default:
break;
  }
}
return new ZlibCodec(l, s);
  }
`
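
A C++ sketch of the same mapping, using the standard zlib level constants; the
boolean flag stands in for CompressionStrategy_SPEED, and the committed code
may express this differently:

`
#include <zlib.h>

// SPEED       -> Z_BEST_SPEED + 1      (deflate_fast, larger patterns)
// COMPRESSION -> Z_DEFAULT_COMPRESSION (deflate_slow, better ratio)
inline int zlibCompressionLevel(bool speedStrategy) {
  return speedStrategy ? Z_BEST_SPEED + 1 : Z_DEFAULT_COMPRESSION;
}
`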


---


[GitHub] orc pull request #122: ORC-192 Implement zlib compresion stream

2017-05-20 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/122#discussion_r117622782
  
--- Diff: c++/src/Compression.cc ---
@@ -636,6 +884,33 @@ DIAGNOSTIC_POP
 return static_cast(result);
   }
 
+  std::unique_ptr<BufferedOutputStream>
+ createCompressor(
+  CompressionKind kind,
+  OutputStream * outStream,
+  CompressionStrategy strategy,
+  uint64_t bufferCapacity,
+  uint64_t blockSize,
+  MemoryPool& pool) {
+switch (static_cast<int64_t>(kind)) {
+case CompressionKind_NONE: {
+  return std::unique_ptr<BufferedOutputStream>
+(new BufferedOutputStream(pool, outStream, bufferCapacity, blockSize));
+}
+case CompressionKind_ZLIB: {
+  int level = (strategy == CompressionStrategy_SPEED) ? -1 : 9;
--- End diff --

what are the corresponding levels for these three settings? Are they 
documented anywhere?


---


[GitHub] orc issue #126: ORC-191 Implement RLE v1 encoder

2017-05-19 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/126
  
@omalley @majetideepak Can you also take a look at this? Thanks.


---


[GitHub] orc issue #122: ORC-192 Implement zlib compresion stream

2017-05-19 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/122
  
@majetideepak @omalley please take another look.


---


[GitHub] orc pull request #122: ORC-192 Implement zlib compresion stream

2017-05-18 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/122#discussion_r117381049
  
--- Diff: c++/src/Compression.cc ---
@@ -33,6 +33,254 @@
 
 namespace orc {
 
+  class CompressionStreamBase: public BufferedOutputStream {
+  public:
+CompressionStreamBase(OutputStream * outStream,
+  int compressionLevel,
+  uint64_t capacity,
+  uint64_t blockSize,
+  MemoryPool& pool);
+
+virtual bool Next(void** data, int*size) override = 0;
+virtual void BackUp(int count) override;
+
+virtual std::string getName() const override = 0;
+virtual uint64_t flush() override;
+
+virtual bool isCompressed() const override { return true; }
+virtual uint64_t getSize() const override;
+
+  protected:
+void writeHeader(char * buffer, size_t compressedSize, bool original) {
+  buffer[0] = static_cast<char>((compressedSize << 1) + (original ? 1 : 0));
+  buffer[1] = static_cast<char>(compressedSize >> 7);
+  buffer[2] = static_cast<char>(compressedSize >> 15);
+}
+
+// Buffer to hold uncompressed data until user calls Next()
+DataBuffer<char> rawInputBuffer;
+
+// Compress level
+int level;
+
+// Compressed data output buffer
+char * outputBuffer;
+
+// Size for compressionBuffer
+int bufferSize;
+
+// Compress output position
+int outputPosition;
+
+// Compress output buffer size
+int outputSize;
+  };
+
+  CompressionStreamBase::CompressionStreamBase(OutputStream * outStream,
+   int compressionLevel,
+   uint64_t capacity,
+   uint64_t blockSize,
+   MemoryPool& pool) :
+BufferedOutputStream(pool,
+ 
outStream,
+ 
capacity,
+ 
blockSize),
+rawInputBuffer(pool, 
blockSize),
+level(compressionLevel),
+outputBuffer(nullptr),
+bufferSize(0),
+outputPosition(0),
+outputSize(0) {
+// PASS
+  }
+
+  void CompressionStreamBase::BackUp(int count) {
+if (count > bufferSize) {
+  throw std::logic_error("Can't backup that much!");
+}
+bufferSize -= count;
+  }
+
+  uint64_t CompressionStreamBase::flush() {
+void * data;
+int size;
+if (!Next(&data, &size)) {
+  throw std::logic_error("Failed to flush compression buffer.");
--- End diff --

For now, I will just use std::runtime_error in the compression path. We can
come back and fix it later.


---


[GitHub] orc pull request #122: ORC-192 Implement zlib compresion stream

2017-05-18 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/122#discussion_r117380313
  
--- Diff: c++/src/Compression.cc ---
@@ -33,6 +33,254 @@
 
 namespace orc {
 
+  class CompressionStreamBase: public BufferedOutputStream {
+  public:
+CompressionStreamBase(OutputStream * outStream,
+  int compressionLevel,
+  uint64_t capacity,
+  uint64_t blockSize,
+  MemoryPool& pool);
+
+virtual bool Next(void** data, int*size) override = 0;
+virtual void BackUp(int count) override;
+
+virtual std::string getName() const override = 0;
+virtual uint64_t flush() override;
+
+virtual bool isCompressed() const override { return true; }
+virtual uint64_t getSize() const override;
+
+  protected:
+void writeHeader(char * buffer, size_t compressedSize, bool original) {
+  buffer[0] = static_cast<char>((compressedSize << 1) + (original ? 1 : 0));
+  buffer[1] = static_cast<char>(compressedSize >> 7);
+  buffer[2] = static_cast<char>(compressedSize >> 15);
+}
+
+// Buffer to hold uncompressed data until user calls Next()
+DataBuffer<char> rawInputBuffer;
+
+// Compress level
+int level;
+
+// Compressed data output buffer
+char * outputBuffer;
+
+// Size for compressionBuffer
+int bufferSize;
+
+// Compress output position
+int outputPosition;
+
+// Compress output buffer size
+int outputSize;
+  };
+
+  CompressionStreamBase::CompressionStreamBase(OutputStream * outStream,
+   int compressionLevel,
+   uint64_t capacity,
+   uint64_t blockSize,
+   MemoryPool& pool) :
+BufferedOutputStream(pool,
+ 
outStream,
+ 
capacity,
+ 
blockSize),
+rawInputBuffer(pool, 
blockSize),
+level(compressionLevel),
+outputBuffer(nullptr),
+bufferSize(0),
+outputPosition(0),
+outputSize(0) {
+// PASS
+  }
+
+  void CompressionStreamBase::BackUp(int count) {
+if (count > bufferSize) {
+  throw std::logic_error("Can't backup that much!");
+}
+bufferSize -= count;
+  }
+
+  uint64_t CompressionStreamBase::flush() {
+void * data;
+int size;
+if (!Next(&data, &size)) {
+  throw std::logic_error("Failed to flush compression buffer.");
--- End diff --

Or we could keep both RuntimeError and LogicError defined in the orc namespace
to differentiate between different types of errors.


---


[GitHub] orc pull request #122: ORC-192 Implement zlib compresion stream

2017-05-18 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/122#discussion_r117379895
  
--- Diff: c++/src/Compression.cc ---
@@ -33,6 +33,254 @@
 
 namespace orc {
 
+  class CompressionStreamBase: public BufferedOutputStream {
+  public:
+CompressionStreamBase(OutputStream * outStream,
+  int compressionLevel,
+  uint64_t capacity,
+  uint64_t blockSize,
+  MemoryPool& pool);
+
+virtual bool Next(void** data, int*size) override = 0;
+virtual void BackUp(int count) override;
+
+virtual std::string getName() const override = 0;
+virtual uint64_t flush() override;
+
+virtual bool isCompressed() const override { return true; }
+virtual uint64_t getSize() const override;
+
+  protected:
+void writeHeader(char * buffer, size_t compressedSize, bool original) {
+  buffer[0] = static_cast<char>((compressedSize << 1) + (original ? 1 : 0));
+  buffer[1] = static_cast<char>(compressedSize >> 7);
+  buffer[2] = static_cast<char>(compressedSize >> 15);
+}
+
+// Buffer to hold uncompressed data until user calls Next()
+DataBuffer<char> rawInputBuffer;
+
+// Compress level
+int level;
+
+// Compressed data output buffer
+char * outputBuffer;
+
+// Size for compressionBuffer
+int bufferSize;
+
+// Compress output position
+int outputPosition;
+
+// Compress output buffer size
+int outputSize;
+  };
+
+  CompressionStreamBase::CompressionStreamBase(OutputStream * outStream,
+   int compressionLevel,
+   uint64_t capacity,
+   uint64_t blockSize,
+   MemoryPool& pool) :
+BufferedOutputStream(pool,
+ 
outStream,
+ 
capacity,
+ 
blockSize),
+rawInputBuffer(pool, 
blockSize),
+level(compressionLevel),
+outputBuffer(nullptr),
+bufferSize(0),
+outputPosition(0),
+outputSize(0) {
+// PASS
+  }
+
+  void CompressionStreamBase::BackUp(int count) {
+if (count > bufferSize) {
+  throw std::logic_error("Can't backup that much!");
+}
+bufferSize -= count;
+  }
+
+  uint64_t CompressionStreamBase::flush() {
+void * data;
+int size;
+if (!Next(&data, &size)) {
+  throw std::logic_error("Failed to flush compression buffer.");
--- End diff --

We also throw std::logic_error in other places. If that's the goal, I'd
suggest we use a single exception type for all ORC code, e.g. OrcException.
You can open a JIRA for this, and it can be done in a separate change.
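
A rough sketch of what such a single exception type could look like
(hypothetical, not something this PR adds):

`
#include <stdexcept>
#include <string>

namespace orc {
  // One exception type for all ORC errors, as suggested above.
  class OrcException : public std::runtime_error {
  public:
    explicit OrcException(const std::string& msg) : std::runtime_error(msg) {}
  };
}
`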


---


[GitHub] orc pull request #122: ORC-192 Implement zlib compresion stream

2017-05-18 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/122#discussion_r117378369
  
--- Diff: c++/src/Compression.cc ---
@@ -33,6 +33,254 @@
 
 namespace orc {
 
+  class CompressionStreamBase: public BufferedOutputStream {
+  public:
+CompressionStreamBase(OutputStream * outStream,
+  int compressionLevel,
+  uint64_t capacity,
+  uint64_t blockSize,
+  MemoryPool& pool);
+
+virtual bool Next(void** data, int*size) override = 0;
+virtual void BackUp(int count) override;
+
+virtual std::string getName() const override = 0;
+virtual uint64_t flush() override;
+
+virtual bool isCompressed() const override { return true; }
+virtual uint64_t getSize() const override;
+
+  protected:
+void writeHeader(char * buffer, size_t compressedSize, bool original) {
+  buffer[0] = static_cast<char>((compressedSize << 1) + (original ? 1 : 0));
+  buffer[1] = static_cast<char>(compressedSize >> 7);
+  buffer[2] = static_cast<char>(compressedSize >> 15);
+}
+
+// Buffer to hold uncompressed data until user calls Next()
+DataBuffer<char> rawInputBuffer;
+
+// Compress level
+int level;
+
+// Compressed data output buffer
+char * outputBuffer;
+
+// Size for compressionBuffer
+int bufferSize;
+
+// Compress output position
+int outputPosition;
+
+// Compress output buffer size
+int outputSize;
+  };
+
+  CompressionStreamBase::CompressionStreamBase(OutputStream * outStream,
+   int compressionLevel,
+   uint64_t capacity,
+   uint64_t blockSize,
+   MemoryPool& pool) :
+BufferedOutputStream(pool,
+ 
outStream,
+ 
capacity,
+ 
blockSize),
+rawInputBuffer(pool, 
blockSize),
+level(compressionLevel),
+outputBuffer(nullptr),
+bufferSize(0),
+outputPosition(0),
+outputSize(0) {
+// PASS
+  }
+
+  void CompressionStreamBase::BackUp(int count) {
+if (count > bufferSize) {
+  throw std::logic_error("Can't backup that much!");
+}
+bufferSize -= count;
+  }
+
+  uint64_t CompressionStreamBase::flush() {
+void * data;
+int size;
+if (!Next(&data, &size)) {
+  throw std::logic_error("Failed to flush compression buffer.");
--- End diff --

how about just using std::runtime_error? I still don't see a need to create 
yet another exception class.


---


[GitHub] orc pull request #122: ORC-192 Implement zlib compresion stream

2017-05-18 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/122#discussion_r117363862
  
--- Diff: c++/src/Compression.cc ---
@@ -636,6 +884,33 @@ DIAGNOSTIC_POP
 return static_cast(result);
   }
 
+  std::unique_ptr<BufferedOutputStream>
+ createCompressor(
+  CompressionKind kind,
+  OutputStream * outStream,
+  CompressionStrategy strategy,
+  uint64_t bufferCapacity,
+  uint64_t blockSize,
+  MemoryPool& pool) {
+switch (static_cast<int64_t>(kind)) {
+case CompressionKind_NONE: {
+  return std::unique_ptr<BufferedOutputStream>
+(new BufferedOutputStream(pool, outStream, bufferCapacity, blockSize));
+}
+case CompressionKind_ZLIB: {
+  int level = strategy == CompressionStrategy_SPEED ? 1 : 9;
--- End diff --

Will do. Thx.


---


[GitHub] orc issue #122: ORC-192 Implement zlib compresion stream

2017-05-18 Thread xndai
Github user xndai commented on the issue:

https://github.com/apache/orc/pull/122
  
@omalley can you please take a look at this? Thx.


---

