[GitHub] orc issue #308: Covert tool should create a lowercase schema

2018-10-02 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/308
  
Thinking about this more, I'd like to suggest that if the user asks for a 
lowercase schema that when the fields of a struct are parsed, they are 
lowercased there. That will prevent "aa" and "AA" both being converted to "aa", 
but not being merged into a single field.


---


[GitHub] orc issue #312: fix ORC-237 OrcFile.mergeFiles

2018-10-01 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/312
  
@yiyezhiqiu233 do you have an Apache jira ID? I'd like to assign ORC-237 to 
you.


---


[GitHub] orc pull request #315: ORC-411: Update pom file to work with openjdk 10.

2018-09-29 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/315

ORC-411: Update pom file to work with openjdk 10.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-411

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/315.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #315


commit 6be929f239dadd18ff35bff9c072fd315008354f
Author: Owen O'Malley 
Date:   2018-09-30T04:58:29Z

ORC-411: Update pom file to work with openjdk 10.




---


[GitHub] orc issue #308: Covert tool should create a lowercase schema

2018-09-21 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/308
  
Actually, this is doesn't match my experience. Hive doesn't support 
uppercase letters in the top level column names, but it does support them in 
sub-structs. Furthermore, ORC from other systems can handle the uppercase 
letters.

You might also try setting "orc.schema.evolution.case.sensitive" to false 
to get name-based matching of the ORC types that isn't case sensitive.

In terms of this patch, can you add a parameter that allows the user to 
downcase the column names?


---


[GitHub] orc pull request #309: ORC-403: [C++] Add checks to avoid invalid offsets in...

2018-09-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/309#discussion_r218244811
  
--- Diff: c++/src/Reader.cc ---
@@ -498,6 +498,12 @@ namespace orc {
   const proto::Stream& stream = currentStripeFooter.streams(i);
   uint64_t length = static_cast(stream.length());
   if (static_cast(stream.kind()) == 
StreamKind::StreamKind_ROW_INDEX) {
+if (offset + length > fileLength) {
--- End diff --

This check is really good, but it would also be nice to check that the 
stream is within the stripe, although you'd need to pass that in also. We could 
 pass the whole proto::StripeInformation in.


---


[GitHub] orc pull request #306: ORC-363: Enable zstd for java writer/reader

2018-09-14 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/306#discussion_r217801734
  
--- Diff: java/pom.xml ---
@@ -68,8 +68,8 @@
 ${project.build.directory}/testing-tmp
 ${project.basedir}/../../examples
 
-2.2.0
-2.7.3
+2.9.0
--- End diff --

We shouldn't bump the minimum version of Hadoop to 2.9. Let's instead put 
it into the shims. We'll need to add a method like createZstdCodec to 
HadoopShims. We'll also need a new implementation for HadoopShimsPre2_9 that 
handles variable length blocks, but not zstd. You'll also need a graceful 
message if they don't have the native libraries available.


---


[GitHub] orc issue #299: ORC-203 - Update StringStatistics to trim long strings to 10...

2018-09-11 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/299
  
I simplified the code to get the upper bound. I also added 4 test cases to 
test the different number of bytes in the utf-8 characters.
See https://github.com/omalley/orc/tree/orc-203



---


[GitHub] orc issue #305: ORC-399. Move Java compiler to version 8.

2018-09-01 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/305
  
It looks like the Calcite pom were pulling in some jar that is missing from 
Maven Central.


---


[GitHub] orc pull request #305: ORC-399. Move Java compiler to version 8.

2018-08-31 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/305

ORC-399. Move Java compiler to version 8.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-399

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/305.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #305


commit f093118410710c750d84d78d18c7d952254dda09
Author: Owen O'Malley 
Date:   2018-08-31T22:21:49Z

ORC-399. Move Java compiler to version 8.




---


[GitHub] orc issue #299: ORC-203 - Update StringStatistics to trim long strings to 10...

2018-08-29 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/299
  
Please make a couple more changes:
* Revert the change to java/bench/README.md.
* We should add a new WriterVersion (ORC_203) so that if there is a bug in 
the new code, we can try to recover in the reader.


---


[GitHub] orc pull request #304: ORC-397. Allow selective disabling of dictionary enco...

2018-08-29 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/304#discussion_r213566036
  
--- Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java ---
@@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary() 
throws Exception {
 
   }
 
+  /**
+   * Test that dictionaries can be disabled, per column. In this test, we 
want to disable DICTIONARY_V2 for the
+   * `longString` column (presumably for a low hit-ratio), while 
preserving DICTIONARY_V2 for `shortString`.
+   * @throws Exception on unexpected failure
+   */
+  @Test
+  public void testDisableDictionaryForSpecificColumn() throws Exception {
+final String SHORT_STRING_VALUE = "foo";
+final String  LONG_STRING_VALUE = "BR!!";
+
+TypeDescription schema =
+
TypeDescription.fromString("struct");
+
+Writer writer = OrcFile.createWriter(
+testFilePath,
+OrcFile.writerOptions(conf).setSchema(schema)
+.compress(CompressionKind.NONE)
+.bufferSize(1)
+.directEncodingColumns("longString"));
--- End diff --

We shouldn't change the default behavior of deciding between 
direct/dictionary encoding based on the data. With the parameter being the 
columns that are being forced to be direct, the default is the empty string, 
which is straight-forward.

Does that make sense?


---


[GitHub] orc issue #304: ORC-397. Allow selective disabling of dictionary encoding.

2018-08-28 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/304
  
I pushed an update that simplified the test case a bit.


---


[GitHub] orc issue #304: ORC-397. Allow selective disabling of dictionary encoding.

2018-08-28 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/304
  
Oh, rather than fix WriterImplV2 I simplified it to remove the duplicated 
code.


---


[GitHub] orc pull request #304: ORC-397. Allow selective disabling of dictionary enco...

2018-08-28 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/304

ORC-397. Allow selective disabling of dictionary encoding.

I've forward ported the patch from Mithun.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-397

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/304.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #304


commit b00a52e668efa9fb06821d7da52706ace81e31c0
Author: Owen O'Malley 
Date:   2018-08-28T23:12:05Z

ORC-397. Allow selective disabling of dictionary encoding.




---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208394170
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -683,6 +787,150 @@ public int hashCode() {
   result = 31 * result + (int) (sum ^ (sum >>> 32));
   return result;
 }
+
+/**
+ * A helper function that truncates the {@link Text} input
+ * based on {@link #MAX_BYTES_RECORDED} and increments
+ * the last codepoint by 1.
+ * @param text
+ * @return truncated Text value
+ */
+private static Text truncateUpperBound(final Text text) {
+
+  if(text.getBytes().length > MAX_BYTES_RECORDED) {
+return truncateUpperBound(text.getBytes());
+  } else {
+return text;
+  }
+
+}
+
+/**
+ * A helper function that truncates the {@link byte[]} input
+ * based on {@link #MAX_BYTES_RECORDED} and increments
+ * the last codepoint by 1.
+ * @param text
+ * @return truncated Text value
+ */
+private static Text truncateUpperBound(final byte[] text) {
+  if(text.length > MAX_BYTES_RECORDED) {
+final Text truncated = truncateLowerBound(text);
+final byte[] data = truncated.getBytes();
+
+int lastCharPosition = data.length - 1;
+int offset = 0;
+
+/* we don't expect characters more than 5 bytes */
+for (int i = 0; i < 5; i++) {
+  final byte b = data[lastCharPosition];
+  offset = getCharLength(b);
+
+  /* found beginning of a valid char */
+  if (offset > 0) {
+final byte[] lastCharBytes = Arrays
+.copyOfRange(text, lastCharPosition, lastCharPosition + 
offset);
+/* last character */
+final String s = new String(lastCharBytes, 
Charset.forName("UTF-8"));
+
+/* increment the codepoint of last character */
+int codePoint = s.codePointAt(s.length() - 1);
+codePoint++;
+final char[] incrementedChars = Character.toChars(codePoint);
+
+/* convert char array to byte array */
+final CharBuffer charBuffer = 
CharBuffer.wrap(incrementedChars);
+final ByteBuffer byteBuffer = 
Charset.forName("UTF-8").encode(charBuffer);
+final byte[] bytes = Arrays.copyOfRange(byteBuffer.array(), 
byteBuffer.position(),
+byteBuffer.limit());
+
+final byte[] result = new byte[lastCharPosition + 
bytes.length];
+
+/* copy truncated array minus last char */
+System.arraycopy(text, 0, result, 0, lastCharPosition);
+/* copy last char */
+System.arraycopy(bytes, 0, result, lastCharPosition, 
bytes.length);
+
+return new Text(result);
+
+  } /* not found keep looking for a beginning byte */ else {
+--lastCharPosition;
+  }
+
+}
+/* beginning of a valid char not found */
+throw new IllegalArgumentException(
+"Could not truncate string, beginning of a valid char not 
found");
+  } else {
+return new Text(text);
+  }
+}
+
+private static Text truncateLowerBound(final Text text) {
+  if(text.getBytes().length > MAX_BYTES_RECORDED) {
+return truncateLowerBound(text.getBytes());
+  } else {
+return text;
+  }
+}
+
+
+private static Text truncateLowerBound(final byte[] text) {
--- End diff --

You need to pass the length in here as well. The byte array may be longer 
than the length of the data.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208402722
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -683,6 +787,150 @@ public int hashCode() {
   result = 31 * result + (int) (sum ^ (sum >>> 32));
   return result;
 }
+
+/**
+ * A helper function that truncates the {@link Text} input
+ * based on {@link #MAX_BYTES_RECORDED} and increments
+ * the last codepoint by 1.
+ * @param text
+ * @return truncated Text value
+ */
+private static Text truncateUpperBound(final Text text) {
+
+  if(text.getBytes().length > MAX_BYTES_RECORDED) {
+return truncateUpperBound(text.getBytes());
+  } else {
+return text;
+  }
+
+}
+
+/**
+ * A helper function that truncates the {@link byte[]} input
+ * based on {@link #MAX_BYTES_RECORDED} and increments
+ * the last codepoint by 1.
+ * @param text
+ * @return truncated Text value
+ */
+private static Text truncateUpperBound(final byte[] text) {
+  if(text.length > MAX_BYTES_RECORDED) {
+final Text truncated = truncateLowerBound(text);
+final byte[] data = truncated.getBytes();
+
+int lastCharPosition = data.length - 1;
+int offset = 0;
+
+/* we don't expect characters more than 5 bytes */
+for (int i = 0; i < 5; i++) {
+  final byte b = data[lastCharPosition];
+  offset = getCharLength(b);
+
+  /* found beginning of a valid char */
+  if (offset > 0) {
+final byte[] lastCharBytes = Arrays
+.copyOfRange(text, lastCharPosition, lastCharPosition + 
offset);
+/* last character */
+final String s = new String(lastCharBytes, 
Charset.forName("UTF-8"));
+
+/* increment the codepoint of last character */
+int codePoint = s.codePointAt(s.length() - 1);
+codePoint++;
+final char[] incrementedChars = Character.toChars(codePoint);
+
+/* convert char array to byte array */
+final CharBuffer charBuffer = 
CharBuffer.wrap(incrementedChars);
+final ByteBuffer byteBuffer = 
Charset.forName("UTF-8").encode(charBuffer);
+final byte[] bytes = Arrays.copyOfRange(byteBuffer.array(), 
byteBuffer.position(),
+byteBuffer.limit());
+
+final byte[] result = new byte[lastCharPosition + 
bytes.length];
+
+/* copy truncated array minus last char */
+System.arraycopy(text, 0, result, 0, lastCharPosition);
+/* copy last char */
+System.arraycopy(bytes, 0, result, lastCharPosition, 
bytes.length);
+
+return new Text(result);
--- End diff --

A better pattern is:

Text result = new Text();
result.setCapacity(lastCharPosition + bytes.length);
result.set(text, 0, lastCharPosition);
result.append(bytes, 0, bytes.length);
return result;


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208398251
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -584,16 +630,40 @@ public void merge(ColumnStatisticsImpl other) {
   if (str.minimum != null) {
 maximum = new Text(str.getMaximum());
 minimum = new Text(str.getMinimum());
-  } else {
+  }
+  /* str.minimum == null when lower bound set */
+  else if (str.isLowerBoundSet) {
+minimum = new Text(str.getLowerBound());
+isLowerBoundSet = true;
+
+/* check for upper bound before setting max */
+if (str.isUpperBoundSet) {
+  maximum = new Text(str.getUpperBound());
+  isUpperBoundSet = true;
+} else {
+  maximum = new Text(str.getMaximum());
+}
+  }
+  else {
   /* both are empty */
 maximum = minimum = null;
   }
 } else if (str.minimum != null) {
   if (minimum.compareTo(str.minimum) > 0) {
-minimum = new Text(str.getMinimum());
+if(str.isLowerBoundSet) {
+  minimum = new Text(str.getLowerBound());
+  isLowerBoundSet = true;
+} else {
+  minimum = new Text(str.getMinimum());
+}
   }
   if (maximum.compareTo(str.maximum) < 0) {
-maximum = new Text(str.getMaximum());
+if(str.isUpperBoundSet) {
--- End diff --

The same simplification would hold here.
maximum = new Text(str.maximum)
isUpperBoundSet = str.isUpperBoundSet


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208392941
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -543,35 +551,73 @@ public void reset() {
   super.reset();
   minimum = null;
   maximum = null;
+  isLowerBoundSet = false;
+  isUpperBoundSet = false;
   sum = 0;
 }
 
 @Override
 public void updateString(Text value) {
   if (minimum == null) {
-maximum = minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+ minimum = truncateLowerBound(value);
+ maximum = truncateUpperBound(value);
+ isLowerBoundSet = true;
+ isUpperBoundSet = true;
+} else {
+  maximum = minimum = new Text(value);
+}
   } else if (minimum.compareTo(value) > 0) {
-minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(value);
+  isLowerBoundSet = true;
+}else {
+  minimum = new Text(value);
+}
   } else if (maximum.compareTo(value) < 0) {
-maximum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  maximum = truncateUpperBound(value);
+  isUpperBoundSet = true;
+} else {
+  maximum = new Text(value);
--- End diff --

This also needs to clear isUpperBoundSet.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208397080
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -584,16 +630,40 @@ public void merge(ColumnStatisticsImpl other) {
   if (str.minimum != null) {
 maximum = new Text(str.getMaximum());
 minimum = new Text(str.getMinimum());
-  } else {
+  }
--- End diff --

Ok, you should keep the "else" on the same line as the "}", but I don't 
think you need this new branch.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208392060
  
--- Diff: java/bench/README.md ---
@@ -1,3 +1,20 @@
+
--- End diff --

I don't need this, even when I compile with ANALYZE_JAVA.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208401986
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -683,6 +787,150 @@ public int hashCode() {
   result = 31 * result + (int) (sum ^ (sum >>> 32));
   return result;
 }
+
+/**
+ * A helper function that truncates the {@link Text} input
+ * based on {@link #MAX_BYTES_RECORDED} and increments
+ * the last codepoint by 1.
+ * @param text
+ * @return truncated Text value
+ */
+private static Text truncateUpperBound(final Text text) {
+
+  if(text.getBytes().length > MAX_BYTES_RECORDED) {
+return truncateUpperBound(text.getBytes());
+  } else {
+return text;
+  }
+
+}
+
+/**
+ * A helper function that truncates the {@link byte[]} input
+ * based on {@link #MAX_BYTES_RECORDED} and increments
+ * the last codepoint by 1.
+ * @param text
+ * @return truncated Text value
+ */
+private static Text truncateUpperBound(final byte[] text) {
+  if(text.length > MAX_BYTES_RECORDED) {
+final Text truncated = truncateLowerBound(text);
+final byte[] data = truncated.getBytes();
+
+int lastCharPosition = data.length - 1;
+int offset = 0;
+
+/* we don't expect characters more than 5 bytes */
+for (int i = 0; i < 5; i++) {
+  final byte b = data[lastCharPosition];
+  offset = getCharLength(b);
+
+  /* found beginning of a valid char */
+  if (offset > 0) {
+final byte[] lastCharBytes = Arrays
+.copyOfRange(text, lastCharPosition, lastCharPosition + 
offset);
+/* last character */
+final String s = new String(lastCharBytes, 
Charset.forName("UTF-8"));
+
+/* increment the codepoint of last character */
+int codePoint = s.codePointAt(s.length() - 1);
+codePoint++;
+final char[] incrementedChars = Character.toChars(codePoint);
+
+/* convert char array to byte array */
+final CharBuffer charBuffer = 
CharBuffer.wrap(incrementedChars);
+final ByteBuffer byteBuffer = 
Charset.forName("UTF-8").encode(charBuffer);
+final byte[] bytes = Arrays.copyOfRange(byteBuffer.array(), 
byteBuffer.position(),
+byteBuffer.limit());
+
+final byte[] result = new byte[lastCharPosition + 
bytes.length];
+
+/* copy truncated array minus last char */
+System.arraycopy(text, 0, result, 0, lastCharPosition);
+/* copy last char */
+System.arraycopy(bytes, 0, result, lastCharPosition, 
bytes.length);
+
+return new Text(result);
+
+  } /* not found keep looking for a beginning byte */ else {
+--lastCharPosition;
+  }
+
+}
+/* beginning of a valid char not found */
+throw new IllegalArgumentException(
+"Could not truncate string, beginning of a valid char not 
found");
+  } else {
+return new Text(text);
+  }
+}
+
+private static Text truncateLowerBound(final Text text) {
+  if(text.getBytes().length > MAX_BYTES_RECORDED) {
+return truncateLowerBound(text.getBytes());
+  } else {
+return text;
+  }
+}
+
+
+private static Text truncateLowerBound(final byte[] text) {
+
+  if(text.length > MAX_BYTES_RECORDED) {
+
+int truncateLen = MAX_BYTES_RECORDED;
+int offset = 0;
+
+for(int i=0; i<5; i++) {
+
+  byte b = text[truncateLen];
+  /* check for the beginning of 1,2,3,4,5 bytes long char */
+  offset = getCharLength(b);
+
+  /* found beginning of a valid char */
+  if(offset > 0) {
+byte[] truncated = Arrays.copyOfRange(text, 0, (truncateLen));
--- End diff --

If you do:
Text result = new Text();
result.set(text, 0, truncateLen);
return result;

you'll have 1 less copy of the bytes.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208393137
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -543,35 +551,73 @@ public void reset() {
   super.reset();
   minimum = null;
   maximum = null;
+  isLowerBoundSet = false;
+  isUpperBoundSet = false;
   sum = 0;
 }
 
 @Override
 public void updateString(Text value) {
   if (minimum == null) {
-maximum = minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+ minimum = truncateLowerBound(value);
+ maximum = truncateUpperBound(value);
+ isLowerBoundSet = true;
+ isUpperBoundSet = true;
+} else {
+  maximum = minimum = new Text(value);
+}
   } else if (minimum.compareTo(value) > 0) {
-minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(value);
+  isLowerBoundSet = true;
+}else {
+  minimum = new Text(value);
+}
   } else if (maximum.compareTo(value) < 0) {
-maximum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  maximum = truncateUpperBound(value);
+  isUpperBoundSet = true;
+} else {
+  maximum = new Text(value);
+}
   }
   sum += value.getLength();
 }
 
+
 @Override
 public void updateString(byte[] bytes, int offset, int length,
  int repetitions) {
+  byte[] input = Arrays.copyOfRange(bytes, offset, offset+(length));
--- End diff --

Please don't copy the array unless it is needed.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208392855
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -543,35 +551,73 @@ public void reset() {
   super.reset();
   minimum = null;
   maximum = null;
+  isLowerBoundSet = false;
+  isUpperBoundSet = false;
   sum = 0;
 }
 
 @Override
 public void updateString(Text value) {
   if (minimum == null) {
-maximum = minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+ minimum = truncateLowerBound(value);
+ maximum = truncateUpperBound(value);
+ isLowerBoundSet = true;
+ isUpperBoundSet = true;
+} else {
+  maximum = minimum = new Text(value);
+}
   } else if (minimum.compareTo(value) > 0) {
-minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(value);
+  isLowerBoundSet = true;
+}else {
+  minimum = new Text(value);
--- End diff --

This needs to set isLowerBoundSet = false.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208398373
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -683,6 +787,150 @@ public int hashCode() {
   result = 31 * result + (int) (sum ^ (sum >>> 32));
   return result;
 }
+
+/**
+ * A helper function that truncates the {@link Text} input
+ * based on {@link #MAX_BYTES_RECORDED} and increments
+ * the last codepoint by 1.
+ * @param text
+ * @return truncated Text value
+ */
+private static Text truncateUpperBound(final Text text) {
+
+  if(text.getBytes().length > MAX_BYTES_RECORDED) {
+return truncateUpperBound(text.getBytes());
+  } else {
+return text;
+  }
+
+}
+
+/**
+ * A helper function that truncates the {@link byte[]} input
+ * based on {@link #MAX_BYTES_RECORDED} and increments
+ * the last codepoint by 1.
+ * @param text
+ * @return truncated Text value
+ */
+private static Text truncateUpperBound(final byte[] text) {
--- End diff --

You need the length passed in here.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208395531
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -584,16 +630,40 @@ public void merge(ColumnStatisticsImpl other) {
   if (str.minimum != null) {
 maximum = new Text(str.getMaximum());
--- End diff --

You'll need to copy the booleans from str also.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208393268
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -543,35 +551,73 @@ public void reset() {
   super.reset();
   minimum = null;
   maximum = null;
+  isLowerBoundSet = false;
+  isUpperBoundSet = false;
   sum = 0;
 }
 
 @Override
 public void updateString(Text value) {
   if (minimum == null) {
-maximum = minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+ minimum = truncateLowerBound(value);
+ maximum = truncateUpperBound(value);
+ isLowerBoundSet = true;
+ isUpperBoundSet = true;
+} else {
+  maximum = minimum = new Text(value);
+}
   } else if (minimum.compareTo(value) > 0) {
-minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(value);
+  isLowerBoundSet = true;
+}else {
+  minimum = new Text(value);
+}
   } else if (maximum.compareTo(value) < 0) {
-maximum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  maximum = truncateUpperBound(value);
+  isUpperBoundSet = true;
+} else {
+  maximum = new Text(value);
+}
   }
   sum += value.getLength();
 }
 
+
 @Override
 public void updateString(byte[] bytes, int offset, int length,
  int repetitions) {
+  byte[] input = Arrays.copyOfRange(bytes, offset, offset+(length));
   if (minimum == null) {
-maximum = minimum = new Text();
-maximum.set(bytes, offset, length);
+if(length > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(input);
+  maximum = truncateUpperBound(input);
+  isLowerBoundSet = true;
+  isUpperBoundSet = true;
+} else {
+  maximum = minimum = new Text();
+  maximum.set(bytes, offset, length);
+}
   } else if (WritableComparator.compareBytes(minimum.getBytes(), 0,
   minimum.getLength(), bytes, offset, length) > 0) {
-minimum = new Text();
-minimum.set(bytes, offset, length);
+if(length > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(input);
+  isLowerBoundSet = true;
+} else {
+  minimum = new Text();
+  minimum.set(bytes, offset, length);
--- End diff --

Clear  isLowerBoundSet.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208392450
  
--- Diff: java/core/src/java/org/apache/orc/StringColumnStatistics.java ---
@@ -33,6 +33,20 @@
*/
   String getMaximum();
 
+  /**
+   * Get the string with
+   * length = Min(StringStatisticsImpl.MAX_BYTES_RECORDED, getMinimum())
--- End diff --

I would phrase this as "Get the lower bound of the values in this column. 
The value may be truncated to at most MAX_BYTES_RECORDED."


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208392508
  
--- Diff: java/core/src/java/org/apache/orc/StringColumnStatistics.java ---
@@ -33,6 +33,20 @@
*/
   String getMaximum();
 
+  /**
+   * Get the string with
+   * length = Min(StringStatisticsImpl.MAX_BYTES_RECORDED, getMinimum())
+   * @return lower bound
+   */
+  String getLowerBound();
+
+  /**
+   * Get the string with
+   * length = Min(StringStatisticsImpl.MAX_BYTES_RECORDED, getMaximum())
--- End diff --

Same here.


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208398032
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -584,16 +630,40 @@ public void merge(ColumnStatisticsImpl other) {
   if (str.minimum != null) {
 maximum = new Text(str.getMaximum());
 minimum = new Text(str.getMinimum());
-  } else {
+  }
+  /* str.minimum == null when lower bound set */
+  else if (str.isLowerBoundSet) {
+minimum = new Text(str.getLowerBound());
+isLowerBoundSet = true;
+
+/* check for upper bound before setting max */
+if (str.isUpperBoundSet) {
+  maximum = new Text(str.getUpperBound());
+  isUpperBoundSet = true;
+} else {
+  maximum = new Text(str.getMaximum());
+}
+  }
+  else {
   /* both are empty */
 maximum = minimum = null;
   }
 } else if (str.minimum != null) {
   if (minimum.compareTo(str.minimum) > 0) {
-minimum = new Text(str.getMinimum());
+if(str.isLowerBoundSet) {
+  minimum = new Text(str.getLowerBound());
+  isLowerBoundSet = true;
+} else {
+  minimum = new Text(str.getMinimum());
--- End diff --

You could simplify this as:
minimum = new Text(str.minimum)
isLowerBoundSet = str.isLowerBoundSet;


---


[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...

2018-08-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/299#discussion_r208393342
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -543,35 +551,73 @@ public void reset() {
   super.reset();
   minimum = null;
   maximum = null;
+  isLowerBoundSet = false;
+  isUpperBoundSet = false;
   sum = 0;
 }
 
 @Override
 public void updateString(Text value) {
   if (minimum == null) {
-maximum = minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+ minimum = truncateLowerBound(value);
+ maximum = truncateUpperBound(value);
+ isLowerBoundSet = true;
+ isUpperBoundSet = true;
+} else {
+  maximum = minimum = new Text(value);
+}
   } else if (minimum.compareTo(value) > 0) {
-minimum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(value);
+  isLowerBoundSet = true;
+}else {
+  minimum = new Text(value);
+}
   } else if (maximum.compareTo(value) < 0) {
-maximum = new Text(value);
+if(value.getLength() > MAX_BYTES_RECORDED) {
+  maximum = truncateUpperBound(value);
+  isUpperBoundSet = true;
+} else {
+  maximum = new Text(value);
+}
   }
   sum += value.getLength();
 }
 
+
 @Override
 public void updateString(byte[] bytes, int offset, int length,
  int repetitions) {
+  byte[] input = Arrays.copyOfRange(bytes, offset, offset+(length));
   if (minimum == null) {
-maximum = minimum = new Text();
-maximum.set(bytes, offset, length);
+if(length > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(input);
+  maximum = truncateUpperBound(input);
+  isLowerBoundSet = true;
+  isUpperBoundSet = true;
+} else {
+  maximum = minimum = new Text();
+  maximum.set(bytes, offset, length);
+}
   } else if (WritableComparator.compareBytes(minimum.getBytes(), 0,
   minimum.getLength(), bytes, offset, length) > 0) {
-minimum = new Text();
-minimum.set(bytes, offset, length);
+if(length > MAX_BYTES_RECORDED) {
+  minimum = truncateLowerBound(input);
+  isLowerBoundSet = true;
+} else {
+  minimum = new Text();
+  minimum.set(bytes, offset, length);
+}
   } else if (WritableComparator.compareBytes(maximum.getBytes(), 0,
   maximum.getLength(), bytes, offset, length) < 0) {
-maximum = new Text();
-maximum.set(bytes, offset, length);
+if(length > MAX_BYTES_RECORDED) {
+  maximum = truncateUpperBound(input);
+  isUpperBoundSet = true;
+} else {
+  maximum = new Text();
+  maximum.set(bytes, offset, length);
--- End diff --

Clear isUpperBoundSet.


---


[GitHub] orc issue #292: ORC-203 - Update StringStatistics to trim long strings to 10...

2018-08-06 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/292
  
I'm sorry, I accidentally pushed this while it is still in review. I've 
reverted the change, but please reopen this PR.


---


[GitHub] orc pull request #298: ORC-393. Add ORC snapcraft definition.

2018-08-06 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/298

ORC-393. Add ORC snapcraft definition.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-393

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/298.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #298


commit 9c82a14d8dd276fe4e3afcc87809c8ee9e97f843
Author: Owen O'Malley 
Date:   2018-08-03T19:53:37Z

ORC-393. Add ORC snapcraft definition.




---


[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...

2018-07-25 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/292#discussion_r205179930
  
--- Diff: java/bench/README.md ---
@@ -1,3 +1,20 @@
+
--- End diff --

Let me check why this got flagged. We should have an exclusion for *.md 
files.


---


[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...

2018-07-25 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/292#discussion_r205179439
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -621,12 +703,54 @@ public void merge(ColumnStatisticsImpl other) {
 
 @Override
 public String getMinimum() {
-  return minimum == null ? null : minimum.toString();
+  /* if we have lower bound set (in case of truncation)
+  getMinimum will be null */
+  if(isLowerBoundSet) {
+return null;
+  } else {
+return minimum == null ? null : minimum.toString();
+  }
 }
 
 @Override
 public String getMaximum() {
-  return maximum == null ? null : maximum.toString();
+  /* if we have upper bound is set (in case of truncation)
+  getMaximum will be null */
+  if(isUpperBoundSet) {
+return null;
+  } else {
+return maximum == null ? null : maximum.toString();
+  }
+}
+
+/**
+ * Get the string with
+ * length = Min(StringStatisticsImpl.MAX_STRING_LENGTH_RECORDED, 
getMinimum())
+ *
+ * @return lower bound
+ */
+@Override
+public String getLowerBound() {
+  if(isLowerBoundSet) {
+return minimum.toString();
+  } else {
+return null;
+  }
+}
+
+/**
+ * Get the string with
+ * length = Min(StringStatisticsImpl.MAX_STRING_LENGTH_RECORDED, 
getMaximum())
+ *
+ * @return upper bound
+ */
+@Override
+public String getUpperBound() {
+  if(isUpperBoundSet) {
+return maximum.toString();
+  } else {
+return null;
--- End diff --

This should always return maximum.


---


[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...

2018-07-25 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/292#discussion_r205180384
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -517,10 +519,14 @@ public int hashCode() {
 
   protected static final class StringStatisticsImpl extends 
ColumnStatisticsImpl
   implements StringColumnStatistics {
+public static final int MAX_STRING_LENGTH_RECORDED = 1024;
--- End diff --

We should probably use the number of bytes instead of the number of 
characters. I'd propose that we always truncate a whole number of characters, 
with a maximum of 1024 bytes.

For ASCII strings, it will have the same result, but larger characters may 
end up at 1021 to 1024 bytes long.


---


[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...

2018-07-25 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/292#discussion_r205180578
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -543,35 +549,87 @@ public void reset() {
   super.reset();
   minimum = null;
   maximum = null;
+  isLowerBoundSet = false;
+  isUpperBoundSet = false;
   sum = 0;
 }
 
 @Override
 public void updateString(Text value) {
   if (minimum == null) {
-maximum = minimum = new Text(value);
+if(value.toString().length() > MAX_STRING_LENGTH_RECORDED) {
+ minimum = new Text(
--- End diff --

Let's make a corresponding truncateLowerBound so that we can have one place 
to change.


---


[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...

2018-07-25 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/292#discussion_r205179355
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -621,12 +703,54 @@ public void merge(ColumnStatisticsImpl other) {
 
 @Override
 public String getMinimum() {
-  return minimum == null ? null : minimum.toString();
+  /* if we have lower bound set (in case of truncation)
+  getMinimum will be null */
+  if(isLowerBoundSet) {
+return null;
+  } else {
+return minimum == null ? null : minimum.toString();
+  }
 }
 
 @Override
 public String getMaximum() {
-  return maximum == null ? null : maximum.toString();
+  /* if we have upper bound is set (in case of truncation)
+  getMaximum will be null */
+  if(isUpperBoundSet) {
+return null;
+  } else {
+return maximum == null ? null : maximum.toString();
+  }
+}
+
+/**
+ * Get the string with
+ * length = Min(StringStatisticsImpl.MAX_STRING_LENGTH_RECORDED, 
getMinimum())
+ *
+ * @return lower bound
+ */
+@Override
+public String getLowerBound() {
+  if(isLowerBoundSet) {
+return minimum.toString();
+  } else {
+return null;
--- End diff --

This should always return minimum.


---


[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...

2018-07-25 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/292#discussion_r205181230
  
--- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
---
@@ -543,35 +549,87 @@ public void reset() {
   super.reset();
   minimum = null;
   maximum = null;
+  isLowerBoundSet = false;
+  isUpperBoundSet = false;
   sum = 0;
 }
 
 @Override
 public void updateString(Text value) {
   if (minimum == null) {
-maximum = minimum = new Text(value);
+if(value.toString().length() > MAX_STRING_LENGTH_RECORDED) {
+ minimum = new Text(
--- End diff --

We can even use the structure of UTF-8 to do truncateLowerBound without 
forming a string. In particular you want to search from 1024 to find the first 
byte of a character (0xxx  or 11xx ) and truncate just before it.


---


[GitHub] orc pull request #291: ORC-385. Change RecordReader to extend Closeable.

2018-07-13 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/291

ORC-385. Change RecordReader to extend Closeable.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-385

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/291.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #291


commit 903a6d6049f918566938f9621fcfadb5f146a2ac
Author: Owen O'Malley 
Date:   2018-07-13T21:55:45Z

ORC-385. Change RecordReader to extend Closeable.

Signed-off-by: Owen O'Malley 




---


[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...

2018-07-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/278#discussion_r202473870
  
--- Diff: java/core/src/java/org/apache/orc/impl/InStream.java ---
@@ -401,84 +596,149 @@ private String rangeString() {
 @Override
 public String toString() {
   return "compressed stream " + name + " position: " + currentOffset +
-  " length: " + length + " range: " + currentRange +
-  " offset: " + (compressed == null ? 0 : compressed.position()) + 
" limit: " + (compressed == null ? 0 : compressed.limit()) +
+  " length: " + length + " range: " + getRangeNumber(bytes, 
currentRange) +
+  " offset: " + (compressed == null ? 0 : compressed.position()) +
+  " limit: " + (compressed == null ? 0 : compressed.limit()) +
   rangeString() +
   (uncompressed == null ? "" :
   " uncompressed: " + uncompressed.position() + " to " +
   uncompressed.limit());
 }
   }
 
+  private static class EncryptedCompressedStream extends CompressedStream {
+private final EncryptionState encrypt;
+
+public EncryptedCompressedStream(String name,
+ DiskRangeList input,
+ long length,
+ StreamOptions options) {
+  super(name, length, options);
+  encrypt = new EncryptionState(name, options);
+  reset(input, length);
+}
+
+@Override
+protected void setCurrent(DiskRangeList newRange, boolean isJump) {
+  currentRange = newRange;
+  if (newRange != null) {
+if (isJump) {
+  encrypt.changeIv(newRange.getOffset());
+}
+compressed = encrypt.decrypt(newRange);
+compressed.position((int) (currentOffset - newRange.getOffset()));
+  }
+}
+
+@Override
+public void close() {
+  super.close();
+  encrypt.close();
+}
+
+@Override
+public String toString() {
+  return "encrypted " + super.toString();
+}
+  }
+
   public abstract void seek(PositionProvider index) throws IOException;
 
+  public static class StreamOptions implements Cloneable {
+private CompressionCodec codec;
+private int bufferSize;
+private EncryptionAlgorithm algorithm;
+private Key key;
+private byte[] iv;
+
+public StreamOptions withCodec(CompressionCodec value) {
+  this.codec = value;
+  return this;
+}
+
+public StreamOptions withBufferSize(int value) {
+  bufferSize = value;
+  return this;
+}
+
+public StreamOptions withEncryption(EncryptionAlgorithm algorithm,
+Key key,
+byte[] iv) {
+  this.algorithm = algorithm;
+  this.key = key;
+  this.iv = iv;
+  return this;
+}
+
+public CompressionCodec getCodec() {
+  return codec;
+}
+
+@Override
+public StreamOptions clone() {
+  try {
+StreamOptions clone = (StreamOptions) super.clone();
+if (clone.codec != null) {
+  // Make sure we don't share the same codec between two readers.
+  clone.codec = OrcCodecPool.getCodec(codec.getKind());
+}
+return clone;
--- End diff --

*sigh* It is Cloneable, because it is used as a field in DefaultDataReader. 
The clone is correct, I believe. We really need a better solution for 
CompressionCodec than the current Cloneable and CompressionCodecPool mess. We 
need to remove the compression parameters out of Codec and then they are 
completely stateless again and can be trivially pooled.


---


[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...

2018-07-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/278#discussion_r202471794
  
--- Diff: java/core/src/java/org/apache/orc/impl/writer/StreamOptions.java 
---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl.writer;
+
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.EncryptionAlgorithm;
+
+import java.security.Key;
+
+/**
+ * The compression and encryption options for writing a stream.
+ */
+public class StreamOptions {
--- End diff --

It is org.apache.orc.impl.InStream.StreamOptions and 
org.apache.orc.impl.writer.StreamOptions. I guess they should be named the same 
way, but they are separate classes.


---


[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...

2018-07-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/278#discussion_r202470866
  
--- Diff: java/core/src/java/org/apache/orc/impl/InStream.java ---
@@ -173,33 +203,208 @@ private static ByteBuffer allocateBuffer(int size, 
boolean isDirect) {
 }
   }
 
+  /**
+   * Manage the state of the decryption, including the ability to seek.
+   */
+  static class EncryptionState {
+private final String name;
+private final EncryptionAlgorithm algorithm;
+private final Key key;
+private final byte[] iv;
+private final Cipher cipher;
+private ByteBuffer decrypted;
+
+EncryptionState(String name, StreamOptions options) {
+  this.name = name;
+  algorithm = options.algorithm;
+  key = options.key;
+  iv = options.iv;
+  cipher = algorithm.createCipher();
+}
+
+/**
+ * We are seeking to a new range, so update the cipher to change the IV
+ * to match. This code assumes that we only support encryption in CTR 
mode.
+ * @param offset where we are seeking to in the stream
+ */
+void changeIv(long offset) {
+  int blockSize = cipher.getBlockSize();
+  long encryptionBlocks = offset / blockSize;
--- End diff --

No, we always start decryption on a block boundary and throw away the 
leading bytes. See InStream.EncryptionState.changeIv(long offset) and the 
"extra" and "wasted" variables.


---


[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...

2018-07-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/278#discussion_r202433617
  
--- Diff: java/core/src/java/org/apache/orc/impl/CryptoUtils.java ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.orc.EncryptionAlgorithm;
+import java.security.SecureRandom;
+
+/**
+ * This class has routines to work with encryption within ORC files.
+ */
+public class CryptoUtils {
+
+  private static final int COLUMN_ID_LENGTH = 3;
+  private static final int KIND_LENGTH = 2;
+  private static final int STRIPE_ID_LENGTH = 3;
+  private static final int MIN_COUNT_BYTES = 8;
+
+  private static final SecureRandom random = new SecureRandom();
--- End diff --

oops. Removed.


---


[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...

2018-07-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/278#discussion_r202433445
  
--- Diff: java/core/src/java/org/apache/orc/impl/InStream.java ---
@@ -55,112 +60,137 @@ public long getStreamLength() {
   @Override
   public abstract void close();
 
+  static int getRangeNumber(DiskRangeList list, DiskRangeList current) {
+int result = 0;
+DiskRangeList range = list;
+while (range != null && range != current) {
+  result += 1;
+  range = range.next;
+}
+return result;
+  }
+
+  /**
+   * Implements a stream over an uncompressed stream.
+   */
   public static class UncompressedStream extends InStream {
-private List bytes;
+private DiskRangeList bytes;
 private long length;
 protected long currentOffset;
-private ByteBuffer range;
-private int currentRange;
+protected ByteBuffer decrypted;
+protected DiskRangeList currentRange;
+
+/**
+ * Create the stream without calling reset on it.
+ * This is used for the subclass that needs to do more setup.
+ * @param name name of the stream
+ * @param length the number of bytes for the stream
+ */
+public UncompressedStream(String name, long length) {
+  super(name, length);
+}
 
-public UncompressedStream(String name, List input, long 
length) {
+public UncompressedStream(String name,
+  DiskRangeList input,
+  long length) {
   super(name, length);
   reset(input, length);
 }
 
-protected void reset(List input, long length) {
+protected void reset(DiskRangeList input, long length) {
   this.bytes = input;
   this.length = length;
-  currentRange = 0;
-  currentOffset = 0;
-  range = null;
+  currentOffset = input == null ? 0 : input.getOffset();
+  setCurrent(input, true);
 }
 
 @Override
 public int read() {
-  if (range == null || range.remaining() == 0) {
+  if (decrypted == null || decrypted.remaining() == 0) {
 if (currentOffset == length) {
   return -1;
 }
-seek(currentOffset);
+setCurrent(currentRange.next, false);
   }
   currentOffset += 1;
-  return 0xff & range.get();
+  return 0xff & decrypted.get();
+}
+
+protected void setCurrent(DiskRangeList newRange, boolean isJump) {
+  currentRange = newRange;
+  if (newRange != null) {
+decrypted = newRange.getData().slice();
+decrypted.position((int) (currentOffset - newRange.getOffset()));
+  }
 }
 
 @Override
 public int read(byte[] data, int offset, int length) {
-  if (range == null || range.remaining() == 0) {
+  if (decrypted == null || decrypted.remaining() == 0) {
 if (currentOffset == this.length) {
   return -1;
 }
-seek(currentOffset);
+setCurrent(currentRange.next, false);
   }
-  int actualLength = Math.min(length, range.remaining());
-  range.get(data, offset, actualLength);
+  int actualLength = Math.min(length, decrypted.remaining());
+  decrypted.get(data, offset, actualLength);
   currentOffset += actualLength;
   return actualLength;
 }
 
 @Override
 public int available() {
-  if (range != null && range.remaining() > 0) {
-return range.remaining();
+  if (decrypted != null && decrypted.remaining() > 0) {
+return decrypted.remaining();
   }
   return (int) (length - currentOffset);
 }
 
 @Override
 public void close() {
-  currentRange = bytes.size();
+  currentRange = null;
   currentOffset = length;
   // explicit de-ref of bytes[]
-  bytes.clear();
+  decrypted = null;
+  bytes = null;
 }
 
 @Override
 public void seek(PositionProvider index) throws IOException {
   seek(index.getNext());
 }
 
-public void seek(long desired) {
-  if (desired == 0 && bytes.isEmpty()) {
+public void seek(long desired) throws IOException {
+  if (desired == 0 && bytes == null) {
 return;
   }
-  int i = 0;
-  for (DiskRange curRange : bytes) {
-if (curRange.getOffset() <= desired &&
-(desired - curRange.getOffset()) < curRange.getLength()) {
-  currentOffset = desired;
-  curren

[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...

2018-07-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/278#discussion_r202431172
  
--- Diff: java/core/src/java/org/apache/orc/impl/InStream.java ---
@@ -55,112 +60,137 @@ public long getStreamLength() {
   @Override
   public abstract void close();
 
+  static int getRangeNumber(DiskRangeList list, DiskRangeList current) {
+int result = 0;
+DiskRangeList range = list;
+while (range != null && range != current) {
+  result += 1;
+  range = range.next;
+}
+return result;
+  }
+
+  /**
+   * Implements a stream over an uncompressed stream.
+   */
   public static class UncompressedStream extends InStream {
-private List bytes;
+private DiskRangeList bytes;
 private long length;
 protected long currentOffset;
-private ByteBuffer range;
-private int currentRange;
+protected ByteBuffer decrypted;
+protected DiskRangeList currentRange;
+
+/**
+ * Create the stream without calling reset on it.
+ * This is used for the subclass that needs to do more setup.
+ * @param name name of the stream
+ * @param length the number of bytes for the stream
+ */
+public UncompressedStream(String name, long length) {
+  super(name, length);
+}
 
-public UncompressedStream(String name, List input, long 
length) {
+public UncompressedStream(String name,
+  DiskRangeList input,
+  long length) {
   super(name, length);
   reset(input, length);
 }
 
-protected void reset(List input, long length) {
+protected void reset(DiskRangeList input, long length) {
   this.bytes = input;
   this.length = length;
-  currentRange = 0;
-  currentOffset = 0;
-  range = null;
+  currentOffset = input == null ? 0 : input.getOffset();
+  setCurrent(input, true);
 }
 
 @Override
 public int read() {
-  if (range == null || range.remaining() == 0) {
+  if (decrypted == null || decrypted.remaining() == 0) {
 if (currentOffset == length) {
   return -1;
 }
-seek(currentOffset);
+setCurrent(currentRange.next, false);
   }
   currentOffset += 1;
-  return 0xff & range.get();
+  return 0xff & decrypted.get();
+}
+
+protected void setCurrent(DiskRangeList newRange, boolean isJump) {
+  currentRange = newRange;
+  if (newRange != null) {
+decrypted = newRange.getData().slice();
+decrypted.position((int) (currentOffset - newRange.getOffset()));
--- End diff --

But I'll add a comment to the code.


---


[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...

2018-07-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/278#discussion_r202430812
  
--- Diff: java/core/src/java/org/apache/orc/impl/InStream.java ---
@@ -55,112 +60,137 @@ public long getStreamLength() {
   @Override
   public abstract void close();
 
+  static int getRangeNumber(DiskRangeList list, DiskRangeList current) {
+int result = 0;
+DiskRangeList range = list;
+while (range != null && range != current) {
+  result += 1;
+  range = range.next;
+}
+return result;
+  }
+
+  /**
+   * Implements a stream over an uncompressed stream.
+   */
   public static class UncompressedStream extends InStream {
-private List bytes;
+private DiskRangeList bytes;
 private long length;
 protected long currentOffset;
-private ByteBuffer range;
-private int currentRange;
+protected ByteBuffer decrypted;
+protected DiskRangeList currentRange;
+
+/**
+ * Create the stream without calling reset on it.
+ * This is used for the subclass that needs to do more setup.
+ * @param name name of the stream
+ * @param length the number of bytes for the stream
+ */
+public UncompressedStream(String name, long length) {
+  super(name, length);
+}
 
-public UncompressedStream(String name, List input, long 
length) {
+public UncompressedStream(String name,
+  DiskRangeList input,
+  long length) {
   super(name, length);
   reset(input, length);
 }
 
-protected void reset(List input, long length) {
+protected void reset(DiskRangeList input, long length) {
   this.bytes = input;
   this.length = length;
-  currentRange = 0;
-  currentOffset = 0;
-  range = null;
+  currentOffset = input == null ? 0 : input.getOffset();
+  setCurrent(input, true);
 }
 
 @Override
 public int read() {
-  if (range == null || range.remaining() == 0) {
+  if (decrypted == null || decrypted.remaining() == 0) {
 if (currentOffset == length) {
   return -1;
 }
-seek(currentOffset);
+setCurrent(currentRange.next, false);
   }
   currentOffset += 1;
-  return 0xff & range.get();
+  return 0xff & decrypted.get();
+}
+
+protected void setCurrent(DiskRangeList newRange, boolean isJump) {
+  currentRange = newRange;
+  if (newRange != null) {
+decrypted = newRange.getData().slice();
+decrypted.position((int) (currentOffset - newRange.getOffset()));
--- End diff --

No, it is moving the ByteBuffer position to the correct spot so that it 
matches currentOffset, which is relative to the stream.


---


[GitHub] orc issue #290: ORC-386 Add spark benchmarks.

2018-07-13 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/290
  
@dongjoon-hyun if you have feedback on these benchmarks, that would be 
great.


---


[GitHub] orc pull request #290: ORC-386 Add spark benchmarks.

2018-07-11 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/290

ORC-386 Add spark benchmarks.

I also refactored all of the old benchmarks to reduce the common code. I 
also split it into three modules so that I could
separate the common code, the code that depends on hive, and the code that 
depends on spark. Avoiding building an uber
jar that has both hive and spark made life much easier.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-386

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #290


commit 1300f0ce20359e837d5e75bc49c4acc3ee5c7221
Author: Owen O'Malley 
Date:   2018-06-09T21:37:31Z

ORC-386 Add spark benchmarks.

I also refactored all of the old benchmarks to reduce the common code. I 
also split it into three modules so that I could
separate the common code, the code that depends on hive, and the code that 
depends on spark. Avoiding building an uber
jar that has both hive and spark made life much easier.




---


[GitHub] orc issue #281: ORC-375: Update libhdfs to compile with gcc7.

2018-06-06 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/281
  
I accidentally pushed the temporary commit to branch-1.5. I've done a force 
push to undo the change.


---


[GitHub] orc pull request #281: ORC-375: Update libhdfs to compile with gcc7.

2018-06-06 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/281

ORC-375: Update libhdfs to compile with gcc7.

This changes the code in the libhdfs tarball, but we need this change 
applied upstream.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-375

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/281.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #281


commit 497f8698f53148ed2cd7e7753a3d9e31e4e13389
Author: Owen O'Malley 
Date:   2018-06-06T05:17:16Z

ORC-375: Fix libhdfs on gcc7 by adding #include  two places.




---


[GitHub] orc issue #280: ORC-376: Add Ubuntu18 docker file

2018-06-06 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/280
  
I'll also comment that the default openjdk is 10, which does not work. I've 
set the docker file to use openjdk 8 instead, which does work. With openjdk 10 
maven crashes with a NullPointerException trying to compile ORC.


---


[GitHub] orc pull request #280: ORC-376: Add Ubuntu18 docker file

2018-06-06 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/280

ORC-376: Add Ubuntu18 docker file

We need to be able to test gcc 7 and ubuntu 18.04.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-376

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #280


commit 82110ce6408fe764d95a7dcaf0145f108660afe2
Author: Owen O'Malley 
Date:   2018-06-06T14:18:42Z

ORC-376: Add Ubuntu18 docker file




---


[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests

2018-06-05 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/277#discussion_r193106126
  
--- Diff: c++/test/CMakeLists.txt ---
@@ -64,7 +64,12 @@ target_link_libraries (create-test-files
   protobuf
 )
 
-add_test (NAME orc-test COMMAND orc-test)
+if (TEST_VALGRIND_MEMCHECK)
+  add_test (orc-test  
--- End diff --

You have trailing spaces here.


---


[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests

2018-06-05 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/277#discussion_r19313
  
--- Diff: c++/src/Compression.cc ---
@@ -199,13 +199,16 @@ namespace orc {
   uint64_t blockSize,
   MemoryPool& pool);
 
+~ZlibCompressionStream() { end(); }
--- End diff --

You need an override on this or the mac complains.


---


[GitHub] orc issue #279: ORC-373: Option to disable dictionary encoding

2018-06-04 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/279
  
Prasanth, rather than adding the flush count, which is really test code in 
the main source, how about creating a StringTreeWriter and catching the stream 
directly. I'd propose this - 
https://github.com/omalley/orc/blob/8a782d204f071e2bd16bcfd591a54319fadff132/java/core/src/test/org/apache/orc/TestStringDictionary.java#L251


---


[GitHub] orc issue #276: ORC-373: Option to disable dictionary encoding

2018-06-04 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/276
  
@prasanthj yeah, in general you should make a dev branch for the pull 
request (eg. prasanthj:orc-373).  Then you can do a force-push to the branch to 
update the PR.


---


[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...

2018-06-03 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/278

ORC-251: Extend InStream and OutStream to support encryption.

This patch:
* Adds a method to Codec to get the CompressionKind.
* Creates StreamOptions for both InStream and OutStream to gather together 
the parameters they need.
* Extends InStream and OutStream to handle encryption.
* Changes InStream to use DiskRangeList instead of List.
* Creates CryptoUtils with a method to create an IV based on the stream 
name.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-251

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #278


commit ff620c3faccbabc7e011e245dfb3bcdbfef41b7a
Author: Owen O'Malley 
Date:   2018-05-09T16:36:28Z

ORC-251: Extend InStream and OutStream to support encryption.




---


[GitHub] orc pull request #272: ORC-367: Fix incorrect reads of boolean columns after...

2018-05-22 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/272

ORC-367: Fix incorrect reads of boolean columns after a seek.

Fix #272

Signed-off-by: Owen O'Malley <omal...@apache.org>

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-367

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/272.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #272


commit b0504f55e4a68de0c73b89ab5bccb2a673efe9b4
Author: Owen O'Malley <omalley@...>
Date:   2018-05-22T21:43:06Z

ORC-367: Fix incorrect reads of boolean columns after a seek.

Fix #272

Signed-off-by: Owen O'Malley <omal...@apache.org>




---


[GitHub] orc issue #268: ORC-363 Enable zstd decompression in ORC Java reader

2018-05-16 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/268
  
I'd rather hold off on this until we can do an end-to-end test.


---


[GitHub] orc pull request #269: ORC-116. Add examples of reading and writing ORC file...

2018-05-15 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/269

ORC-116. Add examples of reading and writing ORC files from Java.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-116

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/269.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #269


commit e7231cd197f220877a5086b2bd516fc367081b24
Author: Owen O'Malley <omalley@...>
Date:   2016-12-06T17:34:55Z

ORC-116. Add examples of reading and writing ORC files from Java.

Signed-off-by: Owen O'Malley <omal...@apache.org>




---


[GitHub] orc pull request #267: ORC-364. Minor javadoc fixup.

2018-05-14 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/267

ORC-364. Minor javadoc fixup.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-364

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/267.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #267


commit a8565edad13700d597c239faab1992d2b6fa1303
Author: Owen O'Malley <omalley@...>
Date:   2018-05-07T18:08:11Z

Preparing for ORC 1.5.0 release.

Signed-off-by: Owen O'Malley <omal...@apache.org>

commit 249781a518faa3afc82dd84821ddb227d0badabe
Author: Owen O'Malley <omalley@...>
Date:   2018-05-14T21:00:59Z

ORC-364. Minor fixups on javadoc.

Signed-off-by: Owen O'Malley <omal...@apache.org>




---


[GitHub] orc pull request #265: ORC-334: [C++] Add AppVeyor support for integration o...

2018-05-07 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/265#discussion_r186474701
  
--- Diff: c++/src/Timezone.cc ---
@@ -710,7 +710,11 @@ namespace orc {
* Get the local timezone.
*/
   const Timezone& getLocalTimezone() {
+#ifdef _MSC_VER
+return getTimezoneByName("UTC");
--- End diff --

Because there isn't a direct windows equivalent.

Here is what Java writes about it-

https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/time-zone002.html

There is a registry setting, but it uses a non-standard set of names *doh*. 
The easiest way would be to create a Java jar and run that to determine the 
local timezone.


---


[GitHub] orc pull request #266: ORC-360. Add Java error checking on the types when re...

2018-05-07 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/266

ORC-360. Add Java error checking on the types when reading.

This patch adds a check on the types in the footer to ensure that the type 
is properly formed before we start building the tree.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-360

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/266.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #266


commit f236cf727cc295e93ae1e2347be60562fff17fa8
Author: Owen O'Malley <omalley@...>
Date:   2018-05-07T16:19:17Z

ORC-360. Add Java error checking on the types when reading.




---


[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...

2018-05-04 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/249#discussion_r186137062
  
--- Diff: java/core/src/java/org/apache/orc/impl/writer/WriterImplV2.java 
---
@@ -373,7 +379,11 @@ private void flushStripe() throws IOException {
   OrcProto.StripeFooter.Builder builder =
   OrcProto.StripeFooter.newBuilder();
   if (writeTimeZone) {
-builder.setWriterTimezone(TimeZone.getDefault().getID());
+if (useUTCTimeZone) {
+  builder.setWriterTimezone(TimeZone.getTimeZone("UTC").getID());
--- End diff --

I'd be tempted to just use setWriterTimezone("UTC"), because we'll already 
fail if UTC is called something else.


---


[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...

2018-05-04 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/249#discussion_r186133762
  
--- Diff: java/core/src/java/org/apache/orc/OrcFile.java ---
@@ -320,6 +321,16 @@ public ReaderOptions fileMetadata(final FileMetadata 
metadata) {
 public FileMetadata getFileMetadata() {
   return fileMetadata;
 }
+
+public ReaderOptions useUTCTimestamp(boolean value) {
+  useUTCTimestamp = value;
+  return this;
+}
+
+public boolean isUseUTCTimestamp() {
--- End diff --

This should be getUseUTCTimestamp.


---


[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...

2018-05-04 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/249#discussion_r186134015
  
--- Diff: java/core/src/java/org/apache/orc/OrcFile.java ---
@@ -320,6 +321,16 @@ public ReaderOptions fileMetadata(final FileMetadata 
metadata) {
 public FileMetadata getFileMetadata() {
   return fileMetadata;
 }
+
+public ReaderOptions useUTCTimestamp(boolean value) {
--- End diff --

This should also cause the TimestampStatistics to use UTC.


---


[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...

2018-05-04 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/249#discussion_r186134142
  
--- Diff: java/core/src/java/org/apache/orc/OrcFile.java ---
@@ -761,6 +782,10 @@ public boolean getWriteVariableLengthBlocks() {
 public HadoopShims getHadoopShims() {
   return shims;
 }
+
+public boolean isUseUTCTimestamp() {
--- End diff --

Rename this to getUseUTCTimestamp.


---


[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...

2018-05-04 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/249#discussion_r186140770
  
--- Diff: 
java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java ---
@@ -54,9 +57,20 @@ public TimestampTreeWriter(int columnId,
 if (rowIndexPosition != null) {
   recordPosition(rowIndexPosition);
 }
-this.localTimezone = TimeZone.getDefault();
-// for unit tests to set different time zones
-this.baseEpochSecsLocalTz = 
Timestamp.valueOf(BASE_TIMESTAMP_STRING).getTime() / MILLIS_PER_SECOND;
+if (writer.isUseUTCTimestamp()) {
+  this.localTimezone = TimeZone.getTimeZone("UTC");
+} else {
+  this.localTimezone = TimeZone.getDefault();
+}
+this.localDateFormat = new SimpleDateFormat("-MM-dd HH:mm:ss");
--- End diff --

It sucks that there isn't a simpler way in Java to do this.


---


[GitHub] orc pull request #265: ORC-334: [C++] Add AppVeyor support for integration o...

2018-05-04 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/265#discussion_r186114158
  
--- Diff: c++/test/CreateTestFiles.cc ---
@@ -31,8 +31,8 @@
 void writeCustomOrcFile(const std::string& filename,
 const orc::proto::Metadata& metadata,
 const orc::proto::Footer& footer,
-const std::vector& version,
-uint writerVersion) {
+const std::vector& version,
--- End diff --

We generally prefer to use explicitly sized types, so uint64_t or uint32_t 
would be better.


---


[GitHub] orc issue #252: ORC-345 Create and use a new Decimal64StatisticsImpl on the ...

2018-05-03 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/252
  
I should comment while adding the tests that it became clear that I didn't 
need the sign change logic, because (10**18 - 1 )* 2 < 2**63. So by enforcing 
the 10**18 limit, I didn't need the more complicated sign change detection.


---


[GitHub] orc pull request #252: ORC-345 Create and use a new Decimal64StatisticsImpl ...

2018-05-03 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/252#discussion_r185846011
  
--- Diff: java/bench/src/java/org/apache/orc/bench/DecimalBench.java ---
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.bench;
+
+import com.google.gson.JsonStreamParser;
+import org.apache.avro.file.DataFileReader;
+import org.apache.avro.generic.GenericDatumReader;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.io.DatumReader;
+import org.apache.avro.mapred.FsInput;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.Decimal64ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
+import 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.Reader;
+import org.apache.orc.RecordReader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.convert.BatchReader;
+import org.apache.orc.bench.convert.GenerateVariants;
+import org.apache.orc.bench.convert.csv.CsvReader;
+import org.apache.parquet.hadoop.ParquetInputFormat;
+import org.openjdk.jmh.annotations.AuxCounters;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Warmup;
+import org.openjdk.jmh.infra.Blackhole;
+import org.openjdk.jmh.runner.Runner;
+import org.openjdk.jmh.runner.options.OptionsBuilder;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.concurrent.TimeUnit;
+
+@BenchmarkMode(Mode.AverageTime)
+@Warmup(iterations=2, time=30, timeUnit = TimeUnit.SECONDS)
+@Measurement(iterations=10, time=30, timeUnit = TimeUnit.SECONDS)
+@State(Scope.Thread)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@Fork(2)
+public class DecimalBench {
+
+  private static final String ROOT_ENVIRONMENT_NAME = "bench.root.dir";
+  private static final Path root;
+  static {
+String value = System.getProperty(ROOT_ENVIRONMENT_NAME);
+root = value == null ? null : new Path(value);
+  }
+
+  /**
+   * Abstract out whether we are writing short or long decimals
+   */
+  interface Loader {
+/**
+ * Load the data from the values array into the ColumnVector.
+ * @param vector the output
+ * @param values the intput
+ * @param offset the first input value
+ * @param length the number of values to copy
+ */
+void loadData(ColumnVector vector, long[] values, int offset, int 
length);
+  }
+
+  static class Decimal64Loader implements Loa

[GitHub] orc pull request #252: ORC-345 Create and use a new Decimal64StatisticsImpl ...

2018-05-03 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/252#discussion_r185845484
  
--- Diff: java/bench/src/java/org/apache/orc/bench/DecimalBench.java ---
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.bench;
+
+import com.google.gson.JsonStreamParser;
+import org.apache.avro.file.DataFileReader;
+import org.apache.avro.generic.GenericDatumReader;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.io.DatumReader;
+import org.apache.avro.mapred.FsInput;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.Decimal64ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
+import 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.Reader;
+import org.apache.orc.RecordReader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.convert.BatchReader;
+import org.apache.orc.bench.convert.GenerateVariants;
+import org.apache.orc.bench.convert.csv.CsvReader;
+import org.apache.parquet.hadoop.ParquetInputFormat;
+import org.openjdk.jmh.annotations.AuxCounters;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Warmup;
+import org.openjdk.jmh.infra.Blackhole;
+import org.openjdk.jmh.runner.Runner;
+import org.openjdk.jmh.runner.options.OptionsBuilder;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.concurrent.TimeUnit;
+
+@BenchmarkMode(Mode.AverageTime)
+@Warmup(iterations=2, time=30, timeUnit = TimeUnit.SECONDS)
+@Measurement(iterations=10, time=30, timeUnit = TimeUnit.SECONDS)
+@State(Scope.Thread)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@Fork(2)
+public class DecimalBench {
+
+  private static final String ROOT_ENVIRONMENT_NAME = "bench.root.dir";
+  private static final Path root;
+  static {
+String value = System.getProperty(ROOT_ENVIRONMENT_NAME);
+root = value == null ? null : new Path(value);
+  }
+
+  /**
+   * Abstract out whether we are writing short or long decimals
+   */
+  interface Loader {
+/**
+ * Load the data from the values array into the ColumnVector.
+ * @param vector the output
+ * @param values the intput
+ * @param offset the first input value
+ * @param length the number of values to copy
+ */
+void loadData(ColumnVector vector, long[] values, int offset, int 
length);
+  }
+
+  static class Decimal64Loader implements Loa

[GitHub] orc pull request #252: ORC-345 Create and use a new Decimal64StatisticsImpl ...

2018-05-03 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/252#discussion_r185845008
  
--- Diff: java/bench/src/java/org/apache/orc/bench/DecimalBench.java ---
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.bench;
+
+import com.google.gson.JsonStreamParser;
+import org.apache.avro.file.DataFileReader;
+import org.apache.avro.generic.GenericDatumReader;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.io.DatumReader;
+import org.apache.avro.mapred.FsInput;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.Decimal64ColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
+import 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.Reader;
+import org.apache.orc.RecordReader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.convert.BatchReader;
+import org.apache.orc.bench.convert.GenerateVariants;
+import org.apache.orc.bench.convert.csv.CsvReader;
+import org.apache.parquet.hadoop.ParquetInputFormat;
+import org.openjdk.jmh.annotations.AuxCounters;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Warmup;
+import org.openjdk.jmh.infra.Blackhole;
+import org.openjdk.jmh.runner.Runner;
+import org.openjdk.jmh.runner.options.OptionsBuilder;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.concurrent.TimeUnit;
+
+@BenchmarkMode(Mode.AverageTime)
+@Warmup(iterations=2, time=30, timeUnit = TimeUnit.SECONDS)
+@Measurement(iterations=10, time=30, timeUnit = TimeUnit.SECONDS)
+@State(Scope.Thread)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@Fork(2)
+public class DecimalBench {
+
+  private static final String ROOT_ENVIRONMENT_NAME = "bench.root.dir";
+  private static final Path root;
+  static {
+String value = System.getProperty(ROOT_ENVIRONMENT_NAME);
+root = value == null ? null : new Path(value);
+  }
+
+  /**
+   * Abstract out whether we are writing short or long decimals
+   */
+  interface Loader {
+/**
+ * Load the data from the values array into the ColumnVector.
+ * @param vector the output
+ * @param values the intput
+ * @param offset the first input value
+ * @param length the number of values to copy
+ */
+void loadData(ColumnVector vector, long[] values, int offset, int 
length);
+  }
+
+  static class Decimal64Loader implements Loa

[GitHub] orc issue #252: ORC-345 Create and use a new Decimal64StatisticsImpl on the ...

2018-05-02 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/252
  
Ok, with this and ORC-344, the difference in writing small vs large 
decimals to a null file system is huge:

Benchmark   (precision)  Mode  Cnt   Score  Error  Units
DecimalBench.write8  avgt   30   36914.791 ±  866.408  us/op
DecimalBench.write   19  avgt   30  240789.318 ± 5146.880  us/op

So I'm seeing a 6x speed up when using precision = 8 with the new code path 
instead of p = 19 and the old code path.


---


[GitHub] orc issue #252: ORC-345 Create and use a new Decimal64StatisticsImpl on the ...

2018-05-02 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/252
  
Ok, I update this pull request:
* I changed updateDecimal64 to take a value and scale.
* I check for more overflows in Decimal64ColumnStatisticsImpl.
* I added a test that tests the new code.


---


[GitHub] orc pull request #262: ORC-354. Restore the benchmark module.

2018-05-01 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/262

ORC-354. Restore the benchmark module.

This reverts commit b86d70aa73289b86e066cc019ea11e0d83c1e40d.

Signed-off-by: Owen O'Malley <omal...@apache.org>

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-354

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/262.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #262


commit a97d6ca526be64da507ffa0cdb3324a95ec1c896
Author: Owen O'Malley <omalley@...>
Date:   2018-04-27T23:58:48Z

ORC-354. Restore the benchmark module.

This reverts commit b86d70aa73289b86e066cc019ea11e0d83c1e40d.

Signed-off-by: Owen O'Malley <omal...@apache.org>




---


[GitHub] orc issue #257: ORC-49. Create RLE-based encoding for short decimals in ORCv...

2018-04-27 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/257
  
Part of what we absolutely need in RLEv3 is zero suppression for the whole 
row batch. So that if you have integers like:

- 1000
- 2000
- 3000
- 1
- 4000
- 1000

it will record 3 for the number of zeros and encode:

- 1
- 2
- 3
- 10
- 4
- 1



---


[GitHub] orc issue #253: ORC-346: [C++] Add one second when writing negative Timestam...

2018-04-27 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/253
  
*Sigh* This is caused by the fact that Java defined Timestamp badly and 
then Hive copied those semantics into the ColumnVector.

We really need to fix these semantics in ORCv2.

I'll also comment that it would probably be nicer to our users if we 
weren't modifying the data in the columnVector they passed into us, but clearly 
we are already doing that.


---


[GitHub] orc pull request #257: ORC-49. Create RLE-based encoding for short decimals ...

2018-04-27 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/257

ORC-49. Create RLE-based encoding for short decimals in ORCv2.

Create the first pass of the new decimal encoding for ORCv2. For small 
decimals, it uses RLEv2 and the old format for long decimals.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-49

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #257


commit 0f50ad4a22e19ee06a97b6289a01dbc5671268c5
Author: Owen O'Malley <omalley@...>
Date:   2018-04-27T05:13:26Z

ORC-49. Create RLE-based encoding for short decimals in ORCv2.




---


[GitHub] orc pull request #252: ORC-345 Create and use a new Decimal64StatisticsImpl ...

2018-04-18 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/252

ORC-345 Create and use a new Decimal64StatisticsImpl on the write path

Currently the DecimalTreeWriter spends a lot of time maintaining the 
statistics for decimal objects. For short decimals (p <= 18) we can use longs 
to track the min, max, and sum.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-345

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/252.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #252


commit 12c37d717010e5b683074633b273b0bad0180e05
Author: Owen O'Malley <omalley@...>
Date:   2018-04-18T03:16:12Z

ORC-344. Support the new Decimal64ColumnVector.

Fixes #250

Signed-off-by: Owen O'Malley <omal...@apache.org>

commit e7c395ed2e5a68e480e31c4ba6d74237e5025a33
Author: Owen O'Malley <omalley@...>
Date:   2018-04-18T21:30:35Z

ORC-345. Create Decimal64StatisticsImpl.




---


[GitHub] orc pull request #:

2018-04-18 Thread omalley
Github user omalley commented on the pull request:


https://github.com/apache/orc/commit/240931203c6a1c7e43c010b3bfa8f6953cc92f1e#commitcomment-28641687
  
I did put a link in at the top of https://orc.apache.org/develop/

It felt more like a developer artifact that a user artifact, although 
adding a link somewhere in the documentation would be good.


---


[GitHub] orc pull request #250: ORC-344. Support the new Decimal64ColumnVector.

2018-04-17 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/250

ORC-344. Support the new Decimal64ColumnVector.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-344

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #250


commit a6b71376be6ace6b39bd9e2de63989c9b7b031ac
Author: Owen O'Malley <omalley@...>
Date:   2018-04-18T03:16:12Z

ORC-344. Support the new Decimal64ColumnVector.

Signed-off-by: Owen O'Malley <omal...@apache.org>




---


[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...

2018-04-17 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/249
  
Also note that the C++ reader already uses UTC for its 
TimestampColumnVector. :)


---


[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...

2018-04-17 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/249
  
@jcamachor I'd suggest a much simpler API:

- Instead of passing in the reader timezone, make a boolean option to 
useUtcForTimestamp.
- Extend TimestampColumnVector to have a boolean isUTC field.
- The TimestampTreeWriter can use the isUTC in the ColumnVector to 
determine if it is in UTC.
- The reader can set isUTC appropriately based on the option.

Thoughts?


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r182139793
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
--- End diff --

Just to clarify the answer "variant" means "one of the instances of 
something that varies." The unfortunately similar looking "varint" is an 
unofficial contraction of "variable length integer."


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r182138208
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
--- End diff --

No, I mean that the new RLEv3 needs to support both 64 bit and 128 bit 
integers. The specification of course could just define the spec using 128 
bits, but the Java and C++ implementation would likely be split.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r182137684
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
--- End diff --

I have some thoughts, but haven't done any work on it.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r182137566
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
--- End diff --

This is absolutely a work in progress. I'd love to see your proposal.


---


[GitHub] orc pull request #248: ORC-337. Move to storage-api 2.5.0.

2018-04-13 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/248

ORC-337. Move to storage-api 2.5.0.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-337

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #248


commit 74f11f8e0c5f44d0cd6a8a5218d151aabd3149c2
Author: Owen O'Malley <omalley@...>
Date:   2018-04-13T23:16:32Z

ORC-337. Move to storage-api 2.5.0.

Signed-off-by: Owen O'Malley <omal...@apache.org>




---


[GitHub] orc issue #247: ORC-339. Reorganize the ORC file format specification.

2018-04-13 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/247
  
Any other comments?


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181525137
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was change to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as an signed
+integer.
+
+In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodins are introduced and
--- End diff --

In ORCv2, we'll just pick a RLE and not leave it pickable.

In terms of the encoding names, I'm a bit torn. My original inclination 
would be to use DECIMAL64 and DECIMAL128 as encoding names. However, It would 
be nice to have the ability to use dictionaries, so we'd need dictionary forms 
of them too. Thoughts?


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181456234
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of 
false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+string representation is deprecated and DecimalStatistics uses integers
+which have better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+  message Int128 {
--- End diff --

Let's pull the Int128 out of DecimalStatistics. We will likely use it other 
places.

One concern with this representation is that -1 is pretty painful. You'll 
get highBits = -1, lowBits = -1, which will only take 1 byte for highBits, but 
9 bytes for lowBits (+ the 4 bytes of field identifiers & message length) = 14 
bytes total. Another alternative is to use the zigzag encoding for the combined 
128 bit value:

  optional uint64 minLow = 4;
  optional uint64 minHigh = 5;

p <= 18:
  minLow = zigzag(min)
  minHigh = 0

p > 18:
  minLow = low bits of zigzag(min)
  minHigh = high bits of zigzag(min)

That would have a representation of 1 byte each for minLow and minHigh + 2 
bytes for field identifier = 4. If we leave the Int128 level that would add an 
additional + 2 bytes. 


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181525462
  
--- Diff: site/_docs/encodings.md ---
@@ -122,6 +132,12 @@ DIRECT| PRESENT | Yes  | Boolean 
RLE
 DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
+DECIMAL_V1| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | Yes  | Signed Integer RLE v1
--- End diff --

We probably need a RLE128 that can just encode a int128 directly. Then we 
can just use the DATA stream directly.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r181437863
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
+   * Zero suppression
+   * Evaluate the rle subformats
+* Group stripe data into stripelets to enable Async IO for reads.
+* Reorder stripe data into (stripe metadata, index, dictionary, data)
+* Stop sorting dictionaries and record the sort order separately in the 
index.
+* Remove use of RLEv1 and RLEv2.
+* Remove non-utf8 bloom filter.
+* Use numeric value for decimal bloom filter.
--- End diff --

Agreed


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-12 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/247

ORC-339. Reorganize the ORC file format specification.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-339

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #247


commit 5c56d74d948a73f5c456e0e80ff0622505d6c1cf
Author: Owen O'Malley <omalley@...>
Date:   2018-04-12T22:03:00Z

ORC-339. Reorganize the ORC file format specification.




---


[GitHub] orc pull request #246: ORC-338. Workaround C++ compiler bug in xcode 9.3 by ...

2018-04-12 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/246

ORC-338. Workaround C++ compiler bug in xcode 9.3 by removing an inline

function.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-338

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #246






---


[GitHub] orc pull request #243: Update the site with more information about developin...

2018-04-10 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/243#discussion_r180641849
  
--- Diff: site/develop/committers.md ---
@@ -0,0 +1,62 @@
+---
+layout: page
+title: Project Members
+---
+
+## Project Members
+
+{% comment %}
+please sort by Apache Id
+{% endcomment %}
+Name| Apache Id| Role
+:-- | :--- | :---
+Aliaksei Sandryhaila| asandryh | PMC
+Chris Douglas   | cdouglas | PMC
+Chinna Rao Lalam| chinnaraol   | Committer
+Chaoyu Tang | ctang| Committer
+Carl Steinbach  | cws  | Committer
+Daniel Dai  | daijy| Committer
+Deepak Majeti   | mdeepak  | PMC
+Eugene Koifman  | ekoifman | PMC
+Gang Wu | gangwu   | Committer
+Alan Gates  | gates| PMC
+Gopal Vijayaraghavan| gopalv   | PMC
+Gunther Hagleitner  | gunther  | Committer
+Ashutosh Chauhan| hashutosh| Committer
+Jesus Camacho Rodriguez | jcamacho | Committer
+Jason Dere  | jdere| Committer
+Jimmy Xiang | jxiang   | Committer
+Kevin Wilfong   | kevinwilfong | Committer
+Lars Francke| larsfrancke  | Committer
+Lefty Leverenz  | leftyl   | PMC
+Rui Li  | lirui| Committer
+Mithun Radhakrishnan| mithun   | Committer
+Matthew McCline | mmccline | Committer
+Naveen Gangam   | ngangam  | Committer
+Owen O'Malley   | omalley  | PMC
+Prasanth Jayachandran   | prasanthj| PMC
+Pengcheng Xiong | pxiong   | Committer
+Rajesh Balamohan| rbalamohan   | Committer
+Sergey Shelukhin| sershe   | Committer
+Sergio Pena | spena| Committer
+Siddharth Seth  | sseth| Committer
+Stephen Walkauskas  | swalkaus | Committer
+Vaibhav Gumashta| vgumashta| Committer
+Wei Zheng   | weiz | Committer
+Xiening Dai | xndai| Committer
+Xuefu Zhang | xuefu| Committer
+Ferdinand Xu| xuf  | Committer
+Yongzhi Chen| ychena   | Committer
+Aihua Xu| zihuaxu  | Committer
+
+Companies with employees that are committers:
+
+* Alibaba
+* Cloudera
+* Facebook
+* Hewlett Packard Enterprise
--- End diff --

Done


---


[GitHub] orc pull request #243: Update the site with more information about developin...

2018-04-10 Thread omalley
Github user omalley closed the pull request at:

https://github.com/apache/orc/pull/243


---


[GitHub] orc issue #243: Update the site with more information about developing.

2018-04-10 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/orc/pull/243
  
I moved the committer list to a new page, because it made all of the other 
information get lost below the long list.

I added sections on how to set up git, approve pull requests, create a gpg 
key, and the dist directory.

I also added Xiening Dai and Gang Wu as committers.


---


[GitHub] orc pull request #243: Update the site with more information about developin...

2018-04-10 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/243

Update the site with more information about developing.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc dev-site

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #243


commit 079a23dc5cd1016a51be6eb5df80226c9ebc5a54
Author: Owen O'Malley <omalley@...>
Date:   2018-04-10T22:36:51Z

Update the site with more information about developing.

Signed-off-by: Owen O'Malley <omal...@apache.org>




---


  1   2   3   >