[GitHub] orc issue #308: Convert tool should create a lowercase schema
Github user omalley commented on the issue: https://github.com/apache/orc/pull/308 Thinking about this more, I'd like to suggest that if the user asks for a lowercase schema, the fields of a struct should be lowercased as they are parsed. That will prevent "aa" and "AA" from both being converted to "aa" without being merged into a single field. ---
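The collision omalley describes can be illustrated with a short sketch: if field names are lowercased as the struct is parsed, "aa" and "AA" collide immediately instead of silently yielding two fields with the same name. The `lowercaseFields` helper and the plain `Map` below are hypothetical stand-ins for the convert tool's actual `TypeDescription` parsing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LowercaseFields {
  // Lowercase struct field names as they are parsed, failing fast on
  // collisions such as "aa" vs "AA". Illustrative only: the real tool
  // builds a TypeDescription, not a Map of name -> type string.
  static Map<String, String> lowercaseFields(Map<String, String> fields) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : fields.entrySet()) {
      String lower = e.getKey().toLowerCase();
      if (result.put(lower, e.getValue()) != null) {
        throw new IllegalArgumentException(
            "Field name collision after lowercasing: " + lower);
      }
    }
    return result;
  }
}
```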
[GitHub] orc issue #312: fix ORC-237 OrcFile.mergeFiles
Github user omalley commented on the issue: https://github.com/apache/orc/pull/312 @yiyezhiqiu233 do you have an Apache jira ID? I'd like to assign ORC-237 to you. ---
[GitHub] orc pull request #315: ORC-411: Update pom file to work with openjdk 10.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/315 ORC-411: Update pom file to work with openjdk 10. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-411 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/315.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #315 commit 6be929f239dadd18ff35bff9c072fd315008354f Author: Owen O'Malley Date: 2018-09-30T04:58:29Z ORC-411: Update pom file to work with openjdk 10. ---
[GitHub] orc issue #308: Convert tool should create a lowercase schema
Github user omalley commented on the issue: https://github.com/apache/orc/pull/308 Actually, this doesn't match my experience. Hive doesn't support uppercase letters in the top-level column names, but it does support them in sub-structs. Furthermore, ORC from other systems can handle the uppercase letters. You might also try setting "orc.schema.evolution.case.sensitive" to false to get name-based matching of the ORC types that isn't case sensitive. In terms of this patch, can you add a parameter that allows the user to downcase the column names? ---
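For reference, the setting mentioned above would look like the fragment below (not runnable on its own; it assumes Hadoop's `Configuration` is on the classpath, with the key string taken verbatim from the comment):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Match reader and file columns by name, ignoring case,
// so "AA" in the file matches a reader column named "aa".
conf.setBoolean("orc.schema.evolution.case.sensitive", false);
```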
[GitHub] orc pull request #309: ORC-403: [C++] Add checks to avoid invalid offsets in...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/309#discussion_r218244811 --- Diff: c++/src/Reader.cc --- @@ -498,6 +498,12 @@ namespace orc { const proto::Stream& stream = currentStripeFooter.streams(i); uint64_t length = static_cast<uint64_t>(stream.length()); if (static_cast<StreamKind>(stream.kind()) == StreamKind::StreamKind_ROW_INDEX) { +if (offset + length > fileLength) { --- End diff -- This check is really good, but it would also be nice to check that the stream is within the stripe, although you'd need to pass that in also. We could pass the whole proto::StripeInformation in. ---
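The check under discussion lives in the C++ reader, but the idea generalizes: a naive `offset + length > fileLength` can wrap around on hostile metadata, so it is safer to compare the length against the remaining space. A small Java sketch of the idea (the `inBounds` helper is hypothetical, not ORC code):

```java
class StreamBounds {
  // Overflow-safe containment check: the stream [offset, offset+length)
  // must lie inside [0, fileLength). Comparing length against the
  // remaining space avoids wrap-around from computing offset + length.
  static boolean inBounds(long offset, long length, long fileLength) {
    return offset >= 0 && length >= 0 && offset <= fileLength
        && length <= fileLength - offset;
  }
}
```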
[GitHub] orc pull request #306: ORC-363: Enable zstd for java writer/reader
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/306#discussion_r217801734 --- Diff: java/pom.xml --- @@ -68,8 +68,8 @@ ${project.build.directory}/testing-tmp ${project.basedir}/../../examples -2.2.0 -2.7.3 +2.9.0 --- End diff -- We shouldn't bump the minimum version of Hadoop to 2.9. Let's instead put it into the shims. We'll need to add a method like createZstdCodec to HadoopShims. We'll also need a new implementation for HadoopShimsPre2_9 that handles variable length blocks, but not zstd. You'll also need a graceful message if they don't have the native libraries available. ---
[GitHub] orc issue #299: ORC-203 - Update StringStatistics to trim long strings to 10...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/299 I simplified the code to get the upper bound. I also added 4 test cases to test the different number of bytes in the utf-8 characters. See https://github.com/omalley/orc/tree/orc-203 ---
[GitHub] orc issue #305: ORC-399. Move Java compiler to version 8.
Github user omalley commented on the issue: https://github.com/apache/orc/pull/305 It looks like the Calcite pom was pulling in some jar that is missing from Maven Central. ---
[GitHub] orc pull request #305: ORC-399. Move Java compiler to version 8.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/305 ORC-399. Move Java compiler to version 8. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-399 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/305.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #305 commit f093118410710c750d84d78d18c7d952254dda09 Author: Owen O'Malley Date: 2018-08-31T22:21:49Z ORC-399. Move Java compiler to version 8. ---
[GitHub] orc issue #299: ORC-203 - Update StringStatistics to trim long strings to 10...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/299 Please make a couple more changes: * Revert the change to java/bench/README.md. * We should add a new WriterVersion (ORC_203) so that if there is a bug in the new code, we can try to recover in the reader. ---
[GitHub] orc pull request #304: ORC-397. Allow selective disabling of dictionary enco...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/304#discussion_r213566036 --- Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java --- @@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary() throws Exception { } + /** + * Test that dictionaries can be disabled, per column. In this test, we want to disable DICTIONARY_V2 for the + * `longString` column (presumably for a low hit-ratio), while preserving DICTIONARY_V2 for `shortString`. + * @throws Exception on unexpected failure + */ + @Test + public void testDisableDictionaryForSpecificColumn() throws Exception { +final String SHORT_STRING_VALUE = "foo"; +final String LONG_STRING_VALUE = "BR!!"; + +TypeDescription schema = + TypeDescription.fromString("struct<shortString:string,longString:string>"); + +Writer writer = OrcFile.createWriter( +testFilePath, +OrcFile.writerOptions(conf).setSchema(schema) +.compress(CompressionKind.NONE) +.bufferSize(1) +.directEncodingColumns("longString")); --- End diff -- We shouldn't change the default behavior of deciding between direct/dictionary encoding based on the data. With the parameter being the columns that are being forced to be direct, the default is the empty string, which is straightforward. Does that make sense? ---
[GitHub] orc issue #304: ORC-397. Allow selective disabling of dictionary encoding.
Github user omalley commented on the issue: https://github.com/apache/orc/pull/304 I pushed an update that simplified the test case a bit. ---
[GitHub] orc issue #304: ORC-397. Allow selective disabling of dictionary encoding.
Github user omalley commented on the issue: https://github.com/apache/orc/pull/304 Oh, rather than fix WriterImplV2 I simplified it to remove the duplicated code. ---
[GitHub] orc pull request #304: ORC-397. Allow selective disabling of dictionary enco...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/304 ORC-397. Allow selective disabling of dictionary encoding. I've forward ported the patch from Mithun. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-397 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/304.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #304 commit b00a52e668efa9fb06821d7da52706ace81e31c0 Author: Owen O'Malley Date: 2018-08-28T23:12:05Z ORC-397. Allow selective disabling of dictionary encoding. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208394170 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -683,6 +787,150 @@ public int hashCode() { result = 31 * result + (int) (sum ^ (sum >>> 32)); return result; } + +/** + * A helper function that truncates the {@link Text} input + * based on {@link #MAX_BYTES_RECORDED} and increments + * the last codepoint by 1. + * @param text + * @return truncated Text value + */ +private static Text truncateUpperBound(final Text text) { + + if(text.getBytes().length > MAX_BYTES_RECORDED) { +return truncateUpperBound(text.getBytes()); + } else { +return text; + } + +} + +/** + * A helper function that truncates the {@link byte[]} input + * based on {@link #MAX_BYTES_RECORDED} and increments + * the last codepoint by 1. + * @param text + * @return truncated Text value + */ +private static Text truncateUpperBound(final byte[] text) { + if(text.length > MAX_BYTES_RECORDED) { +final Text truncated = truncateLowerBound(text); +final byte[] data = truncated.getBytes(); + +int lastCharPosition = data.length - 1; +int offset = 0; + +/* we don't expect characters more than 5 bytes */ +for (int i = 0; i < 5; i++) { + final byte b = data[lastCharPosition]; + offset = getCharLength(b); + + /* found beginning of a valid char */ + if (offset > 0) { +final byte[] lastCharBytes = Arrays +.copyOfRange(text, lastCharPosition, lastCharPosition + offset); +/* last character */ +final String s = new String(lastCharBytes, Charset.forName("UTF-8")); + +/* increment the codepoint of last character */ +int codePoint = s.codePointAt(s.length() - 1); +codePoint++; +final char[] incrementedChars = Character.toChars(codePoint); + +/* convert char array to byte array */ +final CharBuffer charBuffer = CharBuffer.wrap(incrementedChars); +final ByteBuffer byteBuffer = Charset.forName("UTF-8").encode(charBuffer); +final byte[] bytes = 
Arrays.copyOfRange(byteBuffer.array(), byteBuffer.position(), +byteBuffer.limit()); + +final byte[] result = new byte[lastCharPosition + bytes.length]; + +/* copy truncated array minus last char */ +System.arraycopy(text, 0, result, 0, lastCharPosition); +/* copy last char */ +System.arraycopy(bytes, 0, result, lastCharPosition, bytes.length); + +return new Text(result); + + } /* not found keep looking for a beginning byte */ else { +--lastCharPosition; + } + +} +/* beginning of a valid char not found */ +throw new IllegalArgumentException( +"Could not truncate string, beginning of a valid char not found"); + } else { +return new Text(text); + } +} + +private static Text truncateLowerBound(final Text text) { + if(text.getBytes().length > MAX_BYTES_RECORDED) { +return truncateLowerBound(text.getBytes()); + } else { +return text; + } +} + + +private static Text truncateLowerBound(final byte[] text) { --- End diff -- You need to pass the length in here as well. The byte array may be longer than the length of the data. ---
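The truncate-and-increment idea in `truncateUpperBound` can be sketched without Hadoop's `Text`: cut the string back to a code-point boundary that fits the byte budget, then bump the last remaining code point so the result still compares greater than or equal to every string sharing the truncated prefix. This simplified sketch skips the surrogate and overflow edge cases the real patch must handle:

```java
import java.nio.charset.StandardCharsets;

class UpperBound {
  // Truncate to at most maxBytes of UTF-8, then increment the final
  // code point so the result sorts >= any string with that prefix.
  // Simplified: assumes the incremented code point is still valid.
  static String truncateUpperBound(String value, int maxBytes) {
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    if (utf8.length <= maxBytes) {
      return value;
    }
    // Walk back whole code points until the prefix fits the budget.
    int end = value.length();
    while (value.substring(0, end)
               .getBytes(StandardCharsets.UTF_8).length > maxBytes) {
      end = value.offsetByCodePoints(end, -1);
    }
    int lastStart = value.offsetByCodePoints(end, -1);
    int bumped = value.codePointAt(lastStart) + 1;
    return value.substring(0, lastStart)
        + new String(Character.toChars(bumped));
  }
}
```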
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208402722 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -683,6 +787,150 @@ public int hashCode() { result = 31 * result + (int) (sum ^ (sum >>> 32)); return result; } + +/** + * A helper function that truncates the {@link Text} input + * based on {@link #MAX_BYTES_RECORDED} and increments + * the last codepoint by 1. + * @param text + * @return truncated Text value + */ +private static Text truncateUpperBound(final Text text) { + + if(text.getBytes().length > MAX_BYTES_RECORDED) { +return truncateUpperBound(text.getBytes()); + } else { +return text; + } + +} + +/** + * A helper function that truncates the {@link byte[]} input + * based on {@link #MAX_BYTES_RECORDED} and increments + * the last codepoint by 1. + * @param text + * @return truncated Text value + */ +private static Text truncateUpperBound(final byte[] text) { + if(text.length > MAX_BYTES_RECORDED) { +final Text truncated = truncateLowerBound(text); +final byte[] data = truncated.getBytes(); + +int lastCharPosition = data.length - 1; +int offset = 0; + +/* we don't expect characters more than 5 bytes */ +for (int i = 0; i < 5; i++) { + final byte b = data[lastCharPosition]; + offset = getCharLength(b); + + /* found beginning of a valid char */ + if (offset > 0) { +final byte[] lastCharBytes = Arrays +.copyOfRange(text, lastCharPosition, lastCharPosition + offset); +/* last character */ +final String s = new String(lastCharBytes, Charset.forName("UTF-8")); + +/* increment the codepoint of last character */ +int codePoint = s.codePointAt(s.length() - 1); +codePoint++; +final char[] incrementedChars = Character.toChars(codePoint); + +/* convert char array to byte array */ +final CharBuffer charBuffer = CharBuffer.wrap(incrementedChars); +final ByteBuffer byteBuffer = Charset.forName("UTF-8").encode(charBuffer); +final byte[] bytes = 
Arrays.copyOfRange(byteBuffer.array(), byteBuffer.position(), +byteBuffer.limit()); + +final byte[] result = new byte[lastCharPosition + bytes.length]; + +/* copy truncated array minus last char */ +System.arraycopy(text, 0, result, 0, lastCharPosition); +/* copy last char */ +System.arraycopy(bytes, 0, result, lastCharPosition, bytes.length); + +return new Text(result); --- End diff -- A better pattern is: Text result = new Text(); result.setCapacity(lastCharPosition + bytes.length); result.set(text, 0, lastCharPosition); result.append(bytes, 0, bytes.length); return result; ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208398251 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -584,16 +630,40 @@ public void merge(ColumnStatisticsImpl other) { if (str.minimum != null) { maximum = new Text(str.getMaximum()); minimum = new Text(str.getMinimum()); - } else { + } + /* str.minimum == null when lower bound set */ + else if (str.isLowerBoundSet) { +minimum = new Text(str.getLowerBound()); +isLowerBoundSet = true; + +/* check for upper bound before setting max */ +if (str.isUpperBoundSet) { + maximum = new Text(str.getUpperBound()); + isUpperBoundSet = true; +} else { + maximum = new Text(str.getMaximum()); +} + } + else { /* both are empty */ maximum = minimum = null; } } else if (str.minimum != null) { if (minimum.compareTo(str.minimum) > 0) { -minimum = new Text(str.getMinimum()); +if(str.isLowerBoundSet) { + minimum = new Text(str.getLowerBound()); + isLowerBoundSet = true; +} else { + minimum = new Text(str.getMinimum()); +} } if (maximum.compareTo(str.maximum) < 0) { -maximum = new Text(str.getMaximum()); +if(str.isUpperBoundSet) { --- End diff -- The same simplification would hold here. maximum = new Text(str.maximum) isUpperBoundSet = str.isUpperBoundSet ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208392941 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -543,35 +551,73 @@ public void reset() { super.reset(); minimum = null; maximum = null; + isLowerBoundSet = false; + isUpperBoundSet = false; sum = 0; } @Override public void updateString(Text value) { if (minimum == null) { -maximum = minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + maximum = truncateUpperBound(value); + isLowerBoundSet = true; + isUpperBoundSet = true; +} else { + maximum = minimum = new Text(value); +} } else if (minimum.compareTo(value) > 0) { -minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + isLowerBoundSet = true; +}else { + minimum = new Text(value); +} } else if (maximum.compareTo(value) < 0) { -maximum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + maximum = truncateUpperBound(value); + isUpperBoundSet = true; +} else { + maximum = new Text(value); --- End diff -- This also needs to clear isUpperBoundSet. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208397080 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -584,16 +630,40 @@ public void merge(ColumnStatisticsImpl other) { if (str.minimum != null) { maximum = new Text(str.getMaximum()); minimum = new Text(str.getMinimum()); - } else { + } --- End diff -- Ok, you should keep the "else" on the same line as the "}", but I don't think you need this new branch. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208392060 --- Diff: java/bench/README.md --- @@ -1,3 +1,20 @@ + --- End diff -- I don't need this, even when I compile with ANALYZE_JAVA. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208401986 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -683,6 +787,150 @@ public int hashCode() { result = 31 * result + (int) (sum ^ (sum >>> 32)); return result; } + +/** + * A helper function that truncates the {@link Text} input + * based on {@link #MAX_BYTES_RECORDED} and increments + * the last codepoint by 1. + * @param text + * @return truncated Text value + */ +private static Text truncateUpperBound(final Text text) { + + if(text.getBytes().length > MAX_BYTES_RECORDED) { +return truncateUpperBound(text.getBytes()); + } else { +return text; + } + +} + +/** + * A helper function that truncates the {@link byte[]} input + * based on {@link #MAX_BYTES_RECORDED} and increments + * the last codepoint by 1. + * @param text + * @return truncated Text value + */ +private static Text truncateUpperBound(final byte[] text) { + if(text.length > MAX_BYTES_RECORDED) { +final Text truncated = truncateLowerBound(text); +final byte[] data = truncated.getBytes(); + +int lastCharPosition = data.length - 1; +int offset = 0; + +/* we don't expect characters more than 5 bytes */ +for (int i = 0; i < 5; i++) { + final byte b = data[lastCharPosition]; + offset = getCharLength(b); + + /* found beginning of a valid char */ + if (offset > 0) { +final byte[] lastCharBytes = Arrays +.copyOfRange(text, lastCharPosition, lastCharPosition + offset); +/* last character */ +final String s = new String(lastCharBytes, Charset.forName("UTF-8")); + +/* increment the codepoint of last character */ +int codePoint = s.codePointAt(s.length() - 1); +codePoint++; +final char[] incrementedChars = Character.toChars(codePoint); + +/* convert char array to byte array */ +final CharBuffer charBuffer = CharBuffer.wrap(incrementedChars); +final ByteBuffer byteBuffer = Charset.forName("UTF-8").encode(charBuffer); +final byte[] bytes = 
Arrays.copyOfRange(byteBuffer.array(), byteBuffer.position(), +byteBuffer.limit()); + +final byte[] result = new byte[lastCharPosition + bytes.length]; + +/* copy truncated array minus last char */ +System.arraycopy(text, 0, result, 0, lastCharPosition); +/* copy last char */ +System.arraycopy(bytes, 0, result, lastCharPosition, bytes.length); + +return new Text(result); + + } /* not found keep looking for a beginning byte */ else { +--lastCharPosition; + } + +} +/* beginning of a valid char not found */ +throw new IllegalArgumentException( +"Could not truncate string, beginning of a valid char not found"); + } else { +return new Text(text); + } +} + +private static Text truncateLowerBound(final Text text) { + if(text.getBytes().length > MAX_BYTES_RECORDED) { +return truncateLowerBound(text.getBytes()); + } else { +return text; + } +} + + +private static Text truncateLowerBound(final byte[] text) { + + if(text.length > MAX_BYTES_RECORDED) { + +int truncateLen = MAX_BYTES_RECORDED; +int offset = 0; + +for(int i=0; i<5; i++) { + + byte b = text[truncateLen]; + /* check for the beginning of 1,2,3,4,5 bytes long char */ + offset = getCharLength(b); + + /* found beginning of a valid char */ + if(offset > 0) { +byte[] truncated = Arrays.copyOfRange(text, 0, (truncateLen)); --- End diff -- If you do: Text result = new Text(); result.set(text, 0, truncateLen); return result; you'll have 1 less copy of the bytes. ---
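The lower-bound truncation discussed in this thread, cutting at `MAX_BYTES_RECORDED` and backing up to a UTF-8 start byte so no multi-byte character is split, can be sketched without Hadoop's `Text` (a simplified illustration, not the patch's code):

```java
import java.util.Arrays;

class LowerBound {
  // In UTF-8, a byte starts a character unless it matches 10xxxxxx.
  static boolean isCharStart(byte b) {
    return (b & 0xC0) != 0x80;
  }

  // Keep at most maxBytes, backing up so the cut lands on a
  // code-point boundary. The truncated prefix sorts <= the original.
  static byte[] truncateLowerBound(byte[] utf8, int maxBytes) {
    if (utf8.length <= maxBytes) {
      return utf8;
    }
    int cut = maxBytes;
    // Never split a multi-byte character: back up to a start byte.
    while (cut > 0 && !isCharStart(utf8[cut])) {
      cut--;
    }
    return Arrays.copyOfRange(utf8, 0, cut);
  }
}
```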
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208393137 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -543,35 +551,73 @@ public void reset() { super.reset(); minimum = null; maximum = null; + isLowerBoundSet = false; + isUpperBoundSet = false; sum = 0; } @Override public void updateString(Text value) { if (minimum == null) { -maximum = minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + maximum = truncateUpperBound(value); + isLowerBoundSet = true; + isUpperBoundSet = true; +} else { + maximum = minimum = new Text(value); +} } else if (minimum.compareTo(value) > 0) { -minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + isLowerBoundSet = true; +}else { + minimum = new Text(value); +} } else if (maximum.compareTo(value) < 0) { -maximum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + maximum = truncateUpperBound(value); + isUpperBoundSet = true; +} else { + maximum = new Text(value); +} } sum += value.getLength(); } + @Override public void updateString(byte[] bytes, int offset, int length, int repetitions) { + byte[] input = Arrays.copyOfRange(bytes, offset, offset+(length)); --- End diff -- Please don't copy the array unless it is needed. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208392855 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -543,35 +551,73 @@ public void reset() { super.reset(); minimum = null; maximum = null; + isLowerBoundSet = false; + isUpperBoundSet = false; sum = 0; } @Override public void updateString(Text value) { if (minimum == null) { -maximum = minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + maximum = truncateUpperBound(value); + isLowerBoundSet = true; + isUpperBoundSet = true; +} else { + maximum = minimum = new Text(value); +} } else if (minimum.compareTo(value) > 0) { -minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + isLowerBoundSet = true; +}else { + minimum = new Text(value); --- End diff -- This needs to set isLowerBoundSet = false. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208398373 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -683,6 +787,150 @@ public int hashCode() { result = 31 * result + (int) (sum ^ (sum >>> 32)); return result; } + +/** + * A helper function that truncates the {@link Text} input + * based on {@link #MAX_BYTES_RECORDED} and increments + * the last codepoint by 1. + * @param text + * @return truncated Text value + */ +private static Text truncateUpperBound(final Text text) { + + if(text.getBytes().length > MAX_BYTES_RECORDED) { +return truncateUpperBound(text.getBytes()); + } else { +return text; + } + +} + +/** + * A helper function that truncates the {@link byte[]} input + * based on {@link #MAX_BYTES_RECORDED} and increments + * the last codepoint by 1. + * @param text + * @return truncated Text value + */ +private static Text truncateUpperBound(final byte[] text) { --- End diff -- You need the length passed in here. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208395531 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -584,16 +630,40 @@ public void merge(ColumnStatisticsImpl other) { if (str.minimum != null) { maximum = new Text(str.getMaximum()); --- End diff -- You'll need to copy the booleans from str also. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208393268 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -543,35 +551,73 @@ public void reset() { super.reset(); minimum = null; maximum = null; + isLowerBoundSet = false; + isUpperBoundSet = false; sum = 0; } @Override public void updateString(Text value) { if (minimum == null) { -maximum = minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + maximum = truncateUpperBound(value); + isLowerBoundSet = true; + isUpperBoundSet = true; +} else { + maximum = minimum = new Text(value); +} } else if (minimum.compareTo(value) > 0) { -minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + isLowerBoundSet = true; +}else { + minimum = new Text(value); +} } else if (maximum.compareTo(value) < 0) { -maximum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + maximum = truncateUpperBound(value); + isUpperBoundSet = true; +} else { + maximum = new Text(value); +} } sum += value.getLength(); } + @Override public void updateString(byte[] bytes, int offset, int length, int repetitions) { + byte[] input = Arrays.copyOfRange(bytes, offset, offset+(length)); if (minimum == null) { -maximum = minimum = new Text(); -maximum.set(bytes, offset, length); +if(length > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(input); + maximum = truncateUpperBound(input); + isLowerBoundSet = true; + isUpperBoundSet = true; +} else { + maximum = minimum = new Text(); + maximum.set(bytes, offset, length); +} } else if (WritableComparator.compareBytes(minimum.getBytes(), 0, minimum.getLength(), bytes, offset, length) > 0) { -minimum = new Text(); -minimum.set(bytes, offset, length); +if(length > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(input); + isLowerBoundSet = true; +} else { + minimum = new Text(); + 
minimum.set(bytes, offset, length); --- End diff -- Clear isLowerBoundSet. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208392450 --- Diff: java/core/src/java/org/apache/orc/StringColumnStatistics.java --- @@ -33,6 +33,20 @@ */ String getMaximum(); + /** + * Get the string with + * length = Min(StringStatisticsImpl.MAX_BYTES_RECORDED, getMinimum()) --- End diff -- I would phrase this as "Get the lower bound of the values in this column. The value may be truncated to at most MAX_BYTES_RECORDED." ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208392508 --- Diff: java/core/src/java/org/apache/orc/StringColumnStatistics.java --- @@ -33,6 +33,20 @@ */ String getMaximum(); + /** + * Get the string with + * length = Min(StringStatisticsImpl.MAX_BYTES_RECORDED, getMinimum()) + * @return lower bound + */ + String getLowerBound(); + + /** + * Get the string with + * length = Min(StringStatisticsImpl.MAX_BYTES_RECORDED, getMaximum()) --- End diff -- Same here. ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208398032 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -584,16 +630,40 @@ public void merge(ColumnStatisticsImpl other) { if (str.minimum != null) { maximum = new Text(str.getMaximum()); minimum = new Text(str.getMinimum()); - } else { + } + /* str.minimum == null when lower bound set */ + else if (str.isLowerBoundSet) { +minimum = new Text(str.getLowerBound()); +isLowerBoundSet = true; + +/* check for upper bound before setting max */ +if (str.isUpperBoundSet) { + maximum = new Text(str.getUpperBound()); + isUpperBoundSet = true; +} else { + maximum = new Text(str.getMaximum()); +} + } + else { /* both are empty */ maximum = minimum = null; } } else if (str.minimum != null) { if (minimum.compareTo(str.minimum) > 0) { -minimum = new Text(str.getMinimum()); +if(str.isLowerBoundSet) { + minimum = new Text(str.getLowerBound()); + isLowerBoundSet = true; +} else { + minimum = new Text(str.getMinimum()); --- End diff -- You could simplify this as: minimum = new Text(str.minimum) isLowerBoundSet = str.isLowerBoundSet; ---
[GitHub] orc pull request #299: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/299#discussion_r208393342 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -543,35 +551,73 @@ public void reset() { super.reset(); minimum = null; maximum = null; + isLowerBoundSet = false; + isUpperBoundSet = false; sum = 0; } @Override public void updateString(Text value) { if (minimum == null) { -maximum = minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + maximum = truncateUpperBound(value); + isLowerBoundSet = true; + isUpperBoundSet = true; +} else { + maximum = minimum = new Text(value); +} } else if (minimum.compareTo(value) > 0) { -minimum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(value); + isLowerBoundSet = true; +}else { + minimum = new Text(value); +} } else if (maximum.compareTo(value) < 0) { -maximum = new Text(value); +if(value.getLength() > MAX_BYTES_RECORDED) { + maximum = truncateUpperBound(value); + isUpperBoundSet = true; +} else { + maximum = new Text(value); +} } sum += value.getLength(); } + @Override public void updateString(byte[] bytes, int offset, int length, int repetitions) { + byte[] input = Arrays.copyOfRange(bytes, offset, offset+(length)); if (minimum == null) { -maximum = minimum = new Text(); -maximum.set(bytes, offset, length); +if(length > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(input); + maximum = truncateUpperBound(input); + isLowerBoundSet = true; + isUpperBoundSet = true; +} else { + maximum = minimum = new Text(); + maximum.set(bytes, offset, length); +} } else if (WritableComparator.compareBytes(minimum.getBytes(), 0, minimum.getLength(), bytes, offset, length) > 0) { -minimum = new Text(); -minimum.set(bytes, offset, length); +if(length > MAX_BYTES_RECORDED) { + minimum = truncateLowerBound(input); + isLowerBoundSet = true; +} else { + minimum = new Text(); + 
minimum.set(bytes, offset, length); +} } else if (WritableComparator.compareBytes(maximum.getBytes(), 0, maximum.getLength(), bytes, offset, length) < 0) { -maximum = new Text(); -maximum.set(bytes, offset, length); +if(length > MAX_BYTES_RECORDED) { + maximum = truncateUpperBound(input); + isUpperBoundSet = true; +} else { + maximum = new Text(); + maximum.set(bytes, offset, length); --- End diff -- Clear isUpperBoundSet. ---
[GitHub] orc issue #292: ORC-203 - Update StringStatistics to trim long strings to 10...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/292 I'm sorry, I accidentally pushed this while it was still in review. I've reverted the change, but please reopen this PR. ---
[GitHub] orc pull request #298: ORC-393. Add ORC snapcraft definition.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/298 ORC-393. Add ORC snapcraft definition. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-393 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/298.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #298 commit 9c82a14d8dd276fe4e3afcc87809c8ee9e97f843 Author: Owen O'Malley Date: 2018-08-03T19:53:37Z ORC-393. Add ORC snapcraft definition. ---
[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/292#discussion_r205179930 --- Diff: java/bench/README.md --- @@ -1,3 +1,20 @@ + --- End diff -- Let me check why this got flagged. We should have an exclusion for *.md files. ---
[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/292#discussion_r205179439 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -621,12 +703,54 @@ public void merge(ColumnStatisticsImpl other) { @Override public String getMinimum() { - return minimum == null ? null : minimum.toString(); + /* if we have lower bound set (in case of truncation) + getMinimum will be null */ + if(isLowerBoundSet) { +return null; + } else { +return minimum == null ? null : minimum.toString(); + } } @Override public String getMaximum() { - return maximum == null ? null : maximum.toString(); + /* if we have upper bound is set (in case of truncation) + getMaximum will be null */ + if(isUpperBoundSet) { +return null; + } else { +return maximum == null ? null : maximum.toString(); + } +} + +/** + * Get the string with + * length = Min(StringStatisticsImpl.MAX_STRING_LENGTH_RECORDED, getMinimum()) + * + * @return lower bound + */ +@Override +public String getLowerBound() { + if(isLowerBoundSet) { +return minimum.toString(); + } else { +return null; + } +} + +/** + * Get the string with + * length = Min(StringStatisticsImpl.MAX_STRING_LENGTH_RECORDED, getMaximum()) + * + * @return upper bound + */ +@Override +public String getUpperBound() { + if(isUpperBoundSet) { +return maximum.toString(); + } else { +return null; --- End diff -- This should always return maximum. ---
[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/292#discussion_r205180384 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -517,10 +519,14 @@ public int hashCode() { protected static final class StringStatisticsImpl extends ColumnStatisticsImpl implements StringColumnStatistics { +public static final int MAX_STRING_LENGTH_RECORDED = 1024; --- End diff -- We should probably use the number of bytes instead of the number of characters. I'd propose that we always truncate a whole number of characters, with a maximum of 1024 bytes. For ASCII strings, it will have the same result, but strings with multi-byte characters may end up between 1021 and 1024 bytes long. ---
[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/292#discussion_r205180578 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -543,35 +549,87 @@ public void reset() { super.reset(); minimum = null; maximum = null; + isLowerBoundSet = false; + isUpperBoundSet = false; sum = 0; } @Override public void updateString(Text value) { if (minimum == null) { -maximum = minimum = new Text(value); +if(value.toString().length() > MAX_STRING_LENGTH_RECORDED) { + minimum = new Text( --- End diff -- Let's make a corresponding truncateLowerBound so that we can have one place to change. ---
[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/292#discussion_r205179355 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -621,12 +703,54 @@ public void merge(ColumnStatisticsImpl other) { @Override public String getMinimum() { - return minimum == null ? null : minimum.toString(); + /* if we have lower bound set (in case of truncation) + getMinimum will be null */ + if(isLowerBoundSet) { +return null; + } else { +return minimum == null ? null : minimum.toString(); + } } @Override public String getMaximum() { - return maximum == null ? null : maximum.toString(); + /* if we have upper bound is set (in case of truncation) + getMaximum will be null */ + if(isUpperBoundSet) { +return null; + } else { +return maximum == null ? null : maximum.toString(); + } +} + +/** + * Get the string with + * length = Min(StringStatisticsImpl.MAX_STRING_LENGTH_RECORDED, getMinimum()) + * + * @return lower bound + */ +@Override +public String getLowerBound() { + if(isLowerBoundSet) { +return minimum.toString(); + } else { +return null; --- End diff -- This should always return minimum. ---
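The getter semantics the two review comments above converge on can be sketched as follows: the exact getters return null once truncation has occurred, while the bound getters always return whatever was stored, truncated or not. This is a hypothetical minimal sketch (field names follow the diff; `record` is an invented helper), not the actual ColumnStatisticsImpl code:

```java
// A hedged sketch of the requested getter semantics, not ORC's real class.
public class StringStatsSketch {
  private String minimum;           // possibly a truncated lower bound
  private boolean isLowerBoundSet;  // true when minimum was truncated

  // Exact minimum: unavailable if we only kept a truncated bound.
  public String getMinimum() {
    return isLowerBoundSet ? null : minimum;
  }

  // Lower bound: always returns whatever was recorded, truncated or not.
  public String getLowerBound() {
    return minimum;
  }

  // Hypothetical update helper for the sketch.
  public void record(String value, boolean truncated) {
    minimum = value;
    isLowerBoundSet = truncated;
  }
}
```

The same pattern applies symmetrically to `getMaximum`/`getUpperBound`.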
[GitHub] orc pull request #292: ORC-203 - Update StringStatistics to trim long string...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/292#discussion_r205181230 --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java --- @@ -543,35 +549,87 @@ public void reset() { super.reset(); minimum = null; maximum = null; + isLowerBoundSet = false; + isUpperBoundSet = false; sum = 0; } @Override public void updateString(Text value) { if (minimum == null) { -maximum = minimum = new Text(value); +if(value.toString().length() > MAX_STRING_LENGTH_RECORDED) { + minimum = new Text( --- End diff -- We can even use the structure of UTF-8 to do truncateLowerBound without forming a string. In particular you want to search backward from offset 1024 to find the first byte of a character (0xxxxxxx or 11xxxxxx) and truncate just before it. ---
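A byte-level truncateLowerBound along the lines suggested above can be sketched without ever decoding the bytes into a String: walk backward from the 1024-byte limit past UTF-8 continuation bytes (10xxxxxx) so the cut lands just before the first byte of a character (0xxxxxxx or 11xxxxxx). This is a sketch assuming valid UTF-8 input, not the code that was merged:

```java
public class TruncateSketch {
  static final int MAX_BYTES_RECORDED = 1024;

  // Truncate to at most MAX_BYTES_RECORDED bytes on a whole-character
  // boundary. ASCII input always truncates at exactly 1024 bytes;
  // multi-byte characters may pull the cut back a few bytes.
  static byte[] truncateLowerBound(byte[] utf8) {
    if (utf8.length <= MAX_BYTES_RECORDED) {
      return utf8;
    }
    int cut = MAX_BYTES_RECORDED;
    // A continuation byte has the bit pattern 10xxxxxx, so (b & 0xC0) == 0x80.
    // Back up until utf8[cut] starts a character, then keep bytes [0, cut).
    while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
      cut -= 1;
    }
    byte[] result = new byte[cut];
    System.arraycopy(utf8, 0, result, 0, cut);
    return result;
  }
}
```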
[GitHub] orc pull request #291: ORC-385. Change RecordReader to extend Closeable.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/291 ORC-385. Change RecordReader to extend Closeable. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-385 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/291.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #291 commit 903a6d6049f918566938f9621fcfadb5f146a2ac Author: Owen O'Malley Date: 2018-07-13T21:55:45Z ORC-385. Change RecordReader to extend Closeable. Signed-off-by: Owen O'Malley ---
[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/278#discussion_r202473870 --- Diff: java/core/src/java/org/apache/orc/impl/InStream.java --- @@ -401,84 +596,149 @@ private String rangeString() { @Override public String toString() { return "compressed stream " + name + " position: " + currentOffset + - " length: " + length + " range: " + currentRange + - " offset: " + (compressed == null ? 0 : compressed.position()) + " limit: " + (compressed == null ? 0 : compressed.limit()) + + " length: " + length + " range: " + getRangeNumber(bytes, currentRange) + + " offset: " + (compressed == null ? 0 : compressed.position()) + + " limit: " + (compressed == null ? 0 : compressed.limit()) + rangeString() + (uncompressed == null ? "" : " uncompressed: " + uncompressed.position() + " to " + uncompressed.limit()); } } + private static class EncryptedCompressedStream extends CompressedStream { +private final EncryptionState encrypt; + +public EncryptedCompressedStream(String name, + DiskRangeList input, + long length, + StreamOptions options) { + super(name, length, options); + encrypt = new EncryptionState(name, options); + reset(input, length); +} + +@Override +protected void setCurrent(DiskRangeList newRange, boolean isJump) { + currentRange = newRange; + if (newRange != null) { +if (isJump) { + encrypt.changeIv(newRange.getOffset()); +} +compressed = encrypt.decrypt(newRange); +compressed.position((int) (currentOffset - newRange.getOffset())); + } +} + +@Override +public void close() { + super.close(); + encrypt.close(); +} + +@Override +public String toString() { + return "encrypted " + super.toString(); +} + } + public abstract void seek(PositionProvider index) throws IOException; + public static class StreamOptions implements Cloneable { +private CompressionCodec codec; +private int bufferSize; +private EncryptionAlgorithm algorithm; +private Key key; +private byte[] iv; + +public StreamOptions withCodec(CompressionCodec 
value) { + this.codec = value; + return this; +} + +public StreamOptions withBufferSize(int value) { + bufferSize = value; + return this; +} + +public StreamOptions withEncryption(EncryptionAlgorithm algorithm, +Key key, +byte[] iv) { + this.algorithm = algorithm; + this.key = key; + this.iv = iv; + return this; +} + +public CompressionCodec getCodec() { + return codec; +} + +@Override +public StreamOptions clone() { + try { +StreamOptions clone = (StreamOptions) super.clone(); +if (clone.codec != null) { + // Make sure we don't share the same codec between two readers. + clone.codec = OrcCodecPool.getCodec(codec.getKind()); +} +return clone; --- End diff -- *sigh* It is Cloneable, because it is used as a field in DefaultDataReader. The clone is correct, I believe. We really need a better solution for CompressionCodec than the current Cloneable and CompressionCodecPool mess. We need to move the compression parameters out of Codec; then the codecs are completely stateless again and can be trivially pooled. ---
[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/278#discussion_r202471794 --- Diff: java/core/src/java/org/apache/orc/impl/writer/StreamOptions.java --- @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.orc.impl.writer; + +import org.apache.orc.CompressionCodec; +import org.apache.orc.EncryptionAlgorithm; + +import java.security.Key; + +/** + * The compression and encryption options for writing a stream. + */ +public class StreamOptions { --- End diff -- It is org.apache.orc.impl.InStream.StreamOptions and org.apache.orc.impl.writer.StreamOptions. I guess they should be named the same way, but they are separate classes. ---
[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/278#discussion_r202470866 --- Diff: java/core/src/java/org/apache/orc/impl/InStream.java --- @@ -173,33 +203,208 @@ private static ByteBuffer allocateBuffer(int size, boolean isDirect) { } } + /** + * Manage the state of the decryption, including the ability to seek. + */ + static class EncryptionState { +private final String name; +private final EncryptionAlgorithm algorithm; +private final Key key; +private final byte[] iv; +private final Cipher cipher; +private ByteBuffer decrypted; + +EncryptionState(String name, StreamOptions options) { + this.name = name; + algorithm = options.algorithm; + key = options.key; + iv = options.iv; + cipher = algorithm.createCipher(); +} + +/** + * We are seeking to a new range, so update the cipher to change the IV + * to match. This code assumes that we only support encryption in CTR mode. + * @param offset where we are seeking to in the stream + */ +void changeIv(long offset) { + int blockSize = cipher.getBlockSize(); + long encryptionBlocks = offset / blockSize; --- End diff -- No, we always start decryption on a block boundary and throw away the leading bytes. See InStream.EncryptionState.changeIv(long offset) and the "extra" and "wasted" variables. ---
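The seek strategy described above, always restarting decryption on a cipher-block boundary and discarding the leading bytes of that block, reduces to simple arithmetic plus an IV increment. The helper names below are hypothetical (the patch keeps this logic inside InStream.EncryptionState.changeIv with its "extra" and "wasted" variables), but the CTR counter math is standard:

```java
// A sketch of seeking in an AES/CTR-encrypted stream.
public class CtrSeekSketch {
  static final int BLOCK_SIZE = 16; // AES block size in bytes

  // Which cipher block a stream offset falls in.
  static long blockOf(long offset) {
    return offset / BLOCK_SIZE;
  }

  // How many leading bytes of that block must be decrypted and thrown
  // away before the caller's data starts.
  static int wastedBytes(long offset) {
    return (int) (offset % BLOCK_SIZE);
  }

  // Add the block counter into the low-order bytes of the IV (big-endian
  // counter, carrying into higher bytes), so the cipher resumes at the
  // right CTR block after a seek.
  static void addBlockToIv(byte[] iv, long blocks) {
    for (int i = iv.length - 1; i >= 0 && blocks != 0; --i) {
      long sum = (iv[i] & 0xff) + (blocks & 0xff);
      iv[i] = (byte) sum;
      blocks = (blocks >>> 8) + (sum >>> 8);
    }
  }
}
```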
[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/278#discussion_r202433617 --- Diff: java/core/src/java/org/apache/orc/impl/CryptoUtils.java --- @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.orc.impl; + +import org.apache.orc.EncryptionAlgorithm; +import java.security.SecureRandom; + +/** + * This class has routines to work with encryption within ORC files. + */ +public class CryptoUtils { + + private static final int COLUMN_ID_LENGTH = 3; + private static final int KIND_LENGTH = 2; + private static final int STRIPE_ID_LENGTH = 3; + private static final int MIN_COUNT_BYTES = 8; + + private static final SecureRandom random = new SecureRandom(); --- End diff -- oops. Removed. ---
[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/278#discussion_r202433445 --- Diff: java/core/src/java/org/apache/orc/impl/InStream.java --- @@ -55,112 +60,137 @@ public long getStreamLength() { @Override public abstract void close(); + static int getRangeNumber(DiskRangeList list, DiskRangeList current) { +int result = 0; +DiskRangeList range = list; +while (range != null && range != current) { + result += 1; + range = range.next; +} +return result; + } + + /** + * Implements a stream over an uncompressed stream. + */ public static class UncompressedStream extends InStream { -private List bytes; +private DiskRangeList bytes; private long length; protected long currentOffset; -private ByteBuffer range; -private int currentRange; +protected ByteBuffer decrypted; +protected DiskRangeList currentRange; + +/** + * Create the stream without calling reset on it. + * This is used for the subclass that needs to do more setup. + * @param name name of the stream + * @param length the number of bytes for the stream + */ +public UncompressedStream(String name, long length) { + super(name, length); +} -public UncompressedStream(String name, List input, long length) { +public UncompressedStream(String name, + DiskRangeList input, + long length) { super(name, length); reset(input, length); } -protected void reset(List input, long length) { +protected void reset(DiskRangeList input, long length) { this.bytes = input; this.length = length; - currentRange = 0; - currentOffset = 0; - range = null; + currentOffset = input == null ? 
0 : input.getOffset(); + setCurrent(input, true); } @Override public int read() { - if (range == null || range.remaining() == 0) { + if (decrypted == null || decrypted.remaining() == 0) { if (currentOffset == length) { return -1; } -seek(currentOffset); +setCurrent(currentRange.next, false); } currentOffset += 1; - return 0xff & range.get(); + return 0xff & decrypted.get(); +} + +protected void setCurrent(DiskRangeList newRange, boolean isJump) { + currentRange = newRange; + if (newRange != null) { +decrypted = newRange.getData().slice(); +decrypted.position((int) (currentOffset - newRange.getOffset())); + } } @Override public int read(byte[] data, int offset, int length) { - if (range == null || range.remaining() == 0) { + if (decrypted == null || decrypted.remaining() == 0) { if (currentOffset == this.length) { return -1; } -seek(currentOffset); +setCurrent(currentRange.next, false); } - int actualLength = Math.min(length, range.remaining()); - range.get(data, offset, actualLength); + int actualLength = Math.min(length, decrypted.remaining()); + decrypted.get(data, offset, actualLength); currentOffset += actualLength; return actualLength; } @Override public int available() { - if (range != null && range.remaining() > 0) { -return range.remaining(); + if (decrypted != null && decrypted.remaining() > 0) { +return decrypted.remaining(); } return (int) (length - currentOffset); } @Override public void close() { - currentRange = bytes.size(); + currentRange = null; currentOffset = length; // explicit de-ref of bytes[] - bytes.clear(); + decrypted = null; + bytes = null; } @Override public void seek(PositionProvider index) throws IOException { seek(index.getNext()); } -public void seek(long desired) { - if (desired == 0 && bytes.isEmpty()) { +public void seek(long desired) throws IOException { + if (desired == 0 && bytes == null) { return; } - int i = 0; - for (DiskRange curRange : bytes) { -if (curRange.getOffset() <= desired && -(desired - curRange.getOffset()) < 
curRange.getLength()) { - currentOffset = desired; - curren
[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/278#discussion_r202431172 --- Diff: java/core/src/java/org/apache/orc/impl/InStream.java --- @@ -55,112 +60,137 @@ public long getStreamLength() { @Override public abstract void close(); + static int getRangeNumber(DiskRangeList list, DiskRangeList current) { +int result = 0; +DiskRangeList range = list; +while (range != null && range != current) { + result += 1; + range = range.next; +} +return result; + } + + /** + * Implements a stream over an uncompressed stream. + */ public static class UncompressedStream extends InStream { -private List bytes; +private DiskRangeList bytes; private long length; protected long currentOffset; -private ByteBuffer range; -private int currentRange; +protected ByteBuffer decrypted; +protected DiskRangeList currentRange; + +/** + * Create the stream without calling reset on it. + * This is used for the subclass that needs to do more setup. + * @param name name of the stream + * @param length the number of bytes for the stream + */ +public UncompressedStream(String name, long length) { + super(name, length); +} -public UncompressedStream(String name, List input, long length) { +public UncompressedStream(String name, + DiskRangeList input, + long length) { super(name, length); reset(input, length); } -protected void reset(List input, long length) { +protected void reset(DiskRangeList input, long length) { this.bytes = input; this.length = length; - currentRange = 0; - currentOffset = 0; - range = null; + currentOffset = input == null ? 
0 : input.getOffset(); + setCurrent(input, true); } @Override public int read() { - if (range == null || range.remaining() == 0) { + if (decrypted == null || decrypted.remaining() == 0) { if (currentOffset == length) { return -1; } -seek(currentOffset); +setCurrent(currentRange.next, false); } currentOffset += 1; - return 0xff & range.get(); + return 0xff & decrypted.get(); +} + +protected void setCurrent(DiskRangeList newRange, boolean isJump) { + currentRange = newRange; + if (newRange != null) { +decrypted = newRange.getData().slice(); +decrypted.position((int) (currentOffset - newRange.getOffset())); --- End diff -- But I'll add a comment to the code. ---
[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/278#discussion_r202430812 --- Diff: java/core/src/java/org/apache/orc/impl/InStream.java --- @@ -55,112 +60,137 @@ public long getStreamLength() { @Override public abstract void close(); + static int getRangeNumber(DiskRangeList list, DiskRangeList current) { +int result = 0; +DiskRangeList range = list; +while (range != null && range != current) { + result += 1; + range = range.next; +} +return result; + } + + /** + * Implements a stream over an uncompressed stream. + */ public static class UncompressedStream extends InStream { -private List bytes; +private DiskRangeList bytes; private long length; protected long currentOffset; -private ByteBuffer range; -private int currentRange; +protected ByteBuffer decrypted; +protected DiskRangeList currentRange; + +/** + * Create the stream without calling reset on it. + * This is used for the subclass that needs to do more setup. + * @param name name of the stream + * @param length the number of bytes for the stream + */ +public UncompressedStream(String name, long length) { + super(name, length); +} -public UncompressedStream(String name, List input, long length) { +public UncompressedStream(String name, + DiskRangeList input, + long length) { super(name, length); reset(input, length); } -protected void reset(List input, long length) { +protected void reset(DiskRangeList input, long length) { this.bytes = input; this.length = length; - currentRange = 0; - currentOffset = 0; - range = null; + currentOffset = input == null ? 
0 : input.getOffset(); + setCurrent(input, true); } @Override public int read() { - if (range == null || range.remaining() == 0) { + if (decrypted == null || decrypted.remaining() == 0) { if (currentOffset == length) { return -1; } -seek(currentOffset); +setCurrent(currentRange.next, false); } currentOffset += 1; - return 0xff & range.get(); + return 0xff & decrypted.get(); +} + +protected void setCurrent(DiskRangeList newRange, boolean isJump) { + currentRange = newRange; + if (newRange != null) { +decrypted = newRange.getData().slice(); +decrypted.position((int) (currentOffset - newRange.getOffset())); --- End diff -- No, it is moving the ByteBuffer position to the correct spot so that it matches currentOffset, which is relative to the stream. ---
[GitHub] orc issue #290: ORC-386 Add spark benchmarks.
Github user omalley commented on the issue: https://github.com/apache/orc/pull/290 @dongjoon-hyun if you have feedback on these benchmarks, that would be great. ---
[GitHub] orc pull request #290: ORC-386 Add spark benchmarks.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/290 ORC-386 Add spark benchmarks. I also refactored all of the old benchmarks to reduce the common code. I also split it into three modules so that I could separate the common code, the code that depends on hive, and the code that depends on spark. Avoiding building an uber jar that has both hive and spark made life much easier. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-386 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/290.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #290 commit 1300f0ce20359e837d5e75bc49c4acc3ee5c7221 Author: Owen O'Malley Date: 2018-06-09T21:37:31Z ORC-386 Add spark benchmarks. I also refactored all of the old benchmarks to reduce the common code. I also split it into three modules so that I could separate the common code, the code that depends on hive, and the code that depends on spark. Avoiding building an uber jar that has both hive and spark made life much easier. ---
[GitHub] orc issue #281: ORC-375: Update libhdfs to compile with gcc7.
Github user omalley commented on the issue: https://github.com/apache/orc/pull/281 I accidentally pushed the temporary commit to branch-1.5. I've done a force push to undo the change. ---
[GitHub] orc pull request #281: ORC-375: Update libhdfs to compile with gcc7.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/281 ORC-375: Update libhdfs to compile with gcc7. This changes the code in the libhdfs tarball, but we need this change applied upstream. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-375 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/281.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #281 commit 497f8698f53148ed2cd7e7753a3d9e31e4e13389 Author: Owen O'Malley Date: 2018-06-06T05:17:16Z ORC-375: Fix libhdfs on gcc7 by adding #include two places. ---
[GitHub] orc issue #280: ORC-376: Add Ubuntu18 docker file
Github user omalley commented on the issue: https://github.com/apache/orc/pull/280 I'll also comment that the default openjdk is 10, which does not work. I've set the docker file to use openjdk 8 instead, which does work. With openjdk 10, Maven crashes with a NullPointerException while trying to compile ORC. ---
[GitHub] orc pull request #280: ORC-376: Add Ubuntu18 docker file
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/280 ORC-376: Add Ubuntu18 docker file We need to be able to test gcc 7 and ubuntu 18.04. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-376 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/280.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #280 commit 82110ce6408fe764d95a7dcaf0145f108660afe2 Author: Owen O'Malley Date: 2018-06-06T14:18:42Z ORC-376: Add Ubuntu18 docker file ---
[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/277#discussion_r193106126 --- Diff: c++/test/CMakeLists.txt --- @@ -64,7 +64,12 @@ target_link_libraries (create-test-files protobuf ) -add_test (NAME orc-test COMMAND orc-test) +if (TEST_VALGRIND_MEMCHECK) + add_test (orc-test --- End diff -- You have trailing spaces here. ---
[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/277#discussion_r19313 --- Diff: c++/src/Compression.cc --- @@ -199,13 +199,16 @@ namespace orc { uint64_t blockSize, MemoryPool& pool); +~ZlibCompressionStream() { end(); } --- End diff -- You need an override on this or the mac complains. ---
[GitHub] orc issue #279: ORC-373: Option to disable dictionary encoding
Github user omalley commented on the issue: https://github.com/apache/orc/pull/279 Prasanth, rather than adding the flush count, which is really test code in the main source, how about creating a StringTreeWriter and capturing the stream directly? I'd propose this - https://github.com/omalley/orc/blob/8a782d204f071e2bd16bcfd591a54319fadff132/java/core/src/test/org/apache/orc/TestStringDictionary.java#L251 ---
[GitHub] orc issue #276: ORC-373: Option to disable dictionary encoding
Github user omalley commented on the issue: https://github.com/apache/orc/pull/276 @prasanthj yeah, in general you should make a dev branch for the pull request (e.g. prasanthj:orc-373). Then you can do a force-push to the branch to update the PR. ---
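The suggested workflow, a topic branch per pull request that is updated with a force-push, looks roughly like this; the branch, remote, and commit message are illustrative:

```shell
# Create a topic branch named for the JIRA issue and push it to your fork.
git checkout -b orc-373
git commit -am "ORC-373: Option to disable dictionary encoding"
git push origin orc-373          # then open the PR from prasanthj:orc-373

# After addressing review comments, rewrite the branch and update the PR.
git commit --amend
git push --force origin orc-373  # the open PR picks up the rewritten branch
```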
[GitHub] orc pull request #278: ORC-251: Extend InStream and OutStream to support enc...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/278 ORC-251: Extend InStream and OutStream to support encryption. This patch: * Adds a method to Codec to get the CompressionKind. * Creates StreamOptions for both InStream and OutStream to gather together the parameters they need. * Extends InStream and OutStream to handle encryption. * Changes InStream to use DiskRangeList instead of List. * Creates CryptoUtils with a method to create an IV based on the stream name. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-251 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/278.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #278 commit ff620c3faccbabc7e011e245dfb3bcdbfef41b7a Author: Owen O'Malley Date: 2018-05-09T16:36:28Z ORC-251: Extend InStream and OutStream to support encryption. ---
[GitHub] orc pull request #272: ORC-367: Fix incorrect reads of boolean columns after...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/272 ORC-367: Fix incorrect reads of boolean columns after a seek. Fix #272 Signed-off-by: Owen O'Malley <omal...@apache.org> You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-367 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/272.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #272 commit b0504f55e4a68de0c73b89ab5bccb2a673efe9b4 Author: Owen O'Malley <omalley@...> Date: 2018-05-22T21:43:06Z ORC-367: Fix incorrect reads of boolean columns after a seek. Fix #272 Signed-off-by: Owen O'Malley <omal...@apache.org> ---
[GitHub] orc issue #268: ORC-363 Enable zstd decompression in ORC Java reader
Github user omalley commented on the issue: https://github.com/apache/orc/pull/268 I'd rather hold off on this until we can do an end-to-end test. ---
[GitHub] orc pull request #269: ORC-116. Add examples of reading and writing ORC file...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/269 ORC-116. Add examples of reading and writing ORC files from Java. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-116 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/269.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #269 commit e7231cd197f220877a5086b2bd516fc367081b24 Author: Owen O'Malley <omalley@...> Date: 2016-12-06T17:34:55Z ORC-116. Add examples of reading and writing ORC files from Java. Signed-off-by: Owen O'Malley <omal...@apache.org> ---
[GitHub] orc pull request #267: ORC-364. Minor javadoc fixup.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/267 ORC-364. Minor javadoc fixup. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-364 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/267.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #267 commit a8565edad13700d597c239faab1992d2b6fa1303 Author: Owen O'Malley <omalley@...> Date: 2018-05-07T18:08:11Z Preparing for ORC 1.5.0 release. Signed-off-by: Owen O'Malley <omal...@apache.org> commit 249781a518faa3afc82dd84821ddb227d0badabe Author: Owen O'Malley <omalley@...> Date: 2018-05-14T21:00:59Z ORC-364. Minor fixups on javadoc. Signed-off-by: Owen O'Malley <omal...@apache.org> ---
[GitHub] orc pull request #265: ORC-334: [C++] Add AppVeyor support for integration o...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/265#discussion_r186474701 --- Diff: c++/src/Timezone.cc --- @@ -710,7 +710,11 @@ namespace orc { * Get the local timezone. */ const Timezone& getLocalTimezone() { +#ifdef _MSC_VER +return getTimezoneByName("UTC"); --- End diff -- Because there isn't a direct Windows equivalent. Here is what Java writes about it: https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/time-zone002.html There is a registry setting, but it uses a non-standard set of names *doh*. The easiest way would be to create a Java jar and run that to determine the local timezone. ---
[GitHub] orc pull request #266: ORC-360. Add Java error checking on the types when re...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/266 ORC-360. Add Java error checking on the types when reading. This patch adds a check on the types in the footer to ensure that the type is properly formed before we start building the tree. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-360 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/266.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #266 commit f236cf727cc295e93ae1e2347be60562fff17fa8 Author: Owen O'Malley <omalley@...> Date: 2018-05-07T16:19:17Z ORC-360. Add Java error checking on the types when reading. ---
[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/249#discussion_r186137062 --- Diff: java/core/src/java/org/apache/orc/impl/writer/WriterImplV2.java --- @@ -373,7 +379,11 @@ private void flushStripe() throws IOException { OrcProto.StripeFooter.Builder builder = OrcProto.StripeFooter.newBuilder(); if (writeTimeZone) { -builder.setWriterTimezone(TimeZone.getDefault().getID()); +if (useUTCTimeZone) { + builder.setWriterTimezone(TimeZone.getTimeZone("UTC").getID()); --- End diff -- I'd be tempted to just use setWriterTimezone("UTC"), because we'll already fail if UTC is called something else. ---
[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/249#discussion_r186133762 --- Diff: java/core/src/java/org/apache/orc/OrcFile.java --- @@ -320,6 +321,16 @@ public ReaderOptions fileMetadata(final FileMetadata metadata) { public FileMetadata getFileMetadata() { return fileMetadata; } + +public ReaderOptions useUTCTimestamp(boolean value) { + useUTCTimestamp = value; + return this; +} + +public boolean isUseUTCTimestamp() { --- End diff -- This should be getUseUTCTimestamp. ---
[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/249#discussion_r186134015 --- Diff: java/core/src/java/org/apache/orc/OrcFile.java --- @@ -320,6 +321,16 @@ public ReaderOptions fileMetadata(final FileMetadata metadata) { public FileMetadata getFileMetadata() { return fileMetadata; } + +public ReaderOptions useUTCTimestamp(boolean value) { --- End diff -- This should also cause the TimestampStatistics to use UTC. ---
[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/249#discussion_r186134142 --- Diff: java/core/src/java/org/apache/orc/OrcFile.java --- @@ -761,6 +782,10 @@ public boolean getWriteVariableLengthBlocks() { public HadoopShims getHadoopShims() { return shims; } + +public boolean isUseUTCTimestamp() { --- End diff -- Rename this to getUseUTCTimestamp. ---
[GitHub] orc pull request #249: [ORC-341] Support time zone as a parameter for Java r...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/249#discussion_r186140770 --- Diff: java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java --- @@ -54,9 +57,20 @@ public TimestampTreeWriter(int columnId, if (rowIndexPosition != null) { recordPosition(rowIndexPosition); } -this.localTimezone = TimeZone.getDefault(); -// for unit tests to set different time zones -this.baseEpochSecsLocalTz = Timestamp.valueOf(BASE_TIMESTAMP_STRING).getTime() / MILLIS_PER_SECOND; +if (writer.isUseUTCTimestamp()) { + this.localTimezone = TimeZone.getTimeZone("UTC"); +} else { + this.localTimezone = TimeZone.getDefault(); +} +this.localDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); --- End diff -- It sucks that there isn't a simpler way in Java to do this. ---
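The point being lamented above can be sketched as follows. This is an illustrative snippet, not the actual TimestampTreeWriter code: the helper name `baseEpochSeconds` is hypothetical, and the "2015-01-01 00:00:00" base string is just an example value. It shows why a SimpleDateFormat with an explicit TimeZone is needed to make the base-epoch computation zone-dependent.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class BaseEpoch {
    // Parse a wall-clock timestamp string in the given zone and return epoch seconds.
    // Wrapping ParseException keeps callers simple for this sketch.
    static long baseEpochSeconds(String base, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(zone);
        try {
            return fmt.parse(base).getTime() / 1000L;
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        long utc = baseEpochSeconds("2015-01-01 00:00:00", TimeZone.getTimeZone("UTC"));
        long pst = baseEpochSeconds("2015-01-01 00:00:00",
                                    TimeZone.getTimeZone("America/Los_Angeles"));
        // The same wall-clock string maps to different instants in different zones:
        // midnight in Los Angeles (UTC-8 in January) is 8 hours after midnight UTC.
        System.out.println(pst - utc); // 28800
    }
}
```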
[GitHub] orc pull request #265: ORC-334: [C++] Add AppVeyor support for integration o...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/265#discussion_r186114158 --- Diff: c++/test/CreateTestFiles.cc --- @@ -31,8 +31,8 @@ void writeCustomOrcFile(const std::string& filename, const orc::proto::Metadata& metadata, const orc::proto::Footer& footer, -const std::vector& version, -uint writerVersion) { +const std::vector& version, --- End diff -- We generally prefer to use explicitly sized types, so uint64_t or uint32_t would be better. ---
[GitHub] orc issue #252: ORC-345 Create and use a new Decimal64StatisticsImpl on the ...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/252 I should comment that, while adding the tests, it became clear I didn't need the sign change logic, because (10**18 - 1) * 2 < 2**63. So by enforcing the 10**18 limit, I didn't need the more complicated sign change detection. ---
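The arithmetic claim above — that doubling the largest 18-digit value still fits in a signed 64-bit long — is easy to verify directly. This is just a verification sketch, not ORC code:

```java
import java.math.BigInteger;

public class ShortDecimalBound {
    // Check that 2 * (10^18 - 1) < 2^63, i.e. the sum of any two short
    // decimals (precision <= 18) cannot wrap past the sign bit of a long.
    static boolean doubledMaxShortFitsInLong() {
        BigInteger maxShort = BigInteger.TEN.pow(18).subtract(BigInteger.ONE);
        BigInteger doubled = maxShort.multiply(BigInteger.valueOf(2));
        // Long.MAX_VALUE is 2^63 - 1; 1999999999999999998 is well below it.
        return doubled.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) < 0;
    }

    public static void main(String[] args) {
        System.out.println(doubledMaxShortFitsInLong()); // true
    }
}
```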
[GitHub] orc pull request #252: ORC-345 Create and use a new Decimal64StatisticsImpl ...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/252#discussion_r185846011 --- Diff: java/bench/src/java/org/apache/orc/bench/DecimalBench.java --- @@ -0,0 +1,270 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.orc.bench; + +import com.google.gson.JsonStreamParser; +import org.apache.avro.file.DataFileReader; +import org.apache.avro.generic.GenericDatumReader; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.io.DatumReader; +import org.apache.avro.mapred.FsInput; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.TrackingLocalFileSystem; +import org.apache.hadoop.hive.ql.exec.vector.ColumnVector; +import org.apache.hadoop.hive.ql.exec.vector.Decimal64ColumnVector; +import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector; +import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; +import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport; +import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.Reporter; +import org.apache.orc.CompressionKind; +import org.apache.orc.OrcFile; +import org.apache.orc.Reader; +import org.apache.orc.RecordReader; +import org.apache.orc.TypeDescription; +import org.apache.orc.Writer; +import org.apache.orc.bench.convert.BatchReader; +import org.apache.orc.bench.convert.GenerateVariants; +import org.apache.orc.bench.convert.csv.CsvReader; +import org.apache.parquet.hadoop.ParquetInputFormat; +import org.openjdk.jmh.annotations.AuxCounters; +import org.openjdk.jmh.annotations.Benchmark; +import org.openjdk.jmh.annotations.BenchmarkMode; +import org.openjdk.jmh.annotations.Fork; +import org.openjdk.jmh.annotations.Level; +import org.openjdk.jmh.annotations.Measurement; +import org.openjdk.jmh.annotations.Mode; +import org.openjdk.jmh.annotations.OutputTimeUnit; +import org.openjdk.jmh.annotations.Param; +import org.openjdk.jmh.annotations.Scope; 
+import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; +import org.openjdk.jmh.annotations.TearDown; +import org.openjdk.jmh.annotations.Warmup; +import org.openjdk.jmh.infra.Blackhole; +import org.openjdk.jmh.runner.Runner; +import org.openjdk.jmh.runner.options.OptionsBuilder; + +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.net.URI; +import java.nio.charset.StandardCharsets; +import java.util.concurrent.TimeUnit; + +@BenchmarkMode(Mode.AverageTime) +@Warmup(iterations=2, time=30, timeUnit = TimeUnit.SECONDS) +@Measurement(iterations=10, time=30, timeUnit = TimeUnit.SECONDS) +@State(Scope.Thread) +@OutputTimeUnit(TimeUnit.MICROSECONDS) +@Fork(2) +public class DecimalBench { + + private static final String ROOT_ENVIRONMENT_NAME = "bench.root.dir"; + private static final Path root; + static { +String value = System.getProperty(ROOT_ENVIRONMENT_NAME); +root = value == null ? null : new Path(value); + } + + /** + * Abstract out whether we are writing short or long decimals + */ + interface Loader { +/** + * Load the data from the values array into the ColumnVector. + * @param vector the output + * @param values the intput + * @param offset the first input value + * @param length the number of values to copy + */ +void loadData(ColumnVector vector, long[] values, int offset, int length); + } + + static class Decimal64Loader implements Loa
[GitHub] orc pull request #252: ORC-345 Create and use a new Decimal64StatisticsImpl ...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/252#discussion_r185845484 --- Diff: java/bench/src/java/org/apache/orc/bench/DecimalBench.java --- (same new DecimalBench.java file as quoted above) ---
[GitHub] orc pull request #252: ORC-345 Create and use a new Decimal64StatisticsImpl ...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/252#discussion_r185845008 --- Diff: java/bench/src/java/org/apache/orc/bench/DecimalBench.java --- (same new DecimalBench.java file as quoted above) ---
[GitHub] orc issue #252: ORC-345 Create and use a new Decimal64StatisticsImpl on the ...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/252 Ok, with this and ORC-344, the difference in writing small vs large decimals to a null file system is huge:

Benchmark           (precision)  Mode  Cnt       Score      Error  Units
DecimalBench.write            8  avgt   30   36914.791 ±  866.408  us/op
DecimalBench.write           19  avgt   30  240789.318 ± 5146.880  us/op

So I'm seeing a 6x speed up when using precision = 8 with the new code path instead of p = 19 and the old code path. ---
[GitHub] orc issue #252: ORC-345 Create and use a new Decimal64StatisticsImpl on the ...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/252 Ok, I updated this pull request: * I changed updateDecimal64 to take a value and scale. * I check for more overflows in Decimal64ColumnStatisticsImpl. * I added a test that tests the new code. ---
[GitHub] orc pull request #262: ORC-354. Restore the benchmark module.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/262 ORC-354. Restore the benchmark module. This reverts commit b86d70aa73289b86e066cc019ea11e0d83c1e40d. Signed-off-by: Owen O'Malley <omal...@apache.org> You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-354 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/262.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #262 commit a97d6ca526be64da507ffa0cdb3324a95ec1c896 Author: Owen O'Malley <omalley@...> Date: 2018-04-27T23:58:48Z ORC-354. Restore the benchmark module. This reverts commit b86d70aa73289b86e066cc019ea11e0d83c1e40d. Signed-off-by: Owen O'Malley <omal...@apache.org> ---
[GitHub] orc issue #257: ORC-49. Create RLE-based encoding for short decimals in ORCv...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/257 Part of what we absolutely need in RLEv3 is zero suppression for the whole row batch, so that if you have integers like:
- 1000
- 2000
- 3000
- 10000
- 4000
- 1000
it will record 3 for the number of zeros and encode:
- 1
- 2
- 3
- 10
- 4
- 1
---
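A minimal sketch of the batch-level zero suppression described above, assuming the encoder strips the largest power of ten shared by the whole batch. The class and method names are hypothetical and not part of any ORC RLE implementation:

```java
import java.util.Arrays;

public class ZeroSuppress {
    // Count how many trailing decimal zeros are common to every value in the batch.
    static int commonTrailingZeros(long[] values) {
        int zeros = Integer.MAX_VALUE;
        for (long v : values) {
            if (v == 0) continue;          // zero is divisible by any power of ten
            int z = 0;
            while (v % 10 == 0) { v /= 10; z++; }
            zeros = Math.min(zeros, z);
        }
        return zeros == Integer.MAX_VALUE ? 0 : zeros;
    }

    // Divide every value by 10^zeros; the decoder would multiply back.
    static long[] strip(long[] values, int zeros) {
        long divisor = 1;
        for (int i = 0; i < zeros; i++) divisor *= 10;
        long[] out = new long[values.length];
        for (int i = 0; i < values.length; i++) out[i] = values[i] / divisor;
        return out;
    }

    public static void main(String[] args) {
        long[] batch = {1000, 2000, 3000, 10000, 4000, 1000};
        int zeros = commonTrailingZeros(batch);  // 3
        long[] encoded = strip(batch, zeros);    // {1, 2, 3, 10, 4, 1}
        System.out.println(zeros + " " + Arrays.toString(encoded));
    }
}
```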
[GitHub] orc issue #253: ORC-346: [C++] Add one second when writing negative Timestam...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/253 *Sigh* This is caused by the fact that Java defined Timestamp badly and then Hive copied those semantics into the ColumnVector. We really need to fix these semantics in ORCv2. I'll also comment that it would probably be nicer to our users if we weren't modifying the data in the columnVector they passed into us, but clearly we are already doing that. ---
[GitHub] orc pull request #257: ORC-49. Create RLE-based encoding for short decimals ...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/257 ORC-49. Create RLE-based encoding for short decimals in ORCv2. Create the first pass of the new decimal encoding for ORCv2. For small decimals, it uses RLEv2 and the old format for long decimals. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-49 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/257.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #257 commit 0f50ad4a22e19ee06a97b6289a01dbc5671268c5 Author: Owen O'Malley <omalley@...> Date: 2018-04-27T05:13:26Z ORC-49. Create RLE-based encoding for short decimals in ORCv2. ---
[GitHub] orc pull request #252: ORC-345 Create and use a new Decimal64StatisticsImpl ...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/252 ORC-345 Create and use a new Decimal64StatisticsImpl on the write path Currently the DecimalTreeWriter spends a lot of time maintaining the statistics for decimal objects. For short decimals (p <= 18) we can use longs to track the min, max, and sum. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-345 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/252.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #252 commit 12c37d717010e5b683074633b273b0bad0180e05 Author: Owen O'Malley <omalley@...> Date: 2018-04-18T03:16:12Z ORC-344. Support the new Decimal64ColumnVector. Fixes #250 Signed-off-by: Owen O'Malley <omal...@apache.org> commit e7c395ed2e5a68e480e31c4ba6d74237e5025a33 Author: Owen O'Malley <omalley@...> Date: 2018-04-18T21:30:35Z ORC-345. Create Decimal64StatisticsImpl. ---
[GitHub] orc pull request #:
Github user omalley commented on the pull request: https://github.com/apache/orc/commit/240931203c6a1c7e43c010b3bfa8f6953cc92f1e#commitcomment-28641687 I did put a link in at the top of https://orc.apache.org/develop/ It felt more like a developer artifact than a user artifact, although adding a link somewhere in the documentation would be good. ---
[GitHub] orc pull request #250: ORC-344. Support the new Decimal64ColumnVector.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/250 ORC-344. Support the new Decimal64ColumnVector. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-344 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/250.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #250 commit a6b71376be6ace6b39bd9e2de63989c9b7b031ac Author: Owen O'Malley <omalley@...> Date: 2018-04-18T03:16:12Z ORC-344. Support the new Decimal64ColumnVector. Signed-off-by: Owen O'Malley <omal...@apache.org> ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/249 Also note that the C++ reader already uses UTC for its TimestampColumnVector. :) ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/249 @jcamachor I'd suggest a much simpler API: - Instead of passing in the reader timezone, make a boolean option to useUtcForTimestamp. - Extend TimestampColumnVector to have a boolean isUTC field. - The TimestampTreeWriter can use the isUTC in the ColumnVector to determine if it is in UTC. - The reader can set isUTC appropriately based on the option. Thoughts? ---
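The suggested API above could look roughly like this. All names here (TimestampVector, TimestampWriter, zoneFor) are hypothetical stand-ins for TimestampColumnVector and the writer options, sketched only to show how a single boolean option plus a per-vector isUTC flag would replace passing an arbitrary reader time zone:

```java
import java.util.TimeZone;

public class UtcOptionSketch {
    // Hypothetical stand-in for TimestampColumnVector with the proposed isUTC flag.
    static class TimestampVector {
        boolean isUTC;
    }

    // Hypothetical writer: the zone comes from one boolean option, not a zone parameter.
    static class TimestampWriter {
        final TimeZone writerZone;

        TimestampWriter(boolean useUtcForTimestamp) {
            writerZone = useUtcForTimestamp
                ? TimeZone.getTimeZone("UTC") : TimeZone.getDefault();
        }

        // The vector's own flag wins, so the data carries its interpretation with it.
        TimeZone zoneFor(TimestampVector vector) {
            return vector.isUTC ? TimeZone.getTimeZone("UTC") : writerZone;
        }
    }

    public static void main(String[] args) {
        TimestampVector v = new TimestampVector();
        v.isUTC = true;
        TimestampWriter w = new TimestampWriter(false);
        System.out.println(w.zoneFor(v).getID()); // UTC
    }
}
```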
[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/247#discussion_r182139793 --- Diff: site/specification/ORCv2.md --- @@ -0,0 +1,1032 @@ +--- +layout: page +title: Evolving Draft for ORC Specification v2 +--- + +This specification is rapidly evolving and should only be used for +developers on the project. + +# TO DO items + +The list of things that we plan to change: + +* Create a decimal representation with fixed scale using rle. +* Create a better float/double encoding that splits mantissa and + exponent. +* Create a dictionary encoding for float, double, and decimal. +* Create RLEv3: + * 64 and 128 bit variants --- End diff -- Just to clarify the answer: "variant" means "one of the instances of something that varies." The unfortunately similar-looking "varint" is an unofficial contraction of "variable-length integer." ---
[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/247#discussion_r182138208 --- Diff: site/specification/ORCv2.md --- @@ -0,0 +1,1032 @@ +--- +layout: page +title: Evolving Draft for ORC Specification v2 +--- + +This specification is rapidly evolving and should only be used for +developers on the project. + +# TO DO items + +The list of things that we plan to change: + +* Create a decimal representation with fixed scale using rle. +* Create a better float/double encoding that splits mantissa and + exponent. +* Create a dictionary encoding for float, double, and decimal. +* Create RLEv3: + * 64 and 128 bit variants --- End diff -- No, I mean that the new RLEv3 needs to support both 64 bit and 128 bit integers. The specification of course could just define the spec using 128 bits, but the Java and C++ implementation would likely be split. ---
[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/247#discussion_r182137684 --- Diff: site/specification/ORCv2.md --- @@ -0,0 +1,1032 @@ +--- +layout: page +title: Evolving Draft for ORC Specification v2 +--- + +This specification is rapidly evolving and should only be used for +developers on the project. + +# TO DO items + +The list of things that we plan to change: + +* Create a decimal representation with fixed scale using rle. +* Create a better float/double encoding that splits mantissa and + exponent. +* Create a dictionary encoding for float, double, and decimal. +* Create RLEv3: --- End diff -- I have some thoughts, but haven't done any work on it. ---
[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/247#discussion_r182137566 --- Diff: site/specification/ORCv2.md --- @@ -0,0 +1,1032 @@ +--- +layout: page +title: Evolving Draft for ORC Specification v2 +--- + +This specification is rapidly evolving and should only be used for +developers on the project. + +# TO DO items --- End diff -- This is absolutely a work in progress. I'd love to see your proposal. ---
[GitHub] orc pull request #248: ORC-337. Move to storage-api 2.5.0.
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/248 ORC-337. Move to storage-api 2.5.0. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-337 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/248.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #248 commit 74f11f8e0c5f44d0cd6a8a5218d151aabd3149c2 Author: Owen O'Malley <omalley@...> Date: 2018-04-13T23:16:32Z ORC-337. Move to storage-api 2.5.0. Signed-off-by: Owen O'Malley <omal...@apache.org> ---
[GitHub] orc issue #247: ORC-339. Reorganize the ORC file format specification.
Github user omalley commented on the issue: https://github.com/apache/orc/pull/247 Any other comments? ---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r181525137 --- Diff: site/_docs/encodings.md --- @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE Decimal was introduced in Hive 0.11 with infinite precision (the total number of digits). In Hive 0.13, the definition was change to limit the precision to a maximum of 38 digits, which conveniently uses 127 -bits plus a sign bit. The current encoding of decimal columns stores -the integer representation of the value as an unbounded length zigzag -encoded base 128 varint. The scale is stored in the SECONDARY stream -as an signed integer. +bits plus a sign bit. + +DIRECT and DIRECT_V2 encodings of decimal columns store the integer +representation of the value as an unbounded length zigzag encoded base +128 varint. The scale is stored in the SECONDARY stream as a signed +integer. + +In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodings are introduced and --- End diff -- In ORCv2, we'll just pick an RLE and not leave it pickable. In terms of the encoding names, I'm a bit torn. My original inclination would be to use DECIMAL64 and DECIMAL128 as encoding names. However, it would be nice to have the ability to use dictionaries, so we'd need dictionary forms of them too. Thoughts? ---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r181456234

--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```

-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+string representation is deprecated and DecimalStatistics uses integers,
+which have better performance.

 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ message Int128 {
--- End diff --

Let's pull the Int128 out of DecimalStatistics. We will likely use it in other places.

One concern with this representation is that -1 is pretty painful. You'll get highBits = -1, lowBits = -1, which will only take 1 byte for highBits, but 9 bytes for lowBits (+ the 4 bytes of field identifiers & message length) = 14 bytes total.

Another alternative is to use the zigzag encoding for the combined 128 bit value:

optional uint64 minLow = 4;
optional uint64 minHigh = 5;

p <= 18: minLow = zigzag(min), minHigh = 0
p > 18: minLow = low bits of zigzag(min), minHigh = high bits of zigzag(min)

That would have a representation of 1 byte each for minLow and minHigh + 2 bytes for field identifiers = 4. If we leave the Int128 level, that would add an additional 2 bytes. ---
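The size argument in the comment above can be checked with a small helper. This is a hypothetical sketch, not ORC code: it counts base-128 varint bytes for a 64-bit value interpreted as unsigned, showing that a raw two's-complement -1 costs the maximum number of bytes while its zigzag form fits in one.

```java
// Hypothetical helper to count base-128 varint bytes; not part of ORC.
public class VarintSize {
    // Bytes needed to varint-encode v, treating v as unsigned 64-bit.
    public static int size(long v) {
        int n = 1;
        while ((v & ~0x7fL) != 0) {
            v >>>= 7;
            n++;
        }
        return n;
    }

    // Zigzag mapping for signed 64-bit values.
    public static long zigzag(long v) {
        return (v << 1) ^ (v >> 63);
    }
}
```

A plain 64-bit varint of -1 has all bits set and needs ten bytes, while zigzag(-1) = 1 needs one byte, which is the gist of the combined minLow/minHigh proposal.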
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r181525462

--- Diff: site/_docs/encodings.md ---
@@ -122,6 +132,12 @@ DIRECT| PRESENT | Yes | Boolean RLE
 DIRECT_V2 | PRESENT | Yes | Boolean RLE
 | DATA | No | Unbounded base 128 varints
 | SECONDARY | No | Unsigned Integer RLE v2
+DECIMAL_V1 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
+ | SECONDARY | Yes | Signed Integer RLE v1
--- End diff --

We probably need an RLE128 that can just encode an int128 directly. Then we can just use the DATA stream directly. ---
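A value type of the kind an RLE128 stream would carry might be held as two longs. The Int128 class below is purely hypothetical, sketched under the assumption of a (high, low) two's-complement split; none of these names exist in ORC.

```java
// Hypothetical 128-bit integer held as two 64-bit words; not ORC code.
public class Int128 {
    public final long high; // upper 64 bits, two's complement
    public final long low;  // lower 64 bits, treated as unsigned

    public Int128(long high, long low) {
        this.high = high;
        this.low = low;
    }

    // Sign-extend a 64-bit value into 128 bits.
    public static Int128 fromLong(long v) {
        return new Int128(v >> 63, v);
    }

    // 128-bit addition, propagating the carry out of the low word,
    // e.g. for accumulating a decimal column's sum statistic.
    public Int128 add(Int128 other) {
        long l = this.low + other.low;
        // Unsigned overflow in the low word produces a carry.
        long carry = (Long.compareUnsigned(l, this.low) < 0) ? 1 : 0;
        return new Int128(this.high + other.high + carry, l);
    }
}
```

Note that fromLong(-1) yields high = -1 and low = -1, the very pattern the size discussion above flags as expensive under a naive per-word varint encoding.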
[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/247#discussion_r181437863

--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+  * 64 and 128 bit variants
+  * Zero suppression
+  * Evaluate the rle subformats
+* Group stripe data into stripelets to enable Async IO for reads.
+* Reorder stripe data into (stripe metadata, index, dictionary, data)
+* Stop sorting dictionaries and record the sort order separately in the index.
+* Remove use of RLEv1 and RLEv2.
+* Remove non-utf8 bloom filter.
+* Use numeric value for decimal bloom filter.
--- End diff --

Agreed ---
[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/247 ORC-339. Reorganize the ORC file format specification. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-339 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/247.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #247 commit 5c56d74d948a73f5c456e0e80ff0622505d6c1cf Author: Owen O'Malley <omalley@...> Date: 2018-04-12T22:03:00Z ORC-339. Reorganize the ORC file format specification. ---
[GitHub] orc pull request #246: ORC-338. Workaround C++ compiler bug in xcode 9.3 by ...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/246 ORC-338. Workaround C++ compiler bug in xcode 9.3 by removing an inline function. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-338 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/246.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #246 ---
[GitHub] orc pull request #243: Update the site with more information about developin...
Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/243#discussion_r180641849

--- Diff: site/develop/committers.md ---
@@ -0,0 +1,62 @@
+---
+layout: page
+title: Project Members
+---
+
+## Project Members
+
+{% comment %}
+please sort by Apache Id
+{% endcomment %}
+Name | Apache Id | Role
+:-- | :--- | :---
+Aliaksei Sandryhaila | asandryh | PMC
+Chris Douglas | cdouglas | PMC
+Chinna Rao Lalam | chinnaraol | Committer
+Chaoyu Tang | ctang | Committer
+Carl Steinbach | cws | Committer
+Daniel Dai | daijy | Committer
+Deepak Majeti | mdeepak | PMC
+Eugene Koifman | ekoifman | PMC
+Gang Wu | gangwu | Committer
+Alan Gates | gates | PMC
+Gopal Vijayaraghavan | gopalv | PMC
+Gunther Hagleitner | gunther | Committer
+Ashutosh Chauhan | hashutosh | Committer
+Jesus Camacho Rodriguez | jcamacho | Committer
+Jason Dere | jdere | Committer
+Jimmy Xiang | jxiang | Committer
+Kevin Wilfong | kevinwilfong | Committer
+Lars Francke | larsfrancke | Committer
+Lefty Leverenz | leftyl | PMC
+Rui Li | lirui | Committer
+Mithun Radhakrishnan | mithun | Committer
+Matthew McCline | mmccline | Committer
+Naveen Gangam | ngangam | Committer
+Owen O'Malley | omalley | PMC
+Prasanth Jayachandran | prasanthj | PMC
+Pengcheng Xiong | pxiong | Committer
+Rajesh Balamohan | rbalamohan | Committer
+Sergey Shelukhin | sershe | Committer
+Sergio Pena | spena | Committer
+Siddharth Seth | sseth | Committer
+Stephen Walkauskas | swalkaus | Committer
+Vaibhav Gumashta | vgumashta | Committer
+Wei Zheng | weiz | Committer
+Xiening Dai | xndai | Committer
+Xuefu Zhang | xuefu | Committer
+Ferdinand Xu | xuf | Committer
+Yongzhi Chen | ychena | Committer
+Aihua Xu | zihuaxu | Committer
+
+Companies with employees that are committers:
+
+* Alibaba
+* Cloudera
+* Facebook
+* Hewlett Packard Enterprise
--- End diff --

Done ---
[GitHub] orc pull request #243: Update the site with more information about developin...
Github user omalley closed the pull request at: https://github.com/apache/orc/pull/243 ---
[GitHub] orc issue #243: Update the site with more information about developing.
Github user omalley commented on the issue: https://github.com/apache/orc/pull/243 I moved the committer list to a new page, because it made all of the other information get lost below the long list. I added sections on setting up git, approving pull requests, creating a gpg key, and the dist directory. I also added Xiening Dai and Gang Wu as committers. ---
[GitHub] orc pull request #243: Update the site with more information about developin...
GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/243 Update the site with more information about developing. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc dev-site Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/243.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #243 commit 079a23dc5cd1016a51be6eb5df80226c9ebc5a54 Author: Owen O'Malley <omalley@...> Date: 2018-04-10T22:36:51Z Update the site with more information about developing. Signed-off-by: Owen O'Malley <omal...@apache.org> ---