[jira] [Commented] (ORC-154) add OrcFile.WriterOptions.clone()
[ https://issues.apache.org/jira/browse/ORC-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930927#comment-15930927 ]

Eugene Koifman commented on ORC-154:

thanks

> add OrcFile.WriterOptions.clone()
>
> Key: ORC-154
> URL: https://issues.apache.org/jira/browse/ORC-154
> Project: ORC
> Issue Type: Improvement
> Affects Versions: 1.3.3
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Fix For: 1.4.0
> Attachments: ORC-154.01.patch

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ORC-166) add codec pool to ORC; make sure end is called on underlying codecs
[ https://issues.apache.org/jira/browse/ORC-166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930920#comment-15930920 ]

Sergey Shelukhin commented on ORC-166:

Will do, probably next week.
Side note from some testing - codecs need to be reset before every decompress call (e.g. in ensureShim). Will add to patch eventually.

> add codec pool to ORC; make sure end is called on underlying codecs
>
> Key: ORC-166
> URL: https://issues.apache.org/jira/browse/ORC-166
> Project: ORC
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: ORC-166.patch
>
> Subj
[jira] [Assigned] (ORC-154) add OrcFile.WriterOptions.clone()
[ https://issues.apache.org/jira/browse/ORC-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned ORC-154: Assignee: Eugene Koifman
[jira] [Resolved] (ORC-165) add eclipse files to gitignore
[ https://issues.apache.org/jira/browse/ORC-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved ORC-165.

Resolution: Fixed
Fix Version/s: 1.4.0

I just committed this. Thanks, Sergey!

> add eclipse files to gitignore
>
> Key: ORC-165
> URL: https://issues.apache.org/jira/browse/ORC-165
> Project: ORC
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Fix For: 1.4.0
> Attachments: ORC-165.patch
[jira] [Updated] (ORC-166) add codec pool to ORC; make sure end is called on underlying codecs
[ https://issues.apache.org/jira/browse/ORC-166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated ORC-166:

Attachment: ORC-166.patch

The patch. [~prasanth_j] [~owen.omalley] can you take a look?

The problem is that end() call on codecs is not exposed, which causes native assets in direct codecs to leak until full GC. Those tend to accumulate a lot e.g. in LLAP, since codecs are created and forgotten a lot, e.g. in isAvailable method where the codec used for the check is forgotten.

This changes codecs to be reusable and adds a pool, and also changes usage patterns in some places to facilitate closing them.
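The pooling idea described in this comment can be sketched roughly as follows. This is a minimal illustration and not the code in ORC-166.patch: `PooledCodec`, `CountingCodec`, `CodecPoolSketch`, `borrow`, and `giveBack` are hypothetical names introduced here. Only the notions that a codec can be reset for reuse and that `end()` releases native resources come from the thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Hypothetical stand-in for a codec that owns native resources. */
interface PooledCodec {
  void reset(); // clear per-use state so the instance can be reused
  void end();   // release native resources now instead of waiting for GC
}

/** Illustrative codec that only counts lifecycle calls. */
final class CountingCodec implements PooledCodec {
  static int created, resets, ended;
  CountingCodec() { created++; }
  @Override public void reset() { resets++; }
  @Override public void end() { ended++; }
}

/** Hypothetical bounded pool; not the implementation in ORC-166.patch. */
final class CodecPoolSketch {
  private final Deque<PooledCodec> idle = new ArrayDeque<>();
  private final Supplier<PooledCodec> factory;
  private final int maxIdle;

  CodecPoolSketch(Supplier<PooledCodec> factory, int maxIdle) {
    this.factory = factory;
    this.maxIdle = maxIdle;
  }

  /** Reuse an idle codec (after resetting it) or create a new one. */
  synchronized PooledCodec borrow() {
    PooledCodec codec = idle.pollFirst();
    if (codec == null) {
      return factory.get();
    }
    codec.reset();
    return codec;
  }

  /** Keep the codec for reuse, or end() it when the pool is already full. */
  synchronized void giveBack(PooledCodec codec) {
    if (idle.size() < maxIdle) {
      idle.addFirst(codec);
    } else {
      codec.end();
    }
  }

  synchronized int idleCount() {
    return idle.size();
  }

  public static void main(String[] args) {
    CodecPoolSketch pool = new CodecPoolSketch(CountingCodec::new, 1);
    PooledCodec first = pool.borrow();
    PooledCodec second = pool.borrow();
    pool.giveBack(first);               // kept for reuse
    pool.giveBack(second);              // pool is full, so end() is called
    PooledCodec reused = pool.borrow(); // same instance as first, reset before reuse
    System.out.println("created=" + CountingCodec.created
        + " resets=" + CountingCodec.resets
        + " ended=" + CountingCodec.ended
        + " reusedFirst=" + (reused == first));
  }
}
```

The design choice worth noting is that a codec is never simply dropped: it either goes back to the idle list or has `end()` called on it, which is the leak the comment is concerned with.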
[jira] [Commented] (ORC-166) add codec pool to ORC; make sure end is called on underlying codecs
[ https://issues.apache.org/jira/browse/ORC-166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930690#comment-15930690 ] Sergey Shelukhin commented on ORC-166: cc [~rajesh.balamohan] [~gopalv]
[jira] [Assigned] (ORC-166) add codec pool to ORC; make sure end is called on underlying codecs
[ https://issues.apache.org/jira/browse/ORC-166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin reassigned ORC-166:
[jira] [Updated] (ORC-165) add eclipse files to gitignore
[ https://issues.apache.org/jira/browse/ORC-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated ORC-165: Attachment: ORC-165.patch [~owen.omalley] [~prasanth_j] can you take a look?
[jira] [Commented] (ORC-164) Decimal column encoding is documented incorrectly
[ https://issues.apache.org/jira/browse/ORC-164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930451#comment-15930451 ]

Owen O'Malley commented on ORC-164:

Actually, we fixed the text in ORC-139. We did miss the table.

> Decimal column encoding is documented incorrectly
>
> Key: ORC-164
> URL: https://issues.apache.org/jira/browse/ORC-164
> Project: ORC
> Issue Type: Bug
> Components: documentation
> Reporter: Douglas Drinka
> Labels: documentation
>
> Relevant code:
> {code:title=WriterImpl.java:DecimalTreeWriter|borderStyle=solid}
> this.scaleStream = createIntegerWriter(writer.createStream(id,
>     OrcProto.Stream.Kind.SECONDARY), true, isDirectV2, writer);
> {code}
> The documentation states the Scale stream is unsigned, both in the description and in the table. The code reads and writes this column as signed.
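To illustrate why the signed/unsigned distinction in the report matters to anyone implementing a reader from the documentation, here is a small sketch. It assumes that signed integer streams map values with zigzag encoding before writing them as unsigned varints; `ZigZagSketch`, `encode`, and `decode` are hypothetical names and not ORC classes.

```java
/** Hypothetical illustration of zigzag mapping for signed values; not ORC code. */
final class ZigZagSketch {
  /** Map a signed long to an unsigned-style value: 0, -1, 1, -2, ... become 0, 1, 2, 3, ... */
  static long encode(long value) {
    return (value << 1) ^ (value >> 63);
  }

  /** Inverse of encode. */
  static long decode(long stored) {
    return (stored >>> 1) ^ -(stored & 1);
  }

  public static void main(String[] args) {
    long scale = 2;                // e.g. the scale of 0.00
    long stored = encode(scale);   // what a signed writer would put in the stream
    System.out.println("scale written as signed: " + stored);
    System.out.println("read back as signed:     " + decode(stored));
    // A reader that trusts "unsigned" in the docs would use the stored value directly.
    System.out.println("misread as unsigned:     " + stored);
  }
}
```

Under that assumption, a reader that treats the scale stream as unsigned would see every non-negative scale doubled, which is why the table in the documentation needs the same correction as the text.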
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930387#comment-15930387 ]

Owen O'Malley commented on ORC-161:

When Hive first introduced decimal, the bounds weren't specified and varied by object. That is *really* problematic for SQL, so since Hive 0.12 all of the decimals have had precision and scale specified in the type. Thus, although there is support for per-object precision and scale, we can and should move to enforcing the scale and precision.

Thus, we absolutely need an improved decimal encoding for ORC. It was hard to make big changes to ORC before Hive switched to using this implementation, but that is done now. If someone wants to work on this, it would be great.

As you wrote above, using an encoding like longs would be great for values with precision <= 18. In particular, we should not encode the scale at all and force all of the values to use the scale from the type. Since we don't have a 128 bit RLE, the longer precision decimals should probably be a pair of RLE long streams.

> Create a new column type that run-length-encodes decimals
>
> Key: ORC-161
> URL: https://issues.apache.org/jira/browse/ORC-161
> Project: ORC
> Issue Type: Wish
> Components: encoding
> Reporter: Douglas Drinka
>
> I'm storing prices in ORC format, and have made the following observations about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or minus a few pennies per data point. This would encode beautifully with a patched base encoding. Instead I'm averaging 4 bytes per data point, after Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in decimal columns, but that you probably won't. Presto, for instance, has a fast-path which engages for precision of 18 or less, and decodes to 64-bit longs, and then a slow path which uses BigInt. I anticipate the majority of implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data point is messy. Sometimes it's checked on data ingest, other times it's an error on reading, or else it's cast (and rounded?)
> I don't propose eliminating the current column type. It's nice to know there's a way to store really big numbers (or really accurate numbers) if I need that in the future.
> But I'd like to see a new column that uses the existing Run Length Encoding functionality, and is limited to 63+1 bit numbers, with a fixed precision and scale for ingest and query.
> I think one could call this FixedPoint. Every number is stored as a long, and scaled by a column constant. Ingest from decimal would scale and throw or round, configurably. Precision would be fixed at 18, or made configurable and verified at ingest. Stats would use longs (scaled with the column) rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits of precision. Or they can keep using decimal if they need 128 bits. Win/win?
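The FixedPoint idea in the description (every value stored as a long scaled by a column constant, with ingest either throwing or rounding) could look roughly like the sketch below. It is only an illustration of the proposal: `FixedPointSketch`, `toScaledLong`, and `fromScaledLong` are hypothetical names, and nothing here reflects an actual ORC encoding.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical illustration of the FixedPoint proposal; not an ORC API. */
final class FixedPointSketch {
  /**
   * Scale a decimal to a long using the column scale.
   * With round == false, any value that would lose digits is rejected.
   */
  static long toScaledLong(BigDecimal value, int columnScale, boolean round) {
    // RoundingMode.UNNECESSARY makes setScale throw ArithmeticException if digits would be lost.
    RoundingMode mode = round ? RoundingMode.HALF_UP : RoundingMode.UNNECESSARY;
    // longValueExact throws ArithmeticException if the scaled value does not fit in 64 bits.
    return value.setScale(columnScale, mode).unscaledValue().longValueExact();
  }

  /** Rebuild the decimal from the stored long and the column scale. */
  static BigDecimal fromScaledLong(long stored, int columnScale) {
    return BigDecimal.valueOf(stored, columnScale);
  }

  public static void main(String[] args) {
    int columnScale = 6; // the decimal(18,6) case mentioned in the description
    String[] prices = {"12.34", "12.35", "12.33"};
    for (String price : prices) {
      long stored = toScaledLong(new BigDecimal(price), columnScale, false);
      System.out.println(price + " -> " + stored + " -> " + fromScaledLong(stored, columnScale));
    }
  }
}
```

Because prices that move by a few pennies become longs that differ by small amounts, an integer run-length or delta encoding has much more to work with than it does on per-value decimal bytes, which is the efficiency argument made in the description.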
[jira] [Resolved] (ORC-157) Test failed due to timezone DST
[ https://issues.apache.org/jira/browse/ORC-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved ORC-157.

Resolution: Fixed
Fix Version/s: 1.4.0, 1.3.4

I just committed this.

> Test failed due to timezone DST
>
> Key: ORC-157
> URL: https://issues.apache.org/jira/browse/ORC-157
> Project: ORC
> Issue Type: Test
> Components: tools
> Affects Versions: 1.3.3
> Reporter: Andrey Morskoy
> Assignee: Owen O'Malley
> Priority: Trivial
> Fix For: 1.3.4, 1.4.0
> Attachments: CMakeError.log, CMakeOutput.log
>
> To reproduce:
> % mkdir build
> % cd build
> % cmake .. -DBUILD_JAVA=OFF
> % make package
> % make test-out
> Output is:
> [ FAILED ] TestMatchParam/FileParam.Contents/20, where GetParam() = orc_split_elim.orc
> 1 FAILED TEST
> Sample line is:
> /storage/progs/src/orc-rel-release-1.3.3/tools/test/TestMatch.cc:149: Failure
> Value of: line
> Actual: "{\"userid\": 100, \"string1\": \"zebra\", \"subtype\": 8, \"decimal1\": 0.00, \"ts\": \"1969-12-31 17:04:10.0\"}"
> Expected: expectedLine
> Which is: "{\"userid\": 100, \"string1\": \"zebra\", \"subtype\": 8, \"decimal1\": 0.00, \"ts\": \"1969-12-31 16:04:10.0\"}"
> wrong output at row 24999
> so ts: 1 hour difference
> I am in Ukraine (GMT+2 + DST). I suppose that, since our DST change is at the end of March (while the US changes at the beginning of March, AFAIK), something is wrong with the timestamp-with-timezone interpretation in this test, and the effect will probably disappear after Ukraine moves to DST.
> So there is a two-week window when this test could fail.
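The two-week window the reporter describes can be checked with `java.time`. The sketch below makes several assumptions that are not in the thread: it uses `America/Los_Angeles` as the zone behind the expected output (a guess from the expected `1969-12-31 16:04:10.0`, which is 8 hours behind UTC), `Europe/Kiev` for the reporter's zone, and dates in March 2017 as an example year. `DstWindowSketch` and `hoursAhead` are hypothetical names, and the code does not reproduce the ORC test or its fix; it only shows that the offset between the two zones shrinks by one hour inside the window.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

/** Hypothetical illustration of the DST mismatch window; not the ORC test code. */
final class DstWindowSketch {
  /** Hours by which zone a is ahead of zone b at noon UTC on the given date. */
  static int hoursAhead(ZoneId a, ZoneId b, LocalDate date) {
    Instant noonUtc = date.atTime(12, 0).toInstant(ZoneOffset.UTC);
    int secondsA = a.getRules().getOffset(noonUtc).getTotalSeconds();
    int secondsB = b.getRules().getOffset(noonUtc).getTotalSeconds();
    return (secondsA - secondsB) / 3600;
  }

  public static void main(String[] args) {
    ZoneId kiev = ZoneId.of("Europe/Kiev");                 // assumed reporter zone
    ZoneId losAngeles = ZoneId.of("America/Los_Angeles");   // assumed zone of the expected output
    LocalDate[] dates = {
        LocalDate.of(2017, 3, 1),   // before either zone switches
        LocalDate.of(2017, 3, 17),  // US has switched, Ukraine has not
        LocalDate.of(2017, 4, 1)    // both have switched
    };
    for (LocalDate date : dates) {
      System.out.println(date + ": Kiev is " + hoursAhead(kiev, losAngeles, date)
          + " hours ahead of Los Angeles");
    }
  }
}
```

Code that converts between the two zones with a fixed difference, or that applies the local machine's current DST state to the writer's timestamps, would be off by exactly one hour during that window, which matches the one-hour discrepancy in the failing row.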
[jira] [Commented] (ORC-157) Test failed due to timezone DST
[ https://issues.apache.org/jira/browse/ORC-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930277#comment-15930277 ] ASF GitHub Bot commented on ORC-157: Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/100
[jira] [Commented] (ORC-157) Test failed due to timezone DST
[ https://issues.apache.org/jira/browse/ORC-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929650#comment-15929650 ] Andrey Morskoy commented on ORC-157: [~owen.omalley] Great - the tests now pass. Thanks. Please feel free to resolve the issue according to the flow.
[jira] [Commented] (ORC-157) Test failed due to timezone DST
[ https://issues.apache.org/jira/browse/ORC-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929648#comment-15929648 ] ASF GitHub Bot commented on ORC-157: Github user amorskoy commented on the issue: https://github.com/apache/orc/pull/100 Checked - tests passed now