[jira] [Commented] (ORC-154) add OrcFile.WriterOptions.clone()

2017-03-17 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930927#comment-15930927
 ] 

Eugene Koifman commented on ORC-154:


thanks

> add OrcFile.WriterOptions.clone()
> -
>
> Key: ORC-154
> URL: https://issues.apache.org/jira/browse/ORC-154
> Project: ORC
> Issue Type: Improvement
> Affects Versions: 1.3.3
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Fix For: 1.4.0
>
> Attachments: ORC-154.01.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ORC-166) add codec pool to ORC; make sure end is called on underlying codecs

2017-03-17 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930920#comment-15930920
 ] 

Sergey Shelukhin commented on ORC-166:
--

Will do, probably next week. Side note from some testing - codecs need to be 
reset before every decompress call (e.g. in ensureShim). Will add to patch 
eventually.
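
The reset requirement can be illustrated with the JDK's own zlib wrapper. This is only a sketch, assuming the ORC codecs behave like `java.util.zip.Inflater` in this respect; the class and method names (`ResetBeforeDecompress`, `secondResult`) are made up for illustration and are not part of ORC or of the patch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ResetBeforeDecompress {
    /** Compresses the input with a throwaway Deflater that is released afterwards. */
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decompresses with a shared Inflater, optionally resetting it first. */
    static byte[] decompress(Inflater inflater, byte[] compressed, boolean resetFirst) {
        if (resetFirst) {
            inflater.reset(); // clear the "finished" state left by the previous call
        }
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // nothing more can be produced; stop instead of spinning
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses two buffers back to back on one Inflater and returns the second result. */
    static String secondResult(boolean resetBetweenCalls) {
        Inflater shared = new Inflater();
        try {
            decompress(shared, compress("first".getBytes(StandardCharsets.UTF_8)), false);
            byte[] second = decompress(shared,
                    compress("second".getBytes(StandardCharsets.UTF_8)), resetBetweenCalls);
            return new String(second, StandardCharsets.UTF_8);
        } finally {
            shared.end(); // release the native zlib state
        }
    }

    public static void main(String[] args) {
        System.out.println("with reset:    \"" + secondResult(true) + "\"");
        System.out.println("without reset: \"" + secondResult(false) + "\"");
    }
}
```

Without the reset, the shared `Inflater` still reports `finished()` from the first buffer, so the second call is expected to return no data.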

> add codec pool to ORC; make sure end is called on underlying codecs
> ---
>
> Key: ORC-166
> URL: https://issues.apache.org/jira/browse/ORC-166
> Project: ORC
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: ORC-166.patch
>
>
> Subj





[jira] [Assigned] (ORC-154) add OrcFile.WriterOptions.clone()

2017-03-17 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reassigned ORC-154:
-

Assignee: Eugene Koifman

> add OrcFile.WriterOptions.clone()
> -
>
> Key: ORC-154
> URL: https://issues.apache.org/jira/browse/ORC-154
> Project: ORC
> Issue Type: Improvement
> Affects Versions: 1.3.3
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Fix For: 1.4.0
>
> Attachments: ORC-154.01.patch
>
>






[jira] [Resolved] (ORC-165) add eclipse files to gitignore

2017-03-17 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved ORC-165.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

I just committed this. Thanks, Sergey!

> add eclipse files to gitignore
> --
>
> Key: ORC-165
> URL: https://issues.apache.org/jira/browse/ORC-165
> Project: ORC
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Fix For: 1.4.0
>
> Attachments: ORC-165.patch
>
>






[jira] [Updated] (ORC-166) add codec pool to ORC; make sure end is called on underlying codecs

2017-03-17 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated ORC-166:
-
Attachment: ORC-166.patch

The patch. [~prasanth_j] [~owen.omalley] can you take a look?
The problem is that the end() call on codecs is not exposed, which causes 
native resources in direct codecs to leak until a full GC. These tend to 
accumulate, e.g. in LLAP, because codecs are often created and then forgotten, 
as in the isAvailable method, where the codec used for the check is never 
released.
This patch makes codecs reusable, adds a pool, and changes usage patterns in 
some places so that codecs can be closed. 
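
The pooling idea described above can be sketched roughly as follows. This is not the code from ORC-166.patch: it assumes the native resource is a `java.util.zip.Inflater`, and the `InflaterPool` class with its `borrow`, `giveBack`, and counter methods is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.zip.Inflater;

/** Hypothetical pool that reuses Inflaters and calls end() on the ones it drops. */
public class InflaterPool implements AutoCloseable {
    private final ArrayDeque<Inflater> idle = new ArrayDeque<>();
    private final int maxIdle;
    private int created = 0;
    private int ended = 0;

    public InflaterPool(int maxIdle) {
        this.maxIdle = maxIdle;
    }

    /** Hands out an idle instance if there is one, otherwise allocates a new one. */
    public synchronized Inflater borrow() {
        Inflater inflater = idle.poll();
        if (inflater == null) {
            created++;
            return new Inflater();
        }
        inflater.reset(); // make the reused instance ready for fresh input
        return inflater;
    }

    /** Keeps the instance for reuse, or releases its native memory if the pool is full. */
    public synchronized void giveBack(Inflater inflater) {
        if (idle.size() < maxIdle) {
            idle.push(inflater);
        } else {
            inflater.end();
            ended++;
        }
    }

    /** Releases every idle instance instead of waiting for a GC to do it. */
    @Override
    public synchronized void close() {
        Inflater inflater;
        while ((inflater = idle.poll()) != null) {
            inflater.end();
            ended++;
        }
    }

    public synchronized int createdCount() {
        return created;
    }

    public synchronized int endedCount() {
        return ended;
    }
}
```

A caller would wrap each use in try/finally so that `giveBack` always runs, and close the pool on shutdown.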

> add codec pool to ORC; make sure end is called on underlying codecs
> ---
>
> Key: ORC-166
> URL: https://issues.apache.org/jira/browse/ORC-166
> Project: ORC
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: ORC-166.patch
>
>
> Subj





[jira] [Commented] (ORC-166) add codec pool to ORC; make sure end is called on underlying codecs

2017-03-17 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930690#comment-15930690
 ] 

Sergey Shelukhin commented on ORC-166:
--

cc [~rajesh.balamohan] [~gopalv]

> add codec pool to ORC; make sure end is called on underlying codecs
> ---
>
> Key: ORC-166
> URL: https://issues.apache.org/jira/browse/ORC-166
> Project: ORC
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
>
> Subj





[jira] [Assigned] (ORC-166) add codec pool to ORC; make sure end is called on underlying codecs

2017-03-17 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin reassigned ORC-166:



> add codec pool to ORC; make sure end is called on underlying codecs
> ---
>
> Key: ORC-166
> URL: https://issues.apache.org/jira/browse/ORC-166
> Project: ORC
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
>
> Subj





[jira] [Updated] (ORC-165) add eclipse files to gitignore

2017-03-17 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated ORC-165:
-
Attachment: ORC-165.patch

[~owen.omalley] [~prasanth_j] can you take a look?

> add eclipse files to gitignore
> --
>
> Key: ORC-165
> URL: https://issues.apache.org/jira/browse/ORC-165
> Project: ORC
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: ORC-165.patch
>
>






[jira] [Commented] (ORC-164) Decimal column encoding is documented incorrectly

2017-03-17 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930451#comment-15930451
 ] 

Owen O'Malley commented on ORC-164:
---

Actually, we fixed the text in ORC-139. We did miss the table.

> Decimal column encoding is documented incorrectly
> -
>
> Key: ORC-164
> URL: https://issues.apache.org/jira/browse/ORC-164
> Project: ORC
> Issue Type: Bug
> Components: documentation
> Reporter: Douglas Drinka
> Labels: documentation
>
> Relevant code:
> {code:title=WriterImpl.java:DecimalTreeWriter|borderStyle=solid}
> this.scaleStream = createIntegerWriter(writer.createStream(id,
>   OrcProto.Stream.Kind.SECONDARY), true, isDirectV2, writer);
> {code}
> The documentation states the Scale stream is unsigned, both in the 
> description and in the table.  The code reads and writes this column as 
> signed.
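
Why the signed/unsigned distinction matters for a reader can be sketched as follows. ORC writes signed integers with zigzag encoding, so a reader that treats the scale stream as unsigned sees different numbers. The `SignedScale` class is a hypothetical illustration, not ORC code, and it ignores the varint and run-length layers that sit on top of the zigzag step.

```java
public class SignedScale {
    /** Zigzag-encodes a signed value so that small magnitudes become small unsigned values. */
    static long zigzagEncode(long value) {
        return (value << 1) ^ (value >> 63);
    }

    /** Reverses zigzagEncode. */
    static long zigzagDecode(long encoded) {
        return (encoded >>> 1) ^ -(encoded & 1);
    }

    public static void main(String[] args) {
        long[] scales = {0, 2, 6, -1};
        for (long scale : scales) {
            long onDisk = zigzagEncode(scale);
            // A reader that follows the (incorrect) "unsigned" documentation would use
            // onDisk directly; a reader that matches the writer decodes it first.
            System.out.println("scale=" + scale
                    + " stored=" + onDisk
                    + " readAsUnsigned=" + onDisk
                    + " readAsSigned=" + zigzagDecode(onDisk));
        }
    }
}
```

For a scale of 2, the stored value is 4, so an implementation written against the documented unsigned encoding would misread every non-zero scale.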





[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2017-03-17 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930387#comment-15930387
 ] 

Owen O'Malley commented on ORC-161:
---

When Hive first introduced decimal, the bounds weren't specified and varied by 
object. That is *really* problematic for SQL, so since Hive 0.12 all of the 
decimals have had precision and scale specified in the type. Thus, although 
there is support for a per-object scale, we can and should move to enforcing 
the scale and precision of the type.

Thus, we absolutely need an improved decimal encoding for ORC. It was hard to 
make big changes to ORC before Hive switched to using this implementation, but 
that is done now.

If someone wants to work on this, it would be great. As you wrote above, using 
an encoding like longs would be great for values with precision <= 18. In 
particular, we should not encode the scale at all and force all of the values 
to use the scale from the type. Since we don't have a 128-bit RLE, the longer 
precision decimals should probably be a pair of RLE long streams.
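
A rough sketch of the two cases mentioned above: checking that a given precision fits in a single long, and splitting a wider unscaled value into a high and a low long that could each go to its own stream. The `DecimalAsLongs` class and its methods are hypothetical, and how the two halves would actually be laid out in ORC is left open here.

```java
import java.math.BigInteger;

public class DecimalAsLongs {
    private static final BigInteger MASK64 =
            BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

    /** True if every unscaled value with this many decimal digits fits in a signed long. */
    static boolean fitsInLong(int precision) {
        BigInteger max = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        return max.bitLength() <= 63;
    }

    /** Splits a 128-bit two's complement value into {high, low} 64-bit halves. */
    static long[] split(BigInteger unscaled) {
        long high = unscaled.shiftRight(64).longValueExact(); // throws if wider than 128 bits
        long low = unscaled.longValue();                      // low-order 64 bits
        return new long[] {high, low};
    }

    /** Rebuilds the value; the low half is treated as unsigned. */
    static BigInteger join(long high, long low) {
        return BigInteger.valueOf(high).shiftLeft(64)
                .add(BigInteger.valueOf(low).and(MASK64));
    }

    public static void main(String[] args) {
        System.out.println("precision 18 fits in a long: " + fitsInLong(18));
        System.out.println("precision 19 fits in a long: " + fitsInLong(19));
        BigInteger wide = new BigInteger("-12345678901234567890123456789012345678");
        long[] parts = split(wide);
        System.out.println("round trip ok: " + join(parts[0], parts[1]).equals(wide));
    }
}
```

The scale never appears in this sketch because, as suggested above, it would come from the column type rather than from the data.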

> Create a new column type that run-length-encodes decimals
> -
>
> Key: ORC-161
> URL: https://issues.apache.org/jira/browse/ORC-161
> Project: ORC
> Issue Type: Wish
> Components: encoding
> Reporter: Douglas Drinka
>
> I'm storing prices in ORC format, and have made the following observations 
> about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or 
> minus a few pennies per data point. This would encode beautifully with a 
> patched base encoding.  Instead I'm averaging 4 bytes per data point, after 
> Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in 
> decimal columns, but that you probably won't.  Presto, for instance, has a 
> fast-path which engages for precision of 18 or less, and decodes to 64-bit 
> longs, and then a slow path which uses BigInt.  I anticipate the majority of 
> implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data 
> point, is messy.  Sometimes it's checked on data ingest, other times it's an 
> error on reading, or else it's cast (and rounded?)
> I don't propose eliminating the current column type.  It's nice to know 
> there's a way to store really big numbers (or really accurate numbers) if I 
> need that in the future.
> But I'd like to see a new column that uses the existing Run Length Encoding 
> functionality, and is limited to 63+1 bit numbers, with a fixed precision and 
> scale for ingest and query.
> I think one could call this FixedPoint.  Every number is stored as a long, 
> and scaled by a column constant.  Ingest from decimal would scale and throw 
> or round, configurably.  Precision would be fixed at 18, or made configurable 
> and verified at ingest.  Stats would use longs (scaled with the column) 
> rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits 
> of precision.  Or they can keep using decimal if they need 128 bits.  Win/win?
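
The "scale and throw or round, configurably" ingest step from the proposal could look roughly like this. It is a sketch only: `FixedPointIngest`, `OnExcessScale`, and the choice of HALF_UP rounding are assumptions made for illustration, not anything specified in the issue.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class FixedPointIngest {
    /** Hypothetical policy for values with more fractional digits than the column allows. */
    enum OnExcessScale { THROW, ROUND }

    /** Converts a decimal to the long stored in a column with the given fixed scale. */
    static long ingest(BigDecimal value, int columnScale, OnExcessScale policy) {
        RoundingMode mode = (policy == OnExcessScale.ROUND)
                ? RoundingMode.HALF_UP       // assumed rounding rule
                : RoundingMode.UNNECESSARY;  // setScale throws if digits would be lost
        // longValueExact throws ArithmeticException if the result needs more than 64 bits.
        return value.setScale(columnScale, mode).unscaledValue().longValueExact();
    }

    /** Converts a stored long back to a decimal using the column's scale. */
    static BigDecimal read(long stored, int columnScale) {
        return BigDecimal.valueOf(stored, columnScale);
    }

    public static void main(String[] args) {
        long stored = ingest(new BigDecimal("19.995"), 2, OnExcessScale.ROUND);
        System.out.println(stored + " -> " + read(stored, 2));
        try {
            ingest(new BigDecimal("19.995"), 2, OnExcessScale.THROW);
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because every value in the column shares one scale, column statistics could be kept as plain longs and scaled only when reported.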





[jira] [Resolved] (ORC-157) Test failed due to timezone DST

2017-03-17 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved ORC-157.
---
   Resolution: Fixed
Fix Version/s: 1.4.0, 1.3.4

I just committed this.

> Test failed due to timezone DST
> ---
>
> Key: ORC-157
> URL: https://issues.apache.org/jira/browse/ORC-157
> Project: ORC
> Issue Type: Test
> Components: tools
> Affects Versions: 1.3.3
> Reporter: Andrey Morskoy
> Assignee: Owen O'Malley
> Priority: Trivial
> Fix For: 1.3.4, 1.4.0
>
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> To reproduce:
> % mkdir build
> % cd build
> % cmake .. -DBUILD_JAVA=OFF
> % make package
> % make test-out
> Output is:
> [  FAILED  ] TestMatchParam/FileParam.Contents/20, where GetParam() = 
> orc_split_elim.orc
>  1 FAILED TEST
> Sample line is:
> /storage/progs/src/orc-rel-release-1.3.3/tools/test/TestMatch.cc:149: Failure
> Value of: line
>   Actual: "{\"userid\": 100, \"string1\": \"zebra\", \"subtype\": 8, 
> \"decimal1\": 0.00, \"ts\": \"1969-12-31 17:04:10.0\"}"
> Expected: expectedLine
> Which is: "{\"userid\": 100, \"string1\": \"zebra\", \"subtype\": 8, 
> \"decimal1\": 0.00, \"ts\": \"1969-12-31 16:04:10.0\"}"
> wrong output at row 24999
> So the ts values differ by 1 hour.
> I am in Ukraine (GMT+2, plus DST). My guess is that, because our DST change 
> is at the end of March (while the US changes at the beginning of March, 
> AFAIK), something is wrong with how this test interprets timestamps with 
> time zones, and the effect will probably disappear once Ukraine moves to 
> DST. So there is a two-week window in which this test can fail.
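
The two-week window described in the report can be checked with java.time. This sketch only compares DST transition dates; it does not reproduce the test itself, and the choice of America/Los_Angeles as the US zone is an assumption, since the report does not say which zone the expected output was generated in. The `DstWindow` class is hypothetical.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class DstWindow {
    /** True if the zone observes daylight saving time at noon on the given date. */
    static boolean inDst(String zone, LocalDate date) {
        ZoneId id = ZoneId.of(zone);
        Instant noon = date.atTime(12, 0).atZone(id).toInstant();
        return id.getRules().isDaylightSavings(noon);
    }

    /** Local date of the first offset transition after January 1 of the given year. */
    static LocalDate dstStart(String zone, int year) {
        ZoneId id = ZoneId.of(zone);
        Instant jan1 = LocalDate.of(year, 1, 1).atStartOfDay(id).toInstant();
        return id.getRules().nextTransition(jan1).getDateTimeBefore().toLocalDate();
    }

    public static void main(String[] args) {
        LocalDate reportDay = LocalDate.of(2017, 3, 17);
        for (String zone : new String[] {"America/Los_Angeles", "Europe/Kiev"}) {
            System.out.println(zone
                    + " DST starts " + dstStart(zone, 2017)
                    + ", in DST on " + reportDay + ": " + inDst(zone, reportDay));
        }
    }
}
```

On the day of the report, the US zone is already on DST while Ukraine is not, which is consistent with the one-hour difference in the failing row.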





[jira] [Commented] (ORC-157) Test failed due to timezone DST

2017-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930277#comment-15930277
 ] 

ASF GitHub Bot commented on ORC-157:


Github user asfgit closed the pull request at:

https://github.com/apache/orc/pull/100


> Test failed due to timezone DST
> ---
>
> Key: ORC-157
> URL: https://issues.apache.org/jira/browse/ORC-157
> Project: ORC
> Issue Type: Test
> Components: tools
> Affects Versions: 1.3.3
> Reporter: Andrey Morskoy
> Priority: Trivial
> Attachments: CMakeError.log, CMakeOutput.log
>
>





[jira] [Commented] (ORC-157) Test failed due to timezone DST

2017-03-17 Thread Andrey Morskoy (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929650#comment-15929650
 ] 

Andrey Morskoy commented on ORC-157:


[~owen.omalley] Great, the tests now pass. Thanks. 
Please feel free to resolve the issue according to the workflow.

> Test failed due to timezone DST
> ---
>
> Key: ORC-157
> URL: https://issues.apache.org/jira/browse/ORC-157
> Project: ORC
>  Issue Type: Test
>  Components: tools
>Affects Versions: 1.3.3
>Reporter: Andrey Morskoy
>Priority: Trivial
> Attachments: CMakeError.log, CMakeOutput.log
>
>





[jira] [Commented] (ORC-157) Test failed due to timezone DST

2017-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929648#comment-15929648
 ] 

ASF GitHub Bot commented on ORC-157:


Github user amorskoy commented on the issue:

https://github.com/apache/orc/pull/100
  
Checked: the tests pass now.


> Test failed due to timezone DST
> ---
>
> Key: ORC-157
> URL: https://issues.apache.org/jira/browse/ORC-157
> Project: ORC
>  Issue Type: Test
>  Components: tools
>Affects Versions: 1.3.3
>Reporter: Andrey Morskoy
>Priority: Trivial
> Attachments: CMakeError.log, CMakeOutput.log
>
>


