[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-12 Thread wgtmac
Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r181239251
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
+   * Zero suppression
+   * Evaluate the rle subformats
+* Group stripe data into stripelets to enable Async IO for reads.
+* Reorder stripe data into (stripe metadata, index, dictionary, data)
+* Stop sorting dictionaries and record the sort order separately in the index.
+* Remove use of RLEv1 and RLEv2.
+* Remove non-utf8 bloom filter.
+* Use numeric value for decimal bloom filter.
--- End diff --

We may also use a numeric value for decimal column statistics.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-12 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/247

ORC-339. Reorganize the ORC file format specification.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-339

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #247


commit 5c56d74d948a73f5c456e0e80ff0622505d6c1cf
Author: Owen O'Malley 
Date:   2018-04-12T22:03:00Z

ORC-339. Reorganize the ORC file format specification.




---


[jira] [Created] (ORC-339) Reorganize ORC specification

2018-04-12 Thread Owen O'Malley (JIRA)
Owen O'Malley created ORC-339:
-

 Summary: Reorganize ORC specification
 Key: ORC-339
 URL: https://issues.apache.org/jira/browse/ORC-339
 Project: ORC
  Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Currently we've put the ORC format specification in the documentation. Now that 
we are starting the work to design ORCv2, it will be more convenient to have 
each file format version as a separate page. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] orc issue #244: Add documentation for C++

2018-04-12 Thread wgtmac
Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/244
  
I just committed this. Thanks @majetideepak for review!


---


[GitHub] orc pull request #244: Add documentation for C++

2018-04-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/orc/pull/244


---


[GitHub] orc issue #245: ORC-161: Proposal for new decimal encodings and statistics.

2018-04-12 Thread wgtmac
Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245
  
I will provide them after a comprehensive benchmark.


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread t3rmin4t0r
Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181203617
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two difference cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

Part of this discussion is about the new ORC format, where compatibility
with existing readers is not a requirement until we switch to the new format
as the default.
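
The precision thresholds in the quoted diff (18 digits for 64-bit storage, 38 digits for 128-bit storage) can be checked with a small sketch. The helper names `max_precision` and `fits_int64` are hypothetical and are not part of ORC; the sketch only assumes that a decimal is stored as its unscaled integer value.

```python
def max_precision(bits: int) -> int:
    """Largest precision p such that every p-digit unscaled value
    fits in a signed two's-complement integer of the given width."""
    limit = 2 ** (bits - 1) - 1  # largest positive signed value
    p = 0
    # 10**(p + 1) - 1 is the largest unscaled value with p + 1 digits.
    while 10 ** (p + 1) - 1 <= limit:
        p += 1
    return p


def fits_int64(unscaled: int) -> bool:
    """True if the unscaled decimal value fits in a signed 64-bit integer."""
    return -(2 ** 63) <= unscaled <= 2 ** 63 - 1


if __name__ == "__main__":
    print(max_precision(64), max_precision(128))
```

Under these assumptions, any decimal with precision at most 18 fits in a 64-bit DATA stream value, while some 19-digit values do not, which is consistent with the split proposed in the diff.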


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread t3rmin4t0r
Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181202668
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was change to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as an signed
+integer.
+
+In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
+stream as all decimal values use the same scale. When precision is
+no greater than 18, decimal values can be fully represented by DATA
+stream which stores 64-bit signed integers. When precision is greater
+than 18, we use a 128-bit signed integer to store the decimal value.
+DATA stream stores the higher 64 bits and SECONDARY stream holds the
+lower 64 bits. Both streams use signed integer RLE v2.
--- End diff --

The multiple-stream + row-group stride problems for IO were discussed by 
Owen.

The disk layout is what matters for IO, not the logical stream separation.
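
For readers unfamiliar with the existing DIRECT layout mentioned in the quoted diff, here is a minimal sketch of an unbounded zigzag-encoded base 128 varint. The function names are hypothetical and this is not the ORC reader or writer code; it assumes the usual convention of emitting the least significant 7-bit group first, with the high bit of each byte marking continuation.

```python
def zigzag(n: int) -> int:
    """Map a signed integer to a non-negative one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)


def encode_varint(u: int) -> bytes:
    """Encode a non-negative integer as base 128, low 7 bits first."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single base 128 varint from the start of data."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return result


if __name__ == "__main__":
    value = -(10 ** 38 - 1)  # a 38-digit unscaled decimal value
    encoded = encode_varint(zigzag(value))
    print(len(encoded), unzigzag(decode_varint(encoded)) == value)
```

Because the encoding is byte-at-a-time and variable length, a fixed-width integer stream as proposed for ORC 2.0 trades this flexibility for simpler decoding.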


---


[GitHub] orc issue #245: ORC-161: Proposal for new decimal encodings and statistics.

2018-04-12 Thread prasanthj
Github user prasanthj commented on the issue:

https://github.com/apache/orc/pull/245
  
"we found RLEv1 + zstd may be the best combination than others in terms of 
both compression ratio and encoding/decoding speed."

Do you have experimental numbers for this?


---


[GitHub] orc issue #245: ORC-161: Proposal for new decimal encodings and statistics.

2018-04-12 Thread wgtmac
Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245
  
On second thought, I added DECIMAL_V1 back to support RLE v1 in the decimal 
encoding. The reason is that in our testing, RLEv1 + zstd appeared to be a 
better combination than the others in terms of both compression ratio and 
encoding/decoding speed.


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread wgtmac
Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181168352
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was change to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as an signed
+integer.
+
+In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
+stream as all decimal values use the same scale. When precision is
+no greater than 18, decimal values can be fully represented by DATA
+stream which stores 64-bit signed integers. When precision is greater
+than 18, we use a 128-bit signed integer to store the decimal value.
+DATA stream stores the higher 64 bits and SECONDARY stream holds the
+lower 64 bits. Both streams use signed integer RLE v2.
--- End diff --

The main problem is that we don't have 128-bit integer RLE on hand.


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread dain
Github user dain commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181164570
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was change to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as an signed
+integer.
+
+In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
+stream as all decimal values use the same scale. When precision is
+no greater than 18, decimal values can be fully represented by DATA
+stream which stores 64-bit signed integers. When precision is greater
+than 18, we use a 128-bit signed integer to store the decimal value.
+DATA stream stores the higher 64 bits and SECONDARY stream holds the
+lower 64 bits. Both streams use signed integer RLE v2.
--- End diff --

Why split the data across two streams?  This means 2 IOs (or one large 
coalesced IO) to read the values (assuming no nulls).  Instead, can't we put 
all 128 bits in one stream?


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread wgtmac
Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181158484
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+string representation is deprecated and DecimalStatistics uses integers
+which have better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+  message Int128 {
+   repeated sint64 highBits = 1;
+   repeated uint64 lowBits = 2;
--- End diff --

Here I was aligning with C++ orc::Int128's implementation to avoid many 
casts.
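
The signed-high / unsigned-low layout in the quoted `Int128` message can be illustrated with a short sketch. `split_int128` and `join_int128` are hypothetical names, not the `orc::Int128` API; the sketch only assumes two's-complement semantics, with the sign carried by the high 64 bits.

```python
MASK64 = (1 << 64) - 1


def split_int128(value: int) -> tuple[int, int]:
    """Split a signed 128-bit value into (signed high 64 bits, unsigned low 64 bits)."""
    if not -(1 << 127) <= value < (1 << 127):
        raise ValueError("value does not fit in 128 bits")
    high = value >> 64     # arithmetic shift keeps the sign in the high half
    low = value & MASK64   # always in the range 0 .. 2**64 - 1
    return high, low


def join_int128(high: int, low: int) -> int:
    """Rebuild the signed 128-bit value from its two halves."""
    return (high << 64) | low


if __name__ == "__main__":
    high, low = split_int128(-(10 ** 38 - 1))
    print(high, low, join_int128(high, low))
```

With this split the low half never needs a sign, which is one reading of why an unsigned field avoids casts; whether the on-disk stream should be signed or unsigned is the open question in the thread.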


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread wgtmac
Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181157700
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two difference cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

@majetideepak We are already working on it and are testing and benchmarking. 
We will contribute it back, but it may not be that soon.


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181155751
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two difference cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

@xndai Vertica is interested in getting RLE v2 for C++ as well. Do you 
think we can collaborate on getting this in quickly?


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181149073
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two difference cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

I think we should keep RLE v1 as an option. The C++ writer currently does 
not support RLE v2 (we are working on it). We don't want the new decimal writer 
to have a dependency on that.


---


[GitHub] orc pull request #246: ORC-338. Workaround C++ compiler bug in xcode 9.3 by ...

2018-04-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/orc/pull/246


---


[GitHub] orc pull request #227: ORC-318. Change KeyProvider API to separate createLoc...

2018-04-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/orc/pull/227


---


[GitHub] orc pull request #246: ORC-338. Workaround C++ compiler bug in xcode 9.3 by ...

2018-04-12 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/246

ORC-338. Workaround C++ compiler bug in xcode 9.3 by removing an inline function.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-338

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #246






---


[jira] [Created] (ORC-338) Workaround C++ compiler bug in newest clang including xcode 9.3

2018-04-12 Thread Owen O'Malley (JIRA)
Owen O'Malley created ORC-338:
-

 Summary: Workaround C++ compiler bug in newest clang including xcode 9.3
 Key: ORC-338
 URL: https://issues.apache.org/jira/browse/ORC-338
 Project: ORC
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Owen O'Malley


The ColumnStatistics.intColumnStatistics test fails in Xcode 9.3 if you use 
the release build, but passes in the debug build.





[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181096194
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+string representation is deprecated and DecimalStatistics uses integers
+which have better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+  message Int128 {
+   repeated sint64 highBits = 1;
+   repeated uint64 lowBits = 2;
--- End diff --

Shouldn't this be sint64 as well, since we are using uint64 for the 
SECONDARY stream?


---