[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434909#comment-16434909
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180961097
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same scale.
+There are two distinct cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into
+two parts: a signed integer holding the upper 64 bits and an unsigned
+integer holding the lower 64 bits. The DATA stream stores the upper
+64-bit signed integers of the decimal values, and the SECONDARY stream
+holds the lower 64-bit unsigned integers.
+Both streams use RLE and neither is optional in this case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v2
--- End diff --

This would be hacky since we use int64_t and uint64_t to represent Int128
in C++. I can force signed integer RLE for the uint64_t values, but I'm not
sure if Java can do the same thing.
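
For context, here is a minimal C++ sketch of the high/low split described
in the proposal. The helper names are hypothetical; it uses the compiler's
__int128 for illustration, whereas the ORC C++ code keeps an
int64_t/uint64_t pair as noted above:

```
#include <cstdint>
#include <utility>

// Split a 128-bit unscaled decimal value into the signed upper 64 bits
// (DATA stream) and the unsigned lower 64 bits (SECONDARY stream).
std::pair<int64_t, uint64_t> splitInt128(__int128 value) {
  int64_t high = static_cast<int64_t>(value >> 64);  // arithmetic shift keeps the sign
  uint64_t low = static_cast<uint64_t>(value);       // truncates to the low 64 bits
  return {high, low};
}

// Recombine the two halves on the read path.
__int128 combineInt128(int64_t high, uint64_t low) {
  unsigned __int128 bits =
      (static_cast<unsigned __int128>(static_cast<uint64_t>(high)) << 64) | low;
  return static_cast<__int128>(bits);
}
```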


> Create a new column type that run-length-encodes decimals
> ----------------------------------------------------------
>
> Key: ORC-161
> URL: https://issues.apache.org/jira/browse/ORC-161
> Project: ORC
>  Issue Type: Wish
>  Components: encoding
>Reporter: Douglas Drinka
>Priority: Major
>
> I'm storing prices in ORC format, and have made the following observations 
> about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or 
> minus a few pennies per data point. This would encode beautifully with a 
> patched base encoding.  Instead I'm averaging 4 bytes per data point, after 
> Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in 
> decimal columns, but that you probably won't.  Presto, for instance, has a 
> fast-path which engages for precision of 18 or less, and decodes to 64-bit 
> longs, and then a slow path which uses BigInt.  I anticipate the majority of 
> implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data 
> point, is messy.  Sometimes it's checked on data ingest, other times it's an 
> error on reading, or else it's cast (and rounded?)
> I don't propose eliminating the current column type.  It's nice to know 
> there's a way to store really big numbers (or really accurate numbers) if I 
> need that in the future.
> But I'd like to see a new column that uses the existing Run Length Encoding 
> functionality, and is limited to 63+1 bit numbers, with a fixed precision and 
> scale for ingest and query.
> I think one could call this FixedPoint.  Every number is stored as a long, 
> and scaled by a column constant.  Ingest from decimal would scale and throw 
> or round, configurably.  Precision would be fixed at 18, or made configurable 
> and verified at ingest.  Stats would use longs (scaled with the column) 
> rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits 
> of precision.  Or they can keep using decimal if they need 128 bits.  Win/win?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434903#comment-16434903
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180959510
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,26 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+the string representation is deprecated and DecimalStatistics uses
+integers, which perform better.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ // for precision <= 18 
+ optional sint64 minimum64 = 4;
+ optional sint64 maximum64 = 5;
+ optional sint64 sum64 = 6;
--- End diff --

Makes sense, I will combine them into a sum128.
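
For reference, a minimal C++ sketch of what such a sum128 accumulator
could look like (hypothetical names, not the final protobuf layout): the
128-bit sum is held as a signed high word plus an unsigned low word,
mirroring the high/low decimal split:

```
#include <cstdint>

// Hypothetical 128-bit sum held as two 64-bit halves.
struct Sum128 {
  int64_t high = 0;   // signed upper 64 bits
  uint64_t low = 0;   // unsigned lower 64 bits

  void add(int64_t value) {
    uint64_t oldLow = low;
    low += static_cast<uint64_t>(value);
    // Sign-extend value into the high word and propagate any carry
    // out of the low word.
    high += (value < 0 ? -1 : 0) + (low < oldLow ? 1 : 0);
  }
};
```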





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434902#comment-16434902
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180959364
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same scale.
+There are two distinct cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into
+two parts: a signed integer holding the upper 64 bits and an unsigned
+integer holding the lower 64 bits. The DATA stream stores the upper
+64-bit signed integers of the decimal values, and the SECONDARY stream
+holds the lower 64-bit unsigned integers.
+Both streams use RLE and neither is optional in this case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
--- End diff --

Sorry, a copy-and-paste mistake. Will fix it.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434900#comment-16434900
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180959255
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same scale.
+There are two distinct cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into
+two parts: a signed integer holding the upper 64 bits and an unsigned
+integer holding the lower 64 bits. The DATA stream stores the upper
+64-bit signed integers of the decimal values, and the SECONDARY stream
+holds the lower 64-bit unsigned integers.
+Both streams use RLE and neither is optional in this case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
--- End diff --

Yep.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434899#comment-16434899
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180959089
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same scale.
+There are two distinct cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

For decimals, the current integer RLE will be used. As you explained, I
agree that we should not use the old RLE v1.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434894#comment-16434894
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180958085
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same scale.
+There are two distinct cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into
+two parts: a signed integer holding the upper 64 bits and an unsigned
+integer holding the lower 64 bits. The DATA stream stores the upper
+64-bit signed integers of the decimal values, and the SECONDARY stream
+holds the lower 64-bit unsigned integers.
+Both streams use RLE and neither is optional in this case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v2
--- End diff --

Moving the sign bit to the lower bits has an ease-of-use advantage
(decimal64 then just suppresses the high-bit stream).
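
A sketch of that idea, assuming a standard zigzag transform (the same
trick used by protobuf's sint types and ORC's signed integer RLE): the
sign bit moves to the least-significant bit, so values that fit in 64
bits leave the upper word all zeros and the high-bit stream can simply
be suppressed:

```
#include <cstdint>

// Standard zigzag: small negative and positive values both map to small
// unsigned values, so for precision <= 18 the encoded upper 64 bits are
// always zero.
uint64_t zigzagEncode(int64_t value) {
  return (static_cast<uint64_t>(value) << 1) ^
         static_cast<uint64_t>(value >> 63);  // arithmetic shift: 0 or ~0
}

int64_t zigzagDecode(uint64_t encoded) {
  return static_cast<int64_t>(encoded >> 1) ^ -static_cast<int64_t>(encoded & 1);
}
```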





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434892#comment-16434892
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180957943
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,26 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+the string representation is deprecated and DecimalStatistics uses
+integers, which perform better.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ // for precision <= 18 
+ optional sint64 minimum64 = 4;
+ optional sint64 maximum64 = 5;
+ optional sint64 sum64 = 6;
--- End diff --

sum64 needs to be Int128 





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434890#comment-16434890
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180957684
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same scale.
+There are two distinct cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into
+two parts: a signed integer holding the upper 64 bits and an unsigned
+integer holding the lower 64 bits. The DATA stream stores the upper
+64-bit signed integers of the decimal values, and the SECONDARY stream
+holds the lower 64-bit unsigned integers.
+Both streams use RLE and neither is optional in this case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
--- End diff --

RLEv2 for all integers 





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434889#comment-16434889
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180957651
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same scale.
+There are two distinct cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into
+two parts: a signed integer holding the upper 64 bits and an unsigned
+integer holding the lower 64 bits. The DATA stream stores the upper
+64-bit signed integers of the decimal values, and the SECONDARY stream
+holds the lower 64-bit unsigned integers.
+Both streams use RLE and neither is optional in this case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
--- End diff --

This is when Decimal v1 can be retired from the encodings.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434759#comment-16434759
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245
  
@t3rmin4t0r @omalley @majetideepak @xndai 
Any suggestions or concerns? If we can finalize this, I can start working
on it.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434741#comment-16434741
 ] 

ASF GitHub Bot commented on ORC-161:


GitHub user wgtmac opened a pull request:

https://github.com/apache/orc/pull/245

ORC-161: Proposal for new decimal encodings and statistics.

The new decimal encoding proposal has been added to the docs for discussion.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wgtmac/orc decimal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #245


commit d7dd529fc44e27a40f22e92e74f8315b80994e5b
Author: Gang Wu 
Date:   2018-04-12T00:02:32Z

Proposal for new decimal encodings and statistics.







--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ORC-281) Fix compiler warnings from clang 5.0

2018-04-11 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated ORC-281:
--
Fix Version/s: 1.4.4

> Fix compiler warnings from clang 5.0
> ------------------------------------
>
> Key: ORC-281
> URL: https://issues.apache.org/jira/browse/ORC-281
> Project: ORC
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
> Fix For: 1.4.4, 1.5.0
>
>
> We're currently getting:
> {code}
> [ 43%] Building CXX object c++/src/CMakeFiles/orc.dir/io/InputStream.cc.o
> In file included from 
> /home/travis/build/apache/orc/c++/src/io/InputStream.cc:20:
> /home/travis/build/apache/orc/c++/src/io/InputStream.hh:75:13: error: 
> '~SeekableArrayInputStream' overrides a destructor but is not marked 
> 'override' [-Werror,-Winconsistent-missing-destructor-override]
> virtual ~SeekableArrayInputStream();
> ^
> /home/travis/build/apache/orc/c++/src/io/InputStream.hh:53:13: note: 
> overridden virtual function is here
> virtual ~SeekableInputStream();
> ^
> /home/travis/build/apache/orc/c++/src/io/InputStream.hh:104:13: error: 
> '~SeekableFileInputStream' overrides a destructor but is not marked 
> 'override' [-Werror,-Winconsistent-missing-destructor-override]
> virtual ~SeekableFileInputStream();
> ^
> /home/travis/build/apache/orc/c++/src/io/InputStream.hh:53:13: note: 
> overridden virtual function is here
> virtual ~SeekableInputStream();
> ^
> {code}
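
The fix pattern is mechanical: mark each overriding destructor with
`override`. A minimal sketch using the class names from the error output
above (assumed declarations, not the exact patch):

{code}
class SeekableInputStream {
public:
  virtual ~SeekableInputStream();
};

class SeekableArrayInputStream : public SeekableInputStream {
public:
  // 'override' silences -Winconsistent-missing-destructor-override
  ~SeekableArrayInputStream() override;
};
{code}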



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-330) Remove unnecessary Hive artifacts from root pom

2018-04-11 Thread Daniel Voros (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434568#comment-16434568
 ] 

Daniel Voros commented on ORC-330:
--

Thank you!

> Remove unnecessary Hive artifacts from root pom
> -----------------------------------------------
>
> Key: ORC-330
> URL: https://issues.apache.org/jira/browse/ORC-330
> Project: ORC
>  Issue Type: Task
>  Components: Java
>Affects Versions: 1.4.3
>Reporter: Daniel Voros
>Assignee: Daniel Voros
>Priority: Minor
> Fix For: 1.4.4, 1.5.0
>
>
> dependencyManagement was defined for some Hive artifacts necessary for 
> benchmarking. Since those were moved out in ORC-298, these are no longer 
> (transitive) dependencies of Orc and might be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ORC-332) Add syntax version to orc_proto.proto

2018-04-11 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved ORC-332.
---
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.4

> Add syntax version to orc_proto.proto
> -------------------------------------
>
> Key: ORC-332
> URL: https://issues.apache.org/jira/browse/ORC-332
> Project: ORC
>  Issue Type: Improvement
>Reporter: rip.nsk
>Assignee: rip.nsk
>Priority: Trivial
> Fix For: 1.4.4, 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ORC-330) Remove unnecessary Hive artifacts from root pom

2018-04-11 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved ORC-330.
---
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.4

I just committed this. Thanks, Daniel!

> Remove unnecessary Hive artifacts from root pom
> -----------------------------------------------
>
> Key: ORC-330
> URL: https://issues.apache.org/jira/browse/ORC-330
> Project: ORC
>  Issue Type: Task
>  Components: Java
>Affects Versions: 1.4.3
>Reporter: Daniel Voros
>Assignee: Daniel Voros
>Priority: Minor
> Fix For: 1.4.4, 1.5.0
>
>
> dependencyManagement was defined for some Hive artifacts necessary for 
> benchmarking. Since those were moved out in ORC-298, these are no longer 
> (transitive) dependencies of Orc and might be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ORC-336) Remove avro and parquet dependency management entries

2018-04-11 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved ORC-336.
---
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.4

> Remove avro and parquet dependency management entries
> -----------------------------------------------------
>
> Key: ORC-336
> URL: https://issues.apache.org/jira/browse/ORC-336
> Project: ORC
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
> Fix For: 1.4.4, 1.5.0
>
>
> When we removed the benchmark code in ORC-298, we forgot to remove the 
> dependency management entries for Avro and Parquet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ORC-330) Remove unnecessary Hive artifacts from root pom

2018-04-11 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reassigned ORC-330:
-

Assignee: Daniel Voros

> Remove unnecessary Hive artifacts from root pom
> -----------------------------------------------
>
> Key: ORC-330
> URL: https://issues.apache.org/jira/browse/ORC-330
> Project: ORC
>  Issue Type: Task
>  Components: Java
>Affects Versions: 1.4.3
>Reporter: Daniel Voros
>Assignee: Daniel Voros
>Priority: Minor
>
> dependencyManagement was defined for some Hive artifacts necessary for 
> benchmarking. Since those were moved out in ORC-298, these are no longer 
> (transitive) dependencies of Orc and might be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ORC-336) Remove avro and parquet dependency management entries

2018-04-11 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reassigned ORC-336:
-


> Remove avro and parquet dependency management entries
> -----------------------------------------------------
>
> Key: ORC-336
> URL: https://issues.apache.org/jira/browse/ORC-336
> Project: ORC
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> When we removed the benchmark code in ORC-298, we forgot to remove the 
> dependency management entries for Avro and Parquet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)