[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443048#comment-16443048
 ] 

ASF GitHub Bot commented on ORC-161:


Github user xndai commented on the issue:

https://github.com/apache/orc/pull/245
  
+1 for Gang's proposal.


> Create a new column type that run-length-encodes decimals
> ---------------------------------------------------------
>
> Key: ORC-161
> URL: https://issues.apache.org/jira/browse/ORC-161
> Project: ORC
>  Issue Type: Wish
>  Components: encoding
>Reporter: Douglas Drinka
>Priority: Major
>
> I'm storing prices in ORC format, and have made the following observations 
> about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or 
> minus a few pennies per data point. This would encode beautifully with a 
> patched base encoding.  Instead I'm averaging 4 bytes per data point, after 
> Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in 
> decimal columns, but that you probably won't.  Presto, for instance, has a 
> fast-path which engages for precision of 18 or less, and decodes to 64-bit 
> longs, and then a slow path which uses BigInt.  I anticipate the majority of 
> implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data 
> point is messy.  Sometimes it's checked on data ingest, other times it's an 
> error on reading, or else it's cast (and rounded?).
> I don't propose eliminating the current column type.  It's nice to know 
> there's a way to store really big numbers (or really accurate numbers) if I 
> need that in the future.
> But I'd like to see a new column that uses the existing Run Length Encoding 
> functionality, and is limited to 63+1 bit numbers, with a fixed precision and 
> scale for ingest and query.
> I think one could call this FixedPoint.  Every number is stored as a long, 
> and scaled by a column constant.  Ingest from decimal would scale and throw 
> or round, configurably.  Precision would be fixed at 18, or made configurable 
> and verified at ingest.  Stats would use longs (scaled with the column) 
> rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits 
> of precision.  Or they can keep using decimal if they need 128 bits.  Win/win?
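
A minimal sketch of the rescaling rule the proposal describes, assuming each decimal arrives as an unscaled integer plus a source scale; the function name and the throw-or-round flag are illustrative, not an existing ORC API:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical FixedPoint ingest: rescale (unscaled, srcScale) to the
// column's fixed scale, throwing or rounding (configurably) when digits
// would be lost. Overflow checks are omitted for brevity.
int64_t rescaleToColumn(int64_t unscaled, int32_t srcScale,
                        int32_t columnScale, bool roundOnIngest) {
  if (srcScale <= columnScale) {
    // Scale up: multiply by 10^(columnScale - srcScale).
    for (int32_t i = srcScale; i < columnScale; ++i) unscaled *= 10;
    return unscaled;
  }
  // Scale down: drop digits, rounding half away from zero if allowed.
  int64_t divisor = 1;
  for (int32_t i = columnScale; i < srcScale; ++i) divisor *= 10;
  int64_t quotient = unscaled / divisor;
  int64_t remainder = unscaled % divisor;
  if (remainder != 0) {
    if (!roundOnIngest)
      throw std::invalid_argument("value requires rounding at column scale");
    if (2 * (remainder < 0 ? -remainder : remainder) >= divisor)
      quotient += (unscaled < 0 ? -1 : 1);
  }
  return quotient;  // stored as a plain long and RLE-encoded like any other
}
```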





[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442991#comment-16442991
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245
  
As the ORC v2 spec may take a long time to finalize, develop, and test (a 
non-trivial structural change to become production-ready), and we have a very 
large amount of decimal-typed data in production awaiting this optimization, 
I'd propose optimizing the decimal encoding/statistics in 1.5 or probably 1.6. 
We can extend RLEv1 to support 128-bit integers with little effort, and replace 
it with RLEv3 in ORC 2.0. I will elaborate on this in the next patch.
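
As a rough illustration of what "extend RLEv1 to support 128-bit integers" could mean at the byte level: RLEv1 already writes its values as base-128 varints, so a 128-bit value simply needs up to 19 varint bytes instead of 10. A sketch, assuming GCC/Clang's __int128 extension; this is not the ORC writer's actual code:

```cpp
#include <cstdint>
#include <vector>

// Write one unsigned 128-bit value as a base-128 (LEB128-style) varint,
// the same byte format RLEv1 uses for its 64-bit values.
void writeVarint128(unsigned __int128 value, std::vector<uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>(value & 0x7f) | 0x80);  // 7 bits + continue bit
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));  // final byte, continue bit clear
}
```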




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440485#comment-16440485
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181966485
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was changed to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodings are introduced and
--- End diff --

We may also add dictionary encodings for integer and floating-point types. I 
think we can still use DICTIONARY_Vx, and it has a unique meaning for each 
data type.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440483#comment-16440483
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181965817
  
--- Diff: site/_docs/encodings.md ---
@@ -122,6 +132,12 @@ DIRECT| PRESENT | Yes  | Boolean RLE
 DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
+DECIMAL_V1| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | Yes  | Signed Integer RLE v1
--- End diff --

We may use LEB128. Thoughts?




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440375#comment-16440375
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181950612
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was changed to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodings are introduced and
--- End diff --

As I have said, we didn't see any obvious benefit to using RLEv2 over RLEv1 
in our experiments. So I don't think it is a good idea to not provide an 
option to choose the RLE version.





[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440373#comment-16440373
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181950089
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+the string representation is deprecated and DecimalStatistics uses
+integers, which have better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+  message Int128 {
--- End diff --

Good suggestion!




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438037#comment-16438037
 ] 

ASF GitHub Bot commented on ORC-161:


Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181456234
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+the string representation is deprecated and DecimalStatistics uses
+integers, which have better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+  message Int128 {
--- End diff --

Let's pull the Int128 out of DecimalStatistics. We will likely use it other 
places.

One concern with this representation is that -1 is pretty painful. You'll 
get highBits = -1, lowBits = -1, which will take only 1 byte for highBits, but 
10 bytes for lowBits (+ the 4 bytes of field identifiers & message length) = 15 
bytes total. Another alternative is to use the zigzag encoding for the combined 
128-bit value:

  optional uint64 minLow = 4;
  optional uint64 minHigh = 5;

p <= 18:
  minLow = zigzag(min)
  minHigh = 0

p > 18:
  minLow = low bits of zigzag(min)
  minHigh = high bits of zigzag(min)

That would have a representation of 1 byte each for minLow and minHigh + 2 
bytes for field identifiers = 4 bytes. If we keep the Int128 level, that would 
add an additional 2 bytes.
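
A sketch of the combined-zigzag alternative described above, assuming GCC/Clang's __int128 extension; it shows why min = -1 becomes cheap, since zigzag maps it to 1 and both 64-bit halves then fit in one varint byte each:

```cpp
#include <cstdint>
#include <cstdio>

// Zigzag over the full 128-bit value: small-magnitude negatives map to
// small unsigned values, so both halves encode compactly.
static unsigned __int128 zigzag128(__int128 n) {
  return (static_cast<unsigned __int128>(n) << 1) ^
         static_cast<unsigned __int128>(n >> 127);
}

int main() {
  __int128 min = -1;
  unsigned __int128 z = zigzag128(min);               // == 1
  uint64_t minLow = static_cast<uint64_t>(z);         // low 64 bits -> 1
  uint64_t minHigh = static_cast<uint64_t>(z >> 64);  // high 64 bits -> 0
  printf("minLow=%llu minHigh=%llu\n",
         (unsigned long long)minLow, (unsigned long long)minHigh);
  return 0;
}
```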




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438039#comment-16438039
 ] 

ASF GitHub Bot commented on ORC-161:


Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181525137
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was changed to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodings are introduced and
--- End diff --

In ORC v2, we'll just pick an RLE and not leave it selectable.

In terms of the encoding names, I'm a bit torn. My original inclination 
would be to use DECIMAL64 and DECIMAL128 as encoding names. However, it would 
be nice to have the ability to use dictionaries, so we'd need dictionary forms 
of them too. Thoughts?




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438038#comment-16438038
 ] 

ASF GitHub Bot commented on ORC-161:


Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181525462
  
--- Diff: site/_docs/encodings.md ---
@@ -122,6 +132,12 @@ DIRECT| PRESENT | Yes  | Boolean RLE
 DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
+DECIMAL_V1| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | Yes  | Signed Integer RLE v1
--- End diff --

We probably need an RLE128 that can encode an int128 directly. Then we 
can just use the DATA stream on its own.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436270#comment-16436270
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245
  
Will provide them after a comprehensive benchmark.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436216#comment-16436216
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181203617
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+:-------- | :---------- | :------- | :-------
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2

Part of this discussion is about the new ORC format, where existing-reader 
compatibility is not a requirement until we switch to the new format as the 
default.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436214#comment-16436214
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181202668
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was changed to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, the DECIMAL encoding is introduced and the scale stream is
+removed entirely, as all decimal values use the same scale. When
+precision is no greater than 18, decimal values can be fully represented
+by the DATA stream, which stores 64-bit signed integers. When precision
+is greater than 18, we use a 128-bit signed integer to store the decimal
+value: the DATA stream stores the high 64 bits and the SECONDARY stream
+holds the low 64 bits. Both streams use signed integer RLE v2.
--- End diff --

The multiple-stream + row-group stride problems for IO were discussed by 
Owen.

The disk layout is what matters for IO, not the logical stream separation.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436183#comment-16436183
 ] 

ASF GitHub Bot commented on ORC-161:


Github user prasanthj commented on the issue:

https://github.com/apache/orc/pull/245
  
"we found RLEv1 + zstd may be the best combination than others in terms of 
both compression ration and encoding/decoding speed."

do you have experimental numbers for this?




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436156#comment-16436156
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245
  
On second thought, I added DECIMAL_V1 back to support RLE v1 in decimal 
encoding. The reason is that in our testing we found RLEv1 + zstd may be the 
best combination in terms of both compression ratio and encoding/decoding 
speed.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436038#comment-16436038
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181168352
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was changed to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, the DECIMAL encoding is introduced and the scale stream is
+removed entirely, as all decimal values use the same scale. When
+precision is no greater than 18, decimal values can be fully represented
+by the DATA stream, which stores 64-bit signed integers. When precision
+is greater than 18, we use a 128-bit signed integer to store the decimal
+value: the DATA stream stores the high 64 bits and the SECONDARY stream
+holds the low 64 bits. Both streams use signed integer RLE v2.
--- End diff --

The main problem is that we don't have 128-bit integer RLE on hand.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436017#comment-16436017
 ] 

ASF GitHub Bot commented on ORC-161:


Github user dain commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181164570
  
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was changed to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, the DECIMAL encoding is introduced and the scale stream is
+removed entirely, as all decimal values use the same scale. When
+precision is no greater than 18, decimal values can be fully represented
+by the DATA stream, which stores 64-bit signed integers. When precision
+is greater than 18, we use a 128-bit signed integer to store the decimal
+value: the DATA stream stores the high 64 bits and the SECONDARY stream
+holds the low 64 bits. Both streams use signed integer RLE v2.
--- End diff --

Why split the data across two streams?  This means 2 IOs (or one large 
coalesced IO) to read the values (assuming no nulls).  Instead, can't we put 
all 128 bits in one stream?
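
For reference, this is the recombination the split layout implies on the read side, under the proposal's assumption that DATA carries the high 64 bits and SECONDARY the low 64 bits. A sketch using GCC/Clang's __int128 extension, not actual reader code:

```cpp
#include <cstdint>

// Rebuild one 128-bit value from its two per-stream halves; a reader must
// consume one entry from each stream per value, which is the concern above.
__int128 combineHalves(int64_t highFromData, uint64_t lowFromSecondary) {
  unsigned __int128 bits =
      (static_cast<unsigned __int128>(static_cast<uint64_t>(highFromData)) << 64) |
      lowFromSecondary;
  return static_cast<__int128>(bits);  // reinterpret the 128 bits as signed
}
```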




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435979#comment-16435979
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181158484
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+the string representation is deprecated and DecimalStatistics uses
+integers, which have better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+  message Int128 {
+   repeated sint64 highBits = 1;
+   repeated uint64 lowBits = 2;
--- End diff --

Here I was aligning with the C++ orc::Int128 implementation to avoid a lot 
of casts.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435974#comment-16435974
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181157700
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+:-------- | :---------- | :------- | :-------
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

@majetideepak We are already working on it and running tests and benchmarks. 
We will contribute it back, but it may not be that soon.


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435961#comment-16435961
 ] 

ASF GitHub Bot commented on ORC-161:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181155751
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

@xndai Vertica is interested in getting RLE v2 for C++ as well. Do you 
think we can collaborate on getting this in quickly?


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435904#comment-16435904
 ] 

ASF GitHub Bot commented on ORC-161:


Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181149073
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

I think we should keep RLE v1 as an option. The C++ writer currently does 
not support RLE v2 (we are working on it), and we don't want the new decimal 
writer to have a dependency on that.
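
For illustration, keeping the RLE version selectable might look like the sketch 
below on the writer side, so the decimal writer never hard-codes a dependency on 
RLE v2. The enum and class names are hypothetical, not the real ORC C++ classes.

```cpp
#include <memory>

// Hypothetical stand-ins for the writer's RLE codecs.
struct RleEncoder { virtual ~RleEncoder() = default; };
struct RleV1Encoder : RleEncoder {};
struct RleV2Encoder : RleEncoder {};

enum class RleVersion { V1, V2 };

// The decimal writer asks a factory for "an integer RLE encoder" and stays
// agnostic about which version the encoding (DECIMAL vs. DECIMAL_V2) selected.
std::unique_ptr<RleEncoder> createIntegerRle(RleVersion v) {
  if (v == RleVersion::V1) return std::make_unique<RleV1Encoder>();
  return std::make_unique<RleV2Encoder>();
}
```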


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435647#comment-16435647
 ] 

ASF GitHub Bot commented on ORC-161:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181096194
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of 
false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+the string representation is deprecated and DecimalStatistics uses
+integers, which offer better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+  message Int128 {
+   repeated sint64 highBits = 1;
+   repeated uint64 lowBits = 2;
--- End diff --

Shouldn't this be sint64 as well, since we are using uint64 for the 
SECONDARY stream?


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434909#comment-16434909
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180961097
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into two
+parts: a signed integer holds the upper 64 bits and an unsigned integer
+holds the lower 64 bits. The DATA stream therefore stores the upper
+64-bit signed half and the SECONDARY stream holds the lower 64-bit
+unsigned half. Both streams use RLE, and neither is optional in this
+case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v2
--- End diff --

This would be hacky since we use int64_t and uint64_t to represent Int128 
in C++. I can force the use of signed integer RLE for the uint64_t values. 
Not sure if Java can do the same thing.
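
One plausible reading of "force signed integer RLE for the uint64_t values" is 
a pure bit-reinterpretation before handing the lower words to the signed codec; 
the sketch below is that interpretation, not necessarily what the patch will do. 
In Java the question largely disappears, since long is already signed and can 
carry the same 64 bits directly.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the lower 64 bits as signed without changing any bits, so a
// signed-integer RLE codec can carry them; memcpy preserves the bit pattern
// and avoids implementation-defined narrowing conversions.
inline int64_t asSigned(uint64_t bits) {
  int64_t out;
  std::memcpy(&out, &bits, sizeof out);
  return out;
}

inline uint64_t asUnsigned(int64_t bits) {
  uint64_t out;
  std::memcpy(&out, &bits, sizeof out);
  return out;
}
```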


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434903#comment-16434903
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180959510
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,26 @@ For booleans, the statistics include the count of 
false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+the string representation is deprecated and DecimalStatistics uses
+integers, which offer better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ // for precision <= 18 
+ optional sint64 minimum64 = 4;
+ optional sint64 maximum64 = 5;
+ optional sint64 sum64 = 6;
--- End diff --

Makes sense; I will combine them into a sum128.
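
For context on what a combined sum128 involves, here is a minimal two-word 
accumulator with manual carry handling. The Sum128 name is illustrative, and 
overflow of the full 128-bit range is deliberately not handled in the sketch.

```cpp
#include <cstdint>

// Two's-complement 128-bit sum kept as a signed high word and an unsigned
// low word, updated with an explicit carry out of the low word.
struct Sum128 {
  int64_t  hi = 0;
  uint64_t lo = 0;

  void add(int64_t v) {
    uint64_t before = lo;
    lo += static_cast<uint64_t>(v);            // wraps modulo 2^64
    // Sign-extend v into the high word (-1 for negatives, 0 otherwise),
    // then add the carry produced by the low-word addition.
    hi += (v < 0 ? -1 : 0) + (lo < before ? 1 : 0);
  }
};
```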


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434902#comment-16434902
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180959364
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into two
+parts: a signed integer holds the upper 64 bits and an unsigned integer
+holds the lower 64 bits. The DATA stream therefore stores the upper
+64-bit signed half and the SECONDARY stream holds the lower 64-bit
+unsigned half. Both streams use RLE, and neither is optional in this
+case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
--- End diff --

Sorry, a copy-and-paste mistake. Will fix it.


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434900#comment-16434900
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180959255
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into two
+parts: a signed integer holds the upper 64 bits and an unsigned integer
+holds the lower 64 bits. The DATA stream therefore stores the upper
+64-bit signed half and the SECONDARY stream holds the lower 64-bit
+unsigned half. Both streams use RLE, and neither is optional in this
+case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
--- End diff --

Yep.


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434899#comment-16434899
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180959089
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

For decimals, current Integer RLE will be used. As you have explained, I 
agree that we should not use old RLE v1.


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434894#comment-16434894
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180958085
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into two
+parts: a signed integer holds the upper 64 bits and an unsigned integer
+holds the lower 64 bits. The DATA stream therefore stores the upper
+64-bit signed half and the SECONDARY stream holds the lower 64-bit
+unsigned half. Both streams use RLE, and neither is optional in this
+case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v2
--- End diff --

Moving the sign bits to the lower bits has an ease-of-use advantage 
(decimal64 then just suppresses the high-bits stream).
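
One concrete way to get that property, on my reading of the suggestion: zigzag 
the full 128-bit value before splitting, so the sign lands in the low word and 
any value that fits in 64 bits leaves the high word exactly zero, letting a 
decimal64 reader suppress the high-bits stream entirely. A sketch under that 
assumption (`__int128` is a GCC/Clang extension):

```cpp
#include <cstdint>

// Zigzag moves the sign into the lowest bit, so |v| < 2^63 implies the
// encoded value fits in 64 bits and the high word is zero.
inline unsigned __int128 zigzag128(__int128 v) {
  return (static_cast<unsigned __int128>(v) << 1) ^
         static_cast<unsigned __int128>(v >> 127);  // arithmetic shift
}

inline void splitZigzag(__int128 v, uint64_t& hi, uint64_t& lo) {
  unsigned __int128 z = zigzag128(v);
  hi = static_cast<uint64_t>(z >> 64);  // all zeros for 64-bit-sized values
  lo = static_cast<uint64_t>(z);
}
```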


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434892#comment-16434892
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180957943
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,26 @@ For booleans, the statistics include the count of 
false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+the string representation is deprecated and DecimalStatistics uses
+integers, which offer better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ // for precision <= 18 
+ optional sint64 minimum64 = 4;
+ optional sint64 maximum64 = 5;
+ optional sint64 sum64 = 6;
--- End diff --

sum64 needs to be an Int128.


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434890#comment-16434890
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180957684
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into two
+parts: a signed integer holds the upper 64 bits and an unsigned integer
+holds the lower 64 bits. The DATA stream therefore stores the upper
+64-bit signed half and the SECONDARY stream holds the lower 64-bit
+unsigned half. Both streams use RLE, and neither is optional in this
+case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+  | SECONDARY   | No   | Unsigned Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
--- End diff --

RLE v2 for all integers.


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434889#comment-16434889
 ] 

ASF GitHub Bot commented on ORC-161:


Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r180957651
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into two
+parts: a signed integer holds the upper 64 bits and an unsigned integer
+holds the lower 64 bits. The DATA stream therefore stores the upper
+64-bit signed half and the SECONDARY stream holds the lower 64-bit
+unsigned half. Both streams use RLE, and neither is optional in this
+case.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
--- End diff --

This is when Decimal v1 can be retired from the encodings.


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434759#comment-16434759
 ] 

ASF GitHub Bot commented on ORC-161:


Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245
  
@t3rmin4t0r @omalley @majetideepak @xndai 
Any suggestions or concerns? If we can finalize this, I can start working on 
it.


[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434741#comment-16434741
 ] 

ASF GitHub Bot commented on ORC-161:


GitHub user wgtmac opened a pull request:

https://github.com/apache/orc/pull/245

ORC-161: Proposal for new decimal encodings and statistics.

The new decimal encoding proposal is added to the docs for discussion.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wgtmac/orc decimal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #245


commit d7dd529fc44e27a40f22e92e74f8315b80994e5b
Author: Gang Wu 
Date:   2018-04-12T00:02:32Z

Proposal for new decimal encodings and statistics.




[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2018-03-14 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398161#comment-16398161
 ] 

Gang Wu commented on ORC-161:
-

Recently I have done some benchmarks comparing ORC with our proprietary file 
format. The results indicate that ORC does not perform well on the decimal 
type. Based on the aforementioned discussion in this JIRA, I have a proposal 
for a new encoding approach for the decimal type (I don't think adding another 
kind of decimal type is a good choice, as it may confuse users). My proposal 
works as follows:
 # As Hive already has precision and scale specified in the type, we can 
remove the SECONDARY stream, which currently stores the scale of each element, 
entirely.
 # Since a 128-bit integer is used to represent a decimal value and RLE 
supports at most 64-bit integers, we have two cases here.
 ** If precision <= 18, the whole decimal value can be represented in a 
signed 64-bit integer. Therefore we only need a DATA stream, encoded with 
signed integer RLE.
 ** If precision > 18, we need a signed 128-bit integer. A solution is to use 
a signed 64-bit integer to hold the upper 64 bits and an unsigned 64-bit 
integer to hold the lower 64 bits (the C++ version does exactly this). In this 
way, we can use the DATA stream with signed integer RLE to store the upper 64 
bits and the SECONDARY stream with unsigned integer RLE to store the lower 64 
bits (see the sketch below).
 # DecimalStatistics uses the string type to store min/max/sum. We may also 
replace these with the sint64/uint64 combination above to represent a 128-bit 
integer. This can save a lot of space.

Any thoughts? [~owen.omalley]
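
A minimal sketch of the proposed split, assuming each value has already been 
scaled to the column's fixed scale. The class and member names are 
illustrative, not ORC's writer API; `__int128` (a GCC/Clang extension) stands 
in for the 128-bit representation.

```cpp
#include <cstdint>
#include <vector>

// precision <= 18: one signed 64-bit DATA stream.
// precision  > 18: signed upper 64 bits in DATA, unsigned lower 64 bits
//                  in SECONDARY.
struct DecimalColumnSketch {
  int precision;
  std::vector<int64_t>  data;       // DATA stream values
  std::vector<uint64_t> secondary;  // SECONDARY stream values (wide case)

  void add(__int128 unscaled) {     // already at the column's scale
    if (precision <= 18) {
      data.push_back(static_cast<int64_t>(unscaled));
    } else {
      data.push_back(static_cast<int64_t>(static_cast<uint64_t>(
          static_cast<unsigned __int128>(unscaled) >> 64)));
      secondary.push_back(static_cast<uint64_t>(unscaled));
    }
  }
};
```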

[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2017-03-17 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930387#comment-15930387
 ] 

Owen O'Malley commented on ORC-161:
---

When Hive first introduced decimal, the bounds weren't specified and varied 
per object. That is *really* problematic for SQL, so since Hive 0.12 all 
decimals have had precision and scale specified in the type. Thus, although 
there is still support for per-object precision and scale, we can and should 
move to enforcing the type's scale and precision.

Thus, we absolutely need an improved decimal encoding for ORC. It was hard to 
make big changes to ORC before Hive switched to using this implementation, but 
that is done now.

If someone wants to work on this, it would be great. As you wrote above, an 
encoding like the one for longs would be great for values with precision <= 
18. In particular, we should not encode the scale at all, and should force 
all of the values to use the scale from the type. Since we don't have a 
128-bit RLE, the longer-precision decimals should probably be stored as a 
pair of RLE long streams.
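
To make the precision <= 18 case concrete, here is a rough sketch 
(hypothetical names, and only one of several possible ingest policies) of 
reducing a decimal with a type-fixed scale to a single signed long:

{code:java}
import java.math.BigDecimal;
import java.math.RoundingMode;

// Sketch only: with the scale fixed by the type, a decimal(18,6) value
// becomes one signed long that the existing RLE streams can already encode.
public class ScaledLongDecimal {

    static long toScaledLong(BigDecimal value, int typeScale) {
        // RoundingMode.UNNECESSARY throws if the value doesn't already fit
        // the declared scale; a writer could be configured to round instead.
        return value.setScale(typeScale, RoundingMode.UNNECESSARY)
                .unscaledValue()
                .longValueExact(); // throws if the value needs more than 64 bits
    }

    static BigDecimal fromScaledLong(long scaled, int typeScale) {
        return BigDecimal.valueOf(scaled, typeScale);
    }

    public static void main(String[] args) {
        long encoded = toScaledLong(new BigDecimal("19.99"), 6);
        System.out.println(encoded);                    // 19990000
        System.out.println(fromScaledLong(encoded, 6)); // 19.990000
    }
}
{code}

Whether ingest throws or rounds stays a writer-side policy decision; the 
stream itself only ever sees longs.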



[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

2017-03-16 Thread Douglas Drinka (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928428#comment-15928428
 ] 

Douglas Drinka commented on ORC-161:


Owen, if you think this is something that would be accepted as a pull 
request, my employer would put some money towards a bounty to get the work 
done right away. Do any of the contributors around here need a couple-day 
paid project? Is that appropriate to ask?
