[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443048#comment-16443048 ]

ASF GitHub Bot commented on ORC-161:

GitHub user xndai commented on the issue:

    https://github.com/apache/orc/pull/245

    +1 for Gang's proposal.

> Create a new column type that run-length-encodes decimals
> ---------------------------------------------------------
>
>                 Key: ORC-161
>                 URL: https://issues.apache.org/jira/browse/ORC-161
>             Project: ORC
>          Issue Type: Wish
>          Components: encoding
>            Reporter: Douglas Drinka
>            Priority: Major
>
> I'm storing prices in ORC format, and have made the following observations
> about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or
>   minus a few pennies per data point. This would encode beautifully with a
>   patched base encoding. Instead I'm averaging 4 bytes per data point, after
>   Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in
>   decimal columns, but that you probably won't. Presto, for instance, has a
>   fast path which engages for precision of 18 or less and decodes to 64-bit
>   longs, and then a slow path which uses BigInt. I anticipate the majority of
>   implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data
>   point, is messy. Sometimes it's checked on data ingest; other times it's an
>   error on reading, or else it's cast (and rounded?).
> I don't propose eliminating the current column type. It's nice to know
> there's a way to store really big numbers (or really accurate numbers) if I
> need that in the future.
> But I'd like to see a new column type that uses the existing Run Length
> Encoding functionality and is limited to 63+1 bit numbers, with a fixed
> precision and scale for ingest and query.
> I think one could call this FixedPoint. Every number is stored as a long,
> and scaled by a column constant. Ingest from decimal would scale and throw
> or round, configurably. Precision would be fixed at 18, or made configurable
> and verified at ingest. Stats would use longs (scaled with the column)
> rather than strings.
> Anyone can opt in to faster, smaller data sets if they're OK with 63+1 bits
> of precision. Or they can keep using decimal if they need 128 bits. Win/win?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
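The FixedPoint proposal in the issue description reduces to storing each value as a 64-bit long scaled by a per-column constant. A minimal sketch of that idea, assuming hypothetical function names and a throw-or-round flag that are not part of any ORC API:

```python
from decimal import Decimal, ROUND_HALF_UP

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def fixed_point_encode(value: Decimal, scale: int, do_round: bool = False) -> int:
    """Scale a decimal into a 64-bit long; throw or round, configurably."""
    scaled = value * (Decimal(10) ** scale)
    if scaled != scaled.to_integral_value():
        if not do_round:
            raise ValueError(f"{value} does not fit scale {scale}")
        scaled = scaled.quantize(Decimal(1), rounding=ROUND_HALF_UP)
    n = int(scaled)
    if not INT64_MIN <= n <= INT64_MAX:  # the 63+1 bit limit
        raise OverflowError(f"{value} exceeds 64 bits at scale {scale}")
    return n

def fixed_point_decode(n: int, scale: int) -> Decimal:
    return Decimal(n) / (Decimal(10) ** scale)

# A decimal(18,6) price becomes a plain long:
print(fixed_point_encode(Decimal("19.99"), 6))  # 19990000
```

Because consecutive prices in a walking-random series differ by a few pennies, the scaled longs differ by small integers, which is exactly the pattern integer run-length encoding compresses well.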
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442991#comment-16442991 ]

ASF GitHub Bot commented on ORC-161:

GitHub user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/245

    As the ORC v2 spec may take a long time to finalize, develop, and test
    before such a non-trivial structural change is production-ready, and we
    have a very large amount of decimal data in production awaiting this
    optimization, I'd propose optimizing the decimal encoding/statistics in
    1.5 or probably 1.6. We can extend RLEv1 to support 128-bit integers
    with little effort, and replace it with RLEv3 in ORC 2.0. I will
    elaborate on this in the next patch.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440485#comment-16440485 ]

ASF GitHub Bot commented on ORC-161:

GitHub user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181966485

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +integer.
    +
    +In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodins are introduced and
    --- End diff --

    We may also add dictionary encodings for integer and floating-point
    types. I think we can still use DICTIONARY_Vx, and it would have a
    unique meaning for each data type.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440483#comment-16440483 ]

ASF GitHub Bot commented on ORC-161:

GitHub user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181965817

    --- Diff: site/_docs/encodings.md ---
    @@ -122,6 +132,12 @@ DIRECT | PRESENT | Yes | Boolean RLE
     DIRECT_V2 | PRESENT | Yes | Boolean RLE
      | DATA | No | Unbounded base 128 varints
      | SECONDARY | No | Unsigned Integer RLE v2
    +DECIMAL_V1 | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    + | SECONDARY | Yes | Signed Integer RLE v1
    --- End diff --

    We may use LEB128. Thoughts?
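LEB128 mentioned above is the same base-128 varint scheme ORC already uses for unbounded integers: seven payload bits per byte, the high bit set on every byte except the last, with zigzag mapping signed values to unsigned first so small magnitudes stay small. A toy round-trip, independent of the actual ORC reader/writer code:

```python
def zigzag64(n: int) -> int:
    # Map signed to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    return (n << 1) ^ (n >> 63)

def leb128_encode(u: int) -> bytes:
    # 7 value bits per byte; the high bit marks continuation.
    out = bytearray()
    while True:
        byte, u = u & 0x7F, u >> 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def leb128_decode(data: bytes) -> int:
    u = 0
    for shift, b in enumerate(data):
        u |= (b & 0x7F) << (7 * shift)
    return u

print(leb128_encode(zigzag64(-3)).hex())  # 05
```

Small deltas like -3 fit in a single byte, which is what makes this attractive for the DATA stream of near-constant decimal columns.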
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440375#comment-16440375 ]

ASF GitHub Bot commented on ORC-161:

GitHub user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181950612

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +integer.
    +
    +In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodins are introduced and
    --- End diff --

    As I have said, I don't see any obvious benefit in using RLEv2 and
    abandoning RLEv1 in our experiments, so I don't think it is a good idea
    not to provide an option to choose the RLE version.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440373#comment-16440373 ]

ASF GitHub Bot commented on ORC-161:

GitHub user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181950089

    --- Diff: site/_docs/file-tail.md ---
    @@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
     }
     ```
    -For decimals, the minimum, maximum, and sum are stored.
    +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
    +string representation is deprecated and DecimalStatistics uses integers
    +which have better performance.

     ```message DecimalStatistics {
      optional string minimum = 1;
      optional string maximum = 2;
      optional string sum = 3;
    + message Int128 {
    --- End diff --

    Good suggestion!
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438037#comment-16438037 ]

ASF GitHub Bot commented on ORC-161:

GitHub user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181456234

    --- Diff: site/_docs/file-tail.md ---
    @@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
     }
     ```
    -For decimals, the minimum, maximum, and sum are stored.
    +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
    +string representation is deprecated and DecimalStatistics uses integers
    +which have better performance.

     ```message DecimalStatistics {
      optional string minimum = 1;
      optional string maximum = 2;
      optional string sum = 3;
    + message Int128 {
    --- End diff --

    Let's pull the Int128 out of DecimalStatistics. We will likely use it in
    other places.

    One concern with this representation is that -1 is pretty painful. You'll
    get highBits = -1, lowBits = -1, which will take only 1 byte for highBits,
    but 9 bytes for lowBits (+ the 4 bytes of field identifiers & message
    length) = 14 bytes total.

    Another alternative is to use the zigzag encoding for the combined
    128-bit value:

        optional uint64 minLow = 4;
        optional uint64 minHigh = 5;

        p <= 18: minLow = zigzag(min), minHigh = 0
        p >  18: minLow = low bits of zigzag(min), minHigh = high bits of zigzag(min)

    That would have a representation of 1 byte each for minLow and minHigh
    + 2 bytes for field identifiers = 4 bytes. If we leave the Int128 level,
    that would add an additional 2 bytes.
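Owen's zigzag alternative above can be sketched directly: zigzag the full 128-bit value, then split it into two unsigned 64-bit fields, so small magnitudes (including -1) cost one varint byte per field. Illustrative only; the field names follow the proposal in the comment, not any committed protobuf schema:

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def zigzag128(n: int) -> int:
    # Map signed 128-bit to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    return ((n << 1) ^ (n >> 127)) & MASK128

def to_min_fields(minimum: int) -> tuple[int, int]:
    """Split zigzag(min) into the proposed (minLow, minHigh) uint64 pair."""
    z = zigzag128(minimum)
    return z & MASK64, z >> 64

# -1, the painful case with raw high/low int64 fields, becomes (1, 0),
# i.e. two single-byte varints instead of 1 + 9 bytes:
print(to_min_fields(-1))  # (1, 0)
```

For precision <= 18 the high word is always zero, matching the "minHigh = 0" fast path in the comment.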
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438039#comment-16438039 ]

ASF GitHub Bot commented on ORC-161:

GitHub user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181525137

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +integer.
    +
    +In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodins are introduced and
    --- End diff --

    In ORCv2, we'll just pick an RLE and not leave it selectable.

    In terms of the encoding names, I'm a bit torn. My original inclination
    would be to use DECIMAL64 and DECIMAL128 as encoding names. However, it
    would be nice to have the ability to use dictionaries, so we'd need
    dictionary forms of them too. Thoughts?
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438038#comment-16438038 ]

ASF GitHub Bot commented on ORC-161:

GitHub user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181525462

    --- Diff: site/_docs/encodings.md ---
    @@ -122,6 +132,12 @@ DIRECT | PRESENT | Yes | Boolean RLE
     DIRECT_V2 | PRESENT | Yes | Boolean RLE
      | DATA | No | Unbounded base 128 varints
      | SECONDARY | No | Unsigned Integer RLE v2
    +DECIMAL_V1 | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    + | SECONDARY | Yes | Signed Integer RLE v1
    --- End diff --

    We probably need an RLE128 that can just encode an int128 directly. Then
    we can just use the DATA stream directly.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436270#comment-16436270 ]

ASF GitHub Bot commented on ORC-161:

GitHub user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/245

    Will provide them after a comprehensive benchmark.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436216#comment-16436216 ] ASF GitHub Bot commented on ORC-161: Github user t3rmin4t0r commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r181203617 --- Diff: site/_docs/encodings.md --- @@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA| No | Unbounded base 128 varints | SECONDARY | No | Unsigned Integer RLE v2 +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale +stream is totally removed as all decimal values use the same scale. +There are two difference cases: precision<=18 and precision>18. + +### Decimal Encoding for precision <= 18 + +When precision is no greater than 18, decimal values can be fully +represented by 64-bit signed integers which are stored in DATA stream +and use signed integer RLE. + +Encoding | Stream Kind | Optional | Contents +: | :-- | :--- | :--- +DECIMAL | PRESENT | Yes | Boolean RLE + | DATA| No | Signed Integer RLE v1 +DECIMAL_V2| PRESENT | Yes | Boolean RLE + | DATA| No | Signed Integer RLE v2 --- End diff -- Some part of this discussion is about the new ORC format and existing reader compatibility is not a requirement, until we switch to the new format as a default. > Create a new column type that run-length-encodes decimals > - > > Key: ORC-161 > URL: https://issues.apache.org/jira/browse/ORC-161 > Project: ORC > Issue Type: Wish > Components: encoding >Reporter: Douglas Drinka >Priority: Major > > I'm storing prices in ORC format, and have made the following observations > about the current decimal implementation: > - The encoding is inefficient: my prices are a walking-random set, plus or > minus a few pennies per data point. This would encode beautifully with a > patched base encoding. Instead I'm averaging 4 bytes per data point, after > Zlib. 
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436214#comment-16436214 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181202668

--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was change to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as an signed
+integer.
+
+In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
+stream as all decimal values use the same scale. When precision is
+no greater than 18, decimal values can be fully represented by DATA
+stream which stores 64-bit signed integers. When precision is greater
+than 18, we use a 128-bit signed integer to store the decimal value.
+DATA stream stores the higher 64 bits and SECONDARY stream holds the
+lower 64 bits. Both streams use signed integer RLE v2.
--- End diff --

The multiple-stream + row-group stride problems for IO were discussed by Owen. The disk layout is what matters for IO, not the logical stream separation.
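The high/low split described in the added paragraph can be sketched in Python (illustrative only; the actual implementation under discussion is C++). The key arithmetic point, which also underlies the sint64/uint64 question later in the thread, is that only the high half carries the sign, while the low 64 bits are treated as unsigned:

```python
MASK64 = (1 << 64) - 1

def split_i128(value: int):
    """Split a signed 128-bit integer into the signed high 64 bits
    (DATA stream) and the unsigned low 64 bits (SECONDARY stream)."""
    assert -(1 << 127) <= value < (1 << 127)
    high = value >> 64    # arithmetic shift: the sign lives here
    low = value & MASK64  # always in [0, 2**64)
    return high, low

def join_i128(high: int, low: int) -> int:
    """Reassemble: value == high * 2**64 + low, exactly."""
    return (high << 64) | low

# -1 becomes high = -1, low = all ones, and round-trips cleanly:
h, l = split_i128(-1)
assert (h, l) == (-1, MASK64)
assert join_i128(h, l) == -1
```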
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436183#comment-16436183 ]

ASF GitHub Bot commented on ORC-161:

Github user prasanthj commented on the issue:

https://github.com/apache/orc/pull/245

"we found RLEv1 + zstd may be the best combination in terms of both compression ratio and encoding/decoding speed."

Do you have experimental numbers for this?
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436156#comment-16436156 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245

On second thought, I added back DECIMAL_V1 to support RLE v1 in the decimal encoding. The reason is that in our testing we found RLEv1 + zstd may be the best combination in terms of both compression ratio and encoding/decoding speed.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436038#comment-16436038 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181168352

--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was change to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as an signed
+integer.
+
+In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
+stream as all decimal values use the same scale. When precision is
+no greater than 18, decimal values can be fully represented by DATA
+stream which stores 64-bit signed integers. When precision is greater
+than 18, we use a 128-bit signed integer to store the decimal value.
+DATA stream stores the higher 64 bits and SECONDARY stream holds the
+lower 64 bits. Both streams use signed integer RLE v2.
--- End diff --

The main problem is that we don't have 128-bit integer RLE on hand.
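For contrast, the existing DIRECT encoding quoted at the top of the diff stores each unscaled value as an unbounded-length zigzag-encoded base-128 varint. A hedged sketch of that scheme (single-value decode shown for simplicity; a real reader consumes bytes from a stream):

```python
def zigzag(n: int) -> int:
    # Map signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4,
    # so small magnitudes of either sign stay small.
    return n << 1 if n >= 0 else ((-n) << 1) - 1

def unzigzag(u: int) -> int:
    return (u >> 1) if (u & 1) == 0 else -((u + 1) >> 1)

def varint(u: int) -> bytes:
    # Base 128: 7 payload bits per byte, MSB set on continuation bytes.
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def unvarint(data: bytes) -> int:
    u, shift = 0, 0
    for b in data:
        u |= (b & 0x7F) << shift
        shift += 7
    return u

# Arbitrarily large (127-bit) magnitudes round-trip, which is exactly
# why this encoding is hard to run-length encode efficiently:
assert unzigzag(unvarint(varint(zigzag(-(10**25))))) == -(10**25)
```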
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436017#comment-16436017 ]

ASF GitHub Bot commented on ORC-161:

Github user dain commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181164570

--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was change to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as an signed
+integer.
+
+In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
+stream as all decimal values use the same scale. When precision is
+no greater than 18, decimal values can be fully represented by DATA
+stream which stores 64-bit signed integers. When precision is greater
+than 18, we use a 128-bit signed integer to store the decimal value.
+DATA stream stores the higher 64 bits and SECONDARY stream holds the
+lower 64 bits. Both streams use signed integer RLE v2.
--- End diff --

Why split the data across two streams? This means 2 IOs (or one large coalesced IO) to read the values (assuming no nulls). Instead, can't we put all 128 bits in one stream?
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435979#comment-16435979 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181158484

--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```

-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+string representation is deprecated and DecimalStatistics uses integers
+which have better performance.

 ```
 message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ message Int128 {
+  repeated sint64 highBits = 1;
+  repeated uint64 lowBits = 2;
--- End diff --

Here I was aligning with C++ orc::Int128's implementation to avoid many casts.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435974#comment-16435974 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181157700

--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 | DATA | No | Unbounded base 128 varints
 | SECONDARY | No | Unsigned Integer RLE v2
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two difference cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v2
--- End diff --

@majetideepak We are already working on it and doing tests & benchmarks. We will contribute it back, but it may not be that soon.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435961#comment-16435961 ]

ASF GitHub Bot commented on ORC-161:

Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181155751

--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 | DATA | No | Unbounded base 128 varints
 | SECONDARY | No | Unsigned Integer RLE v2
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two difference cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v2
--- End diff --

@xndai Vertica is interested in getting RLE v2 for C++ as well. Do you think we can collaborate on getting this in quickly?
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435904#comment-16435904 ]

ASF GitHub Bot commented on ORC-161:

Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181149073

--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 | DATA | No | Unbounded base 128 varints
 | SECONDARY | No | Unsigned Integer RLE v2
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two difference cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v2
--- End diff --

I think we should keep RLE v1 as an option. The C++ writer currently does not support RLE v2 (we are working on it). We don't want the new decimal writer to have a dependency on that.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435647#comment-16435647 ]

ASF GitHub Bot commented on ORC-161:

Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181096194

--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```

-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+string representation is deprecated and DecimalStatistics uses integers
+which have better performance.

 ```
 message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ message Int128 {
+  repeated sint64 highBits = 1;
+  repeated uint64 lowBits = 2;
--- End diff --

shouldn't this be sint64 as well since we are using uint64 for the SECONDARY stream?
> - The whole concept of precision/scale, along with a dedicated scale per data > point is messy. Sometimes it's checked on data ingest, other times its an > error on reading, or else it's cast (and rounded?) > I don't propose eliminating the current column type. It's nice to know > there's a way to store really big numbers (or really accurate numbers) if I > need that in the future. > But I'd like to see a new column that uses the existing Run Length Encoding > functionality, and is limited to 63+1 bit numbers, with a fixed precision and > scale for ingest and query. > I think one could call this FixedPoint. Every number is stored as a long, > and scaled by a column constant. Ingest from decimal would scale and throw > or round, configurably. Precision would be fixed at 18, or made configurable > and verified at ingest. Stats would use longs (scaled with the column) > rather than strings. > Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits > of precision. Or they can keep using decimal if they need 128 bits. Win/win? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
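The FixedPoint proposal above (every value stored as a long, scaled by a column constant, with ingest that throws or rounds, configurably) can be sketched in a few lines. This is an illustration of the idea only, not ORC code; the function name and the `strict` flag are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

MAX_INT64 = 2**63 - 1
MIN_INT64 = -2**63

def fixed_point_ingest(value: Decimal, scale: int, strict: bool = True) -> int:
    """Scale a decimal to a 64-bit integer; throw or round, configurably.

    Hypothetical sketch of the proposed FixedPoint ingest path: the stored
    representation is just value * 10**scale as a long.
    """
    scaled = value * (Decimal(10) ** scale)
    if scaled != scaled.to_integral_value():
        if strict:
            raise ValueError(f"{value} does not fit scale {scale}")
        scaled = scaled.quantize(Decimal(1), rounding=ROUND_HALF_UP)
    result = int(scaled)
    if not MIN_INT64 <= result <= MAX_INT64:
        raise OverflowError(f"{value} exceeds 64-bit range at scale {scale}")
    return result
```

For a decimal(18,6)-style column, `fixed_point_ingest(Decimal("12.34"), 6)` stores `12340000`; an over-precise input either raises or rounds, depending on the flag.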
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434909#comment-16434909 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180961097

    --- Diff: site/_docs/encodings.md ---
    @@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
      | DATA | No | Unbounded base 128 varints
      | SECONDARY | No | Unsigned Integer RLE v2
    +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and the scale
    +stream is removed entirely, as all decimal values use the same scale.
    +There are two different cases: precision <= 18 and precision > 18.
    +
    +### Decimal Encoding for precision <= 18
    +
    +When precision is no greater than 18, decimal values can be fully
    +represented by 64-bit signed integers, which are stored in the DATA
    +stream using signed integer RLE.
    +
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    +DECIMAL_V2 | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v2
    +
    +### Decimal Encoding for precision > 18
    +
    +When precision is greater than 18, each decimal value is split into two
    +parts: a signed integer stores the upper 64 bits and an unsigned integer
    +stores the lower 64 bits. A DATA stream therefore holds the upper 64-bit
    +signed integers and a SECONDARY stream holds the lower 64-bit unsigned
    +integers. Both streams use RLE and are not optional in this case.
    +
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    + | SECONDARY | No | Unsigned Integer RLE v1
    +DECIMAL_V2 | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    + | SECONDARY | No | Unsigned Integer RLE v2
    --- End diff --

    This would be hacky, since we use int64_t and uint64_t to represent Int128 in C++. I can force the use of signed integer RLE for uint64_t integers; not sure if Java can do the same thing.
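Forcing signed integer RLE onto the uint64_t low word, as discussed above, amounts to reinterpreting the same 64 bits as two's complement, which is also why Java (whose `long` is already a signed two's-complement type) can do the same thing. A minimal sketch of that reinterpretation, in plain Python rather than the C++ cast:

```python
U64 = 2**64

def uint64_to_int64(u: int) -> int:
    """View an unsigned 64-bit value as signed; the bit pattern is unchanged.
    In C++ this is just a cast from uint64_t to int64_t."""
    assert 0 <= u < U64
    return u - U64 if u >= U64 // 2 else u

def int64_to_uint64(s: int) -> int:
    """Inverse mapping: recover the unsigned view of a signed 64-bit value."""
    assert -(U64 // 2) <= s < U64 // 2
    return s + U64 if s < 0 else s
```

The mapping is a bijection, so a signed RLE writer/reader pair round-trips every uint64 low word exactly; only the interpretation of the top bit changes.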
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434903#comment-16434903 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180959510

    --- Diff: site/_docs/file-tail.md ---
    @@ -249,12 +249,26 @@ For booleans, the statistics include the count of false and true values.
     }
     ```
    -For decimals, the minimum, maximum, and sum are stored.
    +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
    +string representation is deprecated and DecimalStatistics uses integers
    +which have better performance.
     ```message DecimalStatistics {
      optional string minimum = 1;
      optional string maximum = 2;
      optional string sum = 3;
    + // for precision <= 18
    + optional sint64 minimum64 = 4;
    + optional sint64 maximum64 = 5;
    + optional sint64 sum64 = 6;
    --- End diff --

    Makes sense, will combine them as sum128.
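The reason the reviewers settle on a 128-bit sum field ("will combine them as sum128") is simple arithmetic: even though each decimal(18) value fits in a signed 64-bit integer, the column's sum overflows int64 after only a handful of max-magnitude values, while 128 bits cover any realistic row count. A quick check:

```python
MAX_INT64 = 2**63 - 1          # about 9.2e18
MAX_DEC18 = 10**18 - 1         # largest unscaled decimal(18) value

# A 64-bit sum can hold at most 9 max-magnitude decimal(18) values:
rows_until_overflow = MAX_INT64 // MAX_DEC18   # == 9

# A 128-bit sum, by contrast, is safe even for trillions of such rows:
assert 10**12 * MAX_DEC18 < 2**127
```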
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434902#comment-16434902 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180959364

    --- Diff: site/_docs/encodings.md ---
    +### Decimal Encoding for precision > 18
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    + | SECONDARY | No | Unsigned Integer RLE v1
    +DECIMAL_V2 | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    --- End diff --

    Sorry, a copy-and-paste mistake. Will fix it.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434900#comment-16434900 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180959255

    --- Diff: site/_docs/encodings.md ---
    +### Decimal Encoding for precision > 18
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    --- End diff --

    Yep.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434899#comment-16434899 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180959089

    --- Diff: site/_docs/encodings.md ---
    +### Decimal Encoding for precision <= 18
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    +DECIMAL_V2 | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v2
    --- End diff --

    For decimals, the current Integer RLE will be used. As you have explained, I agree that we should not use the old RLE v1.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434894#comment-16434894 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180958085

    --- Diff: site/_docs/encodings.md ---
    +### Decimal Encoding for precision > 18
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    + | SECONDARY | No | Unsigned Integer RLE v1
    +DECIMAL_V2 | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    + | SECONDARY | No | Unsigned Integer RLE v2
    --- End diff --

    Moving the sign bits to the lower bits has an ease-of-use advantage (decimal64 is just suppressing the high-bit stream).
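The precision > 18 layout discussed here splits each 128-bit value into a signed high word (DATA) and an unsigned low word (SECONDARY). A sketch of that split and its inverse, in plain Python rather than the C++ int64_t/uint64_t pair; note that even a small negative value such as -5 carries a sign-extension high word of -1, which is the redundancy behind the remark about moving the sign bits into the low word:

```python
MASK64 = 2**64 - 1

def split_int128(v: int):
    """Split a signed 128-bit value into (signed high, unsigned low) words."""
    assert -2**127 <= v < 2**127
    low = v & MASK64             # unsigned low 64 bits
    high = (v - low) >> 64       # arithmetic shift keeps the sign in high
    return high, low

def join_int128(high: int, low: int) -> int:
    """Recombine the two words into the original signed value."""
    return (high << 64) | low
```

For example, `split_int128(-5)` yields `(-1, 2**64 - 5)`: the high stream is pure sign extension for any value that fits in 64 bits.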
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434892#comment-16434892 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180957943

    --- Diff: site/_docs/file-tail.md ---
    + // for precision <= 18
    + optional sint64 minimum64 = 4;
    + optional sint64 maximum64 = 5;
    + optional sint64 sum64 = 6;
    --- End diff --

    sum64 needs to be Int128.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434890#comment-16434890 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180957684

    --- Diff: site/_docs/encodings.md ---
    +### Decimal Encoding for precision > 18
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    + | SECONDARY | No | Unsigned Integer RLE v1
    +DECIMAL_V2 | PRESENT | Yes | Boolean RLE
    + | DATA | No | Signed Integer RLE v1
    --- End diff --

    RLEv2 for all integers.
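The "RLEv2 for all integers" remark above rests on how ORC's integer RLE handles sign: signed values are zigzag-encoded before run-length encoding, so small-magnitude negatives stay small on the wire. A quick sketch of the transform for the 64-bit case:

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    assert -2**63 <= n < 2**63
    return (n << 1) ^ (n >> 63)   # arithmetic shift spreads the sign bit

def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)
```

This is why a signed RLE stream remains compact for the walking-random price data in the issue description: deltas of plus or minus a few pennies map to tiny unsigned values.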
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434889#comment-16434889 ] ASF GitHub Bot commented on ORC-161: Github user t3rmin4t0r commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r180957651 --- Diff: site/_docs/encodings.md --- @@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA| No | Unbounded base 128 varints | SECONDARY | No | Unsigned Integer RLE v2 +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale +stream is totally removed as all decimal values use the same scale. +There are two difference cases: precision<=18 and precision>18. + +### Decimal Encoding for precision <= 18 + +When precision is no greater than 18, decimal values can be fully +represented by 64-bit signed integers which are stored in DATA stream +and use signed integer RLE. + +Encoding | Stream Kind | Optional | Contents +: | :-- | :--- | :--- +DECIMAL | PRESENT | Yes | Boolean RLE + | DATA| No | Signed Integer RLE v1 +DECIMAL_V2| PRESENT | Yes | Boolean RLE + | DATA| No | Signed Integer RLE v2 + +### Decimal Encoding for precision > 18 + +When precision is greater than 18, decimal value is split into two +parts: a signed integer stores higher 64 bits and an unsigned integer +stores lower 64 bits. Therefore, a DATA stream is utilized to store +the higher 64-bit signed integer of decimal values and a SECONDARY +stream holds the lower 64-bit unsigned integer of decimal values. +Both streams use RLE and are not optional in this case. + +Encoding | Stream Kind | Optional | Contents +: | :-- | :--- | :--- +DECIMAL | PRESENT | Yes | Boolean RLE --- End diff -- This is when Decimal v1 can be retired from the encodings. 
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434759#comment-16434759 ] ASF GitHub Bot commented on ORC-161: Github user wgtmac commented on the issue: https://github.com/apache/orc/pull/245 @t3rmin4t0r @omalley @majetideepak @xndai Any suggestion or concern? If we can finalize this, I can start working on it.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434741#comment-16434741 ] ASF GitHub Bot commented on ORC-161: GitHub user wgtmac opened a pull request: https://github.com/apache/orc/pull/245 ORC-161: Proposal for new decimal encodings and statistics. The new decimal encoding proposal is added to the docs for discussion. You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wgtmac/orc decimal

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/245.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #245

commit d7dd529fc44e27a40f22e92e74f8315b80994e5b
Author: Gang Wu
Date: 2018-04-12T00:02:32Z

    Proposal for new decimal encodings and statistics.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398161#comment-16398161 ] Gang Wu commented on ORC-161: Recently I have done some benchmarking between ORC and our proprietary file format. The results indicate that ORC does not perform well on the decimal type. Based on the aforementioned discussion in this JIRA, I have a proposal for adding a new encoding approach for the decimal type (I don't think adding another kind of decimal type is a good choice, as it may confuse users). My proposal works as follows:
# As Hive already has precision and scale specified in the type, we can remove the SECONDARY stream, which currently stores the scale of each element.
# Since a 128-bit integer is used to represent a decimal value and RLE supports at most 64-bit integers, we have two cases here.
** If precision <= 18, the whole decimal value can be represented in a signed 64-bit integer. Therefore we only need a DATA stream and can use signed integer RLE to encode it.
** If precision > 18, we need to use a signed 128-bit integer. One solution is to use a signed 64-bit integer to hold the higher 64 bits and an unsigned 64-bit integer to hold the lower 64 bits (the C++ version does exactly this). In this way, we can use a DATA stream with signed integer RLE to store the higher 64 bits and a SECONDARY stream with unsigned integer RLE to store the lower 64 bits.
# DecimalStatistics uses the string type to store min/max/sum. We may also replace them with the same combination of sint64 and uint64 as above to represent a 128-bit integer. This can save a lot of space.
Any thoughts?
[~owen.omalley]
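The precision <= 18 case in the proposal above can be sketched as follows: once the scale is fixed by the type, every value collapses to a single signed 64-bit integer for the DATA stream, with no per-value scale to store. This is an illustrative Python sketch with invented helper names, not ORC's actual API:

```python
from decimal import Decimal

def encode_decimal18(value: Decimal, scale: int) -> int:
    """Unscaled long for a decimal(p<=18, scale) value.

    Assumes the value already carries the column's scale; any extra
    fractional digits would simply be truncated by int() here.
    """
    unscaled = int(value.scaleb(scale))   # value * 10**scale
    assert -(10 ** 18) < unscaled < 10 ** 18, "exceeds precision 18"
    return unscaled

def decode_decimal18(unscaled: int, scale: int) -> Decimal:
    """Rebuild the decimal from the stored long and the type's scale."""
    return Decimal(unscaled).scaleb(-scale)

# A walking-random run of prices at scale 6 becomes small, close-together
# longs that signed integer RLE can encode compactly:
prices = [Decimal("19.990000"), Decimal("20.010000"), Decimal("19.950000")]
longs = [encode_decimal18(p, 6) for p in prices]
# longs == [19990000, 20010000, 19950000]
```

This is exactly the shape the proposal's DATA stream needs: consecutive values differ by small deltas, which RLE v1/v2 encode far more compactly than the current unbounded-varint-plus-scale layout.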
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930387#comment-15930387 ] Owen O'Malley commented on ORC-161: When Hive first introduced decimal, the bounds weren't specified and varied by object. That is *really* problematic for SQL, so since Hive 0.12 all of the decimals have had precision and scale specified in the type. Thus, although there is support for per-object precision and scale, we can and should move to enforcing the scale and precision. We absolutely need an improved decimal encoding for ORC. It was hard to make big changes to ORC before Hive switched to using this implementation, but that is done now. If someone wants to work on this, it would be great. As you wrote above, using an encoding like longs would be great for values with precision <= 18. In particular, we should not encode the scale at all and should force all of the values to use the scale from the type. Since we don't have a 128-bit RLE, the longer-precision decimals should probably be a pair of RLE long streams.
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
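Enforcing the type's scale at write time, with the "scale and throw or round, configurably" behavior from the issue description, can be demonstrated with Python's decimal module. The `ingest` helper below is hypothetical, not an ORC API:

```python
from decimal import Decimal, Inexact, ROUND_HALF_UP, localcontext

def ingest(value: Decimal, scale: int, on_loss: str = "throw") -> int:
    """Rescale to the column's fixed scale, returning the unscaled long.

    on_loss="round" rounds half-up; on_loss="throw" raises decimal.Inexact
    whenever rescaling would lose information.
    """
    quantum = Decimal(1).scaleb(-scale)   # e.g. Decimal('1E-6') for scale 6
    if on_loss == "round":
        return int(value.quantize(quantum, rounding=ROUND_HALF_UP).scaleb(scale))
    with localcontext() as ctx:
        ctx.traps[Inexact] = True         # any lost digits raise decimal.Inexact
        return int(value.quantize(quantum).scaleb(scale))

assert ingest(Decimal("1.25"), 6) == 1250000                # exact: no loss
assert ingest(Decimal("1.2345678"), 6, "round") == 1234568  # rounded half-up
# ingest(Decimal("1.2345678"), 6, "throw") raises decimal.Inexact
```

Either way, the writer ends up with one long per value at the column's scale, which is what a pure-RLE decimal encoding (and long-based statistics) requires.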
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928428#comment-15928428 ] Douglas Drinka commented on ORC-161: Owen, if you think this is something that would get accepted as a pull request, my employer would put some money towards a bounty to get the work done right away. Do any of the contributors around here need a couple-day paid project? Is that appropriate to ask?