[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434909#comment-16434909 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180961097

--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@
 DIRECT_V2 | PRESENT | Yes | Boolean RLE
  | DATA | No | Unbounded base 128 varints
  | SECONDARY | No | Unsigned Integer RLE v2
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the scale
+stream is removed entirely, since all decimal values use the same scale.
+There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA stream
+using signed integer RLE.
+
+Encoding | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v2
+
+### Decimal Encoding for precision > 18
+
+When precision is greater than 18, each decimal value is split into two
+parts: a signed integer holding the upper 64 bits and an unsigned integer
+holding the lower 64 bits. The DATA stream therefore stores the upper
+64-bit signed integers and the SECONDARY stream holds the lower 64-bit
+unsigned integers. Neither stream is optional in this case, and both use RLE.
+
+Encoding | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
+ | SECONDARY | No | Unsigned Integer RLE v1
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
+ | SECONDARY | No | Unsigned Integer RLE v2
--- End diff --

This would be hacky, since we use int64_t and uint64_t to represent Int128 in C++.
I can force the use of signed integer RLE for uint64_t integers. Not sure if Java can do the same thing.

> Create a new column type that run-length-encodes decimals
> ---------------------------------------------------------
>
>                 Key: ORC-161
>                 URL: https://issues.apache.org/jira/browse/ORC-161
>             Project: ORC
>          Issue Type: Wish
>          Components: encoding
>            Reporter: Douglas Drinka
>            Priority: Major
>
> I'm storing prices in ORC format, and have made the following observations about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or minus a few pennies per data point. This would encode beautifully with a patched base encoding. Instead I'm averaging 4 bytes per data point, after Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in decimal columns, but that you probably won't. Presto, for instance, has a fast path which engages for precision of 18 or less and decodes to 64-bit longs, and then a slow path which uses BigInt. I anticipate the majority of implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data point, is messy. Sometimes it's checked on data ingest, other times it's an error on reading, or else it's cast (and rounded?).
> I don't propose eliminating the current column type. It's nice to know there's a way to store really big numbers (or really accurate numbers) if I need that in the future.
> But I'd like to see a new column that uses the existing Run Length Encoding functionality, and is limited to 63+1 bit numbers, with a fixed precision and scale for ingest and query.
> I think one could call this FixedPoint. Every number is stored as a long, and scaled by a column constant. Ingest from decimal would scale and throw or round, configurably. Precision would be fixed at 18, or made configurable and verified at ingest. Stats would use longs (scaled with the column) rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits of precision. Or they can keep using decimal if they need 128 bits. Win/win?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434903#comment-16434903 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180959510

--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,26 @@
 For booleans, the statistics include the count of false and true values.
 }
 ```

-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0, the
+string representation is deprecated and DecimalStatistics uses integers,
+which perform better.

 ```message
 DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ // for precision <= 18
+ optional sint64 minimum64 = 4;
+ optional sint64 maximum64 = 5;
+ optional sint64 sum64 = 6;
--- End diff --

Makes sense, will combine them as sum128.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434902#comment-16434902 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180959364

--- Diff: site/_docs/encodings.md ---
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
--- End diff --

Sorry, a copy-and-paste mistake. Will fix it.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434900#comment-16434900 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180959255

--- Diff: site/_docs/encodings.md ---
+Encoding | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL | PRESENT | Yes | Boolean RLE
--- End diff --

Yep.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434899#comment-16434899 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180959089

--- Diff: site/_docs/encodings.md ---
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v2
--- End diff --

For decimals, the current Integer RLE will be used. As you explained, I agree that we should not use the old RLE v1.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434894#comment-16434894 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180958085

--- Diff: site/_docs/encodings.md ---
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
+ | SECONDARY | No | Unsigned Integer RLE v2
--- End diff --

Moving the sign bits to the lower bits has an ease-of-use advantage (decimal64 is then just a matter of suppressing the high-bit stream).
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434892#comment-16434892 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180957943

--- Diff: site/_docs/file-tail.md ---
+ // for precision <= 18
+ optional sint64 minimum64 = 4;
+ optional sint64 maximum64 = 5;
+ optional sint64 sum64 = 6;
--- End diff --

sum64 needs to be Int128.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434890#comment-16434890 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180957684

--- Diff: site/_docs/encodings.md ---
+DECIMAL_V2 | PRESENT | Yes | Boolean RLE
+ | DATA | No | Signed Integer RLE v1
--- End diff --

RLEv2 for all integers.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434889#comment-16434889 ] ASF GitHub Bot commented on ORC-161: Github user t3rmin4t0r commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r180957651 --- Diff: site/_docs/encodings.md --- @@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA| No | Unbounded base 128 varints | SECONDARY | No | Unsigned Integer RLE v2 +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale +stream is totally removed as all decimal values use the same scale. +There are two difference cases: precision<=18 and precision>18. + +### Decimal Encoding for precision <= 18 + +When precision is no greater than 18, decimal values can be fully +represented by 64-bit signed integers which are stored in DATA stream +and use signed integer RLE. + +Encoding | Stream Kind | Optional | Contents +: | :-- | :--- | :--- +DECIMAL | PRESENT | Yes | Boolean RLE + | DATA| No | Signed Integer RLE v1 +DECIMAL_V2| PRESENT | Yes | Boolean RLE + | DATA| No | Signed Integer RLE v2 + +### Decimal Encoding for precision > 18 + +When precision is greater than 18, decimal value is split into two +parts: a signed integer stores higher 64 bits and an unsigned integer +stores lower 64 bits. Therefore, a DATA stream is utilized to store +the higher 64-bit signed integer of decimal values and a SECONDARY +stream holds the lower 64-bit unsigned integer of decimal values. +Both streams use RLE and are not optional in this case. + +Encoding | Stream Kind | Optional | Contents +: | :-- | :--- | :--- +DECIMAL | PRESENT | Yes | Boolean RLE --- End diff -- This is when Decimal v1 can be retired from the encodings. 
> Create a new column type that run-length-encodes decimals
> ---------------------------------------------------------
>
> Key: ORC-161
> URL: https://issues.apache.org/jira/browse/ORC-161
> Project: ORC
> Issue Type: Wish
> Components: encoding
> Reporter: Douglas Drinka
> Priority: Major
>
> I'm storing prices in ORC format, and have made the following observations about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or minus a few pennies per data point. This would encode beautifully with a patched base encoding. Instead I'm averaging 4 bytes per data point, after Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in decimal columns, but that you probably won't. Presto, for instance, has a fast path which engages for precision of 18 or less and decodes to 64-bit longs, and a slow path which uses BigInt. I anticipate that the majority of implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data point, is messy. Sometimes it's checked on data ingest; other times it's an error on reading, or else it's cast (and rounded?).
> I don't propose eliminating the current column type. It's nice to know there's a way to store really big numbers (or really accurate numbers) if I need that in the future.
> But I'd like to see a new column type that uses the existing Run Length Encoding functionality and is limited to 63+1 bit numbers, with a fixed precision and scale for ingest and query.
> I think one could call this FixedPoint. Every number is stored as a long and scaled by a column constant. Ingest from decimal would scale and throw or round, configurably. Precision would be fixed at 18, or made configurable and verified at ingest. Stats would use longs (scaled with the column) rather than strings.
> Anyone can opt in to faster, smaller data sets if they're OK with 63+1 bits of precision. Or they can keep using decimal if they need 128 bits. Win/win?
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
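The FixedPoint idea described in the issue (every value stored as a long, scaled by a per-column constant) can be sketched as follows. This is a hypothetical illustration of the proposal, not ORC code: the class and method names are invented, and overflow checking on ingest is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of the proposed FixedPoint column: each value is a 64-bit integer
// scaled by a column-wide constant 10^scale. Names are illustrative only.
class FixedPointColumn {
 public:
  explicit FixedPointColumn(int scale) : factor_(1) {
    for (int i = 0; i < scale; ++i) factor_ *= 10;  // exact integer 10^scale
  }

  // Ingest: scale and round. The proposal would make "throw vs. round"
  // configurable; this sketch always rounds.
  int64_t encode(double value) const {
    return std::llround(value * static_cast<double>(factor_));
  }

  // Query: rescale the stored long back to a decimal value.
  double decode(int64_t stored) const {
    return static_cast<double>(stored) / static_cast<double>(factor_);
  }

 private:
  int64_t factor_;  // 10^scale, constant for the whole column
};
```

With a decimal(18,6)-style column, `FixedPointColumn(6)` stores a price like 12.345678 as the long 12345678, which feeds directly into the existing integer RLE paths the issue asks for.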
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434759#comment-16434759 ] ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/245

@t3rmin4t0r @omalley @majetideepak @xndai Any suggestion or concern? If we can finalize this, I can start working on it.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434741#comment-16434741 ] ASF GitHub Bot commented on ORC-161:

GitHub user wgtmac opened a pull request:

https://github.com/apache/orc/pull/245

ORC-161: Proposal for new decimal encodings and statistics.

New decimal encoding proposal is added into docs for discussion.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wgtmac/orc decimal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/245.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #245

commit d7dd529fc44e27a40f22e92e74f8315b80994e5b
Author: Gang Wu
Date: 2018-04-12T00:02:32Z

Proposal for new decimal encodings and statistics.
[jira] [Updated] (ORC-281) Fix compiler warnings from clang 5.0
[ https://issues.apache.org/jira/browse/ORC-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated ORC-281:

Fix Version/s: 1.4.4

> Fix compiler warnings from clang 5.0
> ------------------------------------
>
> Key: ORC-281
> URL: https://issues.apache.org/jira/browse/ORC-281
> Project: ORC
> Issue Type: Bug
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Priority: Major
> Fix For: 1.4.4, 1.5.0
>
> We're currently getting:
> {code}
> [ 43%] Building CXX object c++/src/CMakeFiles/orc.dir/io/InputStream.cc.o
> In file included from /home/travis/build/apache/orc/c++/src/io/InputStream.cc:20:
> /home/travis/build/apache/orc/c++/src/io/InputStream.hh:75:13: error: '~SeekableArrayInputStream' overrides a destructor but is not marked 'override' [-Werror,-Winconsistent-missing-destructor-override]
>     virtual ~SeekableArrayInputStream();
>             ^
> /home/travis/build/apache/orc/c++/src/io/InputStream.hh:53:13: note: overridden virtual function is here
>     virtual ~SeekableInputStream();
>             ^
> /home/travis/build/apache/orc/c++/src/io/InputStream.hh:104:13: error: '~SeekableFileInputStream' overrides a destructor but is not marked 'override' [-Werror,-Winconsistent-missing-destructor-override]
>     virtual ~SeekableFileInputStream();
>             ^
> /home/travis/build/apache/orc/c++/src/io/InputStream.hh:53:13: note: overridden virtual function is here
>     virtual ~SeekableInputStream();
>             ^
> {code}
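The fix for this family of clang warnings is mechanical: the derived class's destructor is marked `override` instead of being re-declared `virtual`. A compilable sketch using the class names from the log above (the bodies are simplified stand-ins, not the real InputStream.hh):

```cpp
#include <cassert>

// Tracks that the derived destructor actually ran through virtual dispatch.
static bool derivedDestroyed = false;

class SeekableInputStream {
 public:
  virtual ~SeekableInputStream() = default;
};

class SeekableArrayInputStream : public SeekableInputStream {
 public:
  // Before the fix:  virtual ~SeekableArrayInputStream();
  // which triggers -Winconsistent-missing-destructor-override under
  // -Werror with clang 5.0. After the fix:
  ~SeekableArrayInputStream() override { derivedDestroyed = true; }
};
```

Deleting a `SeekableArrayInputStream` through a `SeekableInputStream*` still invokes the derived destructor; `override` only makes the existing overriding relationship explicit to the compiler.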
[jira] [Commented] (ORC-330) Remove unnecessary Hive artifacts from root pom
[ https://issues.apache.org/jira/browse/ORC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434568#comment-16434568 ] Daniel Voros commented on ORC-330:

Thank you!

> Remove unnecessary Hive artifacts from root pom
> -----------------------------------------------
>
> Key: ORC-330
> URL: https://issues.apache.org/jira/browse/ORC-330
> Project: ORC
> Issue Type: Task
> Components: Java
> Affects Versions: 1.4.3
> Reporter: Daniel Voros
> Assignee: Daniel Voros
> Priority: Minor
> Fix For: 1.4.4, 1.5.0
>
> dependencyManagement was defined for some Hive artifacts necessary for benchmarking. Since those were moved out in ORC-298, these are no longer (transitive) dependencies of Orc and might be removed.
[jira] [Resolved] (ORC-332) Add syntax version to orc_proto.proto
[ https://issues.apache.org/jira/browse/ORC-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved ORC-332.

Resolution: Fixed
Fix Version/s: 1.5.0
               1.4.4

> Add syntax version to orc_proto.proto
> -------------------------------------
>
> Key: ORC-332
> URL: https://issues.apache.org/jira/browse/ORC-332
> Project: ORC
> Issue Type: Improvement
> Reporter: rip.nsk
> Assignee: rip.nsk
> Priority: Trivial
> Fix For: 1.4.4, 1.5.0
[jira] [Resolved] (ORC-330) Remove unnecessary Hive artifacts from root pom
[ https://issues.apache.org/jira/browse/ORC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved ORC-330.

Resolution: Fixed
Fix Version/s: 1.5.0
               1.4.4

I just committed this. Thanks, Daniel!
[jira] [Resolved] (ORC-336) Remove avro and parquet dependency management entries
[ https://issues.apache.org/jira/browse/ORC-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved ORC-336.

Resolution: Fixed
Fix Version/s: 1.5.0
               1.4.4

> Remove avro and parquet dependency management entries
> -----------------------------------------------------
>
> Key: ORC-336
> URL: https://issues.apache.org/jira/browse/ORC-336
> Project: ORC
> Issue Type: Bug
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Priority: Major
> Fix For: 1.4.4, 1.5.0
>
> When we removed the benchmark code in ORC-298, we forgot to remove the dependency management entries for Avro and Parquet.
[jira] [Assigned] (ORC-330) Remove unnecessary Hive artifacts from root pom
[ https://issues.apache.org/jira/browse/ORC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned ORC-330:

Assignee: Daniel Voros
[jira] [Assigned] (ORC-336) Remove avro and parquet dependency management entries
[ https://issues.apache.org/jira/browse/ORC-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned ORC-336: