[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435647#comment-16435647 ]

ASF GitHub Bot commented on ORC-161:

Github user majetideepak commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181096194

--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0, the
+string representation is deprecated and DecimalStatistics uses integers,
+which have better performance.
 ```
 message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ message Int128 {
+  repeated sint64 highBits = 1;
+  repeated uint64 lowBits = 2;
--- End diff --

Shouldn't this be sint64 as well, since we are using uint64 for the SECONDARY stream?

> Create a new column type that run-length-encodes decimals
> ---------------------------------------------------------
>
>          Key: ORC-161
>          URL: https://issues.apache.org/jira/browse/ORC-161
>      Project: ORC
>   Issue Type: Wish
>   Components: encoding
>     Reporter: Douglas Drinka
>     Priority: Major
>
> I'm storing prices in ORC format, and have made the following observations
> about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or
> minus a few pennies per data point. This would encode beautifully with a
> patched base encoding. Instead I'm averaging 4 bytes per data point, after
> Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in
> decimal columns, but that you probably won't. Presto, for instance, has a
> fast path which engages for precision of 18 or less and decodes to 64-bit
> longs, and then a slow path which uses BigInt. I anticipate the majority of
> implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data
> point, is messy. Sometimes it's checked on data ingest, other times it's an
> error on reading, or else it's cast (and rounded?).
> I don't propose eliminating the current column type. It's nice to know
> there's a way to store really big numbers (or really accurate numbers) if I
> need that in the future.
> But I'd like to see a new column type that uses the existing Run Length
> Encoding functionality and is limited to 63+1 bit numbers, with a fixed
> precision and scale for ingest and query.
> I think one could call this FixedPoint. Every number is stored as a long,
> and scaled by a column constant. Ingest from decimal would scale and throw
> or round, configurably. Precision would be fixed at 18, or made configurable
> and verified at ingest. Stats would use longs (scaled with the column)
> rather than strings.
> Anyone can opt in to faster, smaller data sets if they're OK with 63+1 bits
> of precision. Or they can keep using decimal if they need 128 bits. Win/win?

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
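The FixedPoint proposal above (each value stored as a 64-bit long scaled by a single per-column constant, with a configurable throw-or-round policy on ingest) can be sketched as follows. This is an illustrative sketch of the idea, not ORC code; the class and method names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

class FixedPointColumn:
    """Sketch of the proposed FixedPoint column: every value is stored as a
    64-bit integer scaled by one per-column scale, ready for integer RLE."""

    def __init__(self, scale, round_on_ingest=False):
        self.scale = scale                  # digits after the decimal point
        self.round_on_ingest = round_on_ingest
        self.values = []                    # scaled longs

    def ingest(self, value: Decimal):
        scaled = value.scaleb(self.scale)   # shift the decimal point right
        if scaled != scaled.to_integral_value():
            if not self.round_on_ingest:
                # "scale and throw" mode: reject values that don't fit
                raise ValueError(f"{value} does not fit scale {self.scale}")
            # "scale and round" mode
            scaled = scaled.quantize(Decimal(1), rounding=ROUND_HALF_UP)
        n = int(scaled)
        if not -(2 ** 63) <= n < 2 ** 63:
            raise OverflowError("value exceeds 63+1 bits")
        self.values.append(n)

    def read(self, i) -> Decimal:
        # rescale by the column constant on the way out
        return Decimal(self.values[i]).scaleb(-self.scale)

col = FixedPointColumn(scale=6)
col.ingest(Decimal("19.990001"))
col.ingest(Decimal("-0.25"))

col2 = FixedPointColumn(scale=2, round_on_ingest=True)
col2.ingest(Decimal("3.14159"))   # rounds to 314
```

Because `col.values` is a plain list of longs, min/max/sum statistics can also be kept as longs scaled with the column, as the proposal suggests.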
[jira] [Assigned] (ORC-338) Workaround C++ compiler bug in newest clang including xcode 9.3
[ https://issues.apache.org/jira/browse/ORC-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley reassigned ORC-338:

> Workaround C++ compiler bug in newest clang including xcode 9.3
> ---------------------------------------------------------------
>
>          Key: ORC-338
>          URL: https://issues.apache.org/jira/browse/ORC-338
>      Project: ORC
>   Issue Type: Bug
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>     Priority: Major
>
> The ColumnStatistics.intColumnStatistics test fails in xcode 9.3 if you
> use the release build, but passes in the debug build.
[jira] [Commented] (ORC-338) Workaround C++ compiler bug in newest clang including xcode 9.3
[ https://issues.apache.org/jira/browse/ORC-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435704#comment-16435704 ]

ASF GitHub Bot commented on ORC-338:

GitHub user omalley opened a pull request:
https://github.com/apache/orc/pull/246

ORC-338. Workaround C++ compiler bug in xcode 9.3 by removing an inline function.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/omalley/orc orc-338

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/orc/pull/246.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #246
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435904#comment-16435904 ]

ASF GitHub Bot commented on ORC-161:

Github user xndai commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181149073

--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT   | Yes | Boolean RLE
           | DATA      | No  | Unbounded base 128 varints
           | SECONDARY | No  | Unsigned Integer RLE v2
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding   | Stream Kind | Optional | Contents
+:--------- | :---------- | :------- | :-------
+DECIMAL    | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v2
--- End diff --

I think we should keep RLE v1 as an option. The C++ writer currently does not support RLE v2 (we are working on it). We don't want the new decimal writer to have a dependency on that.
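For context on the RLE v1 vs. v2 discussion in this thread: run-length encoding in general replaces repeated values with (count, value) pairs and falls back to literal groups otherwise. The toy scheme below illustrates only that core idea; it is not the byte-exact ORC RLE v1 or v2 format, which add delta runs, bit-packing, and (in v2) patched base encoding.

```python
def rle_encode(values):
    """Toy run-length encoder: (count, value) for runs of length >= 3,
    (0, [literals]) for everything else. A walking-random price series
    produces mostly literal groups, which is why delta/patched-base
    support matters for the use case described in ORC-161."""
    out, i = [], 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        if j - i >= 3:
            out.append((j - i, values[i]))
            i = j
        else:
            lits = []
            while i < len(values):
                k = i
                while k < len(values) and values[k] == values[i]:
                    k += 1
                if k - i >= 3:
                    break           # a real run starts here; close literals
                lits.extend(values[i:k])
                i = k
            out.append((0, lits))
    return out

def rle_decode(encoded):
    out = []
    for count, payload in encoded:
        out.extend(payload if count == 0 else [payload] * count)
    return out
```

For example, `[5, 5, 5, 5, 1, 2, 3, 3, 3]` encodes as one run, one literal group, and another run.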
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435961#comment-16435961 ]

ASF GitHub Bot commented on ORC-161:

Github user majetideepak commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181155751

--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT   | Yes | Boolean RLE
           | DATA      | No  | Unbounded base 128 varints
           | SECONDARY | No  | Unsigned Integer RLE v2
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding   | Stream Kind | Optional | Contents
+:--------- | :---------- | :------- | :-------
+DECIMAL    | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v2
--- End diff --

@xndai Vertica is interested in getting RLE v2 for C++ as well. Do you think we can collaborate on getting this in quickly?
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435974#comment-16435974 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181157700

--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT   | Yes | Boolean RLE
           | DATA      | No  | Unbounded base 128 varints
           | SECONDARY | No  | Unsigned Integer RLE v2
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding   | Stream Kind | Optional | Contents
+:--------- | :---------- | :------- | :-------
+DECIMAL    | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v2
--- End diff --

@majetideepak We are already working on it and doing tests and benchmarks. We will contribute it back, but maybe not that soon.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435979#comment-16435979 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181158484

--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
 }
 ```
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0, the
+string representation is deprecated and DecimalStatistics uses integers,
+which have better performance.
 ```
 message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+ message Int128 {
+  repeated sint64 highBits = 1;
+  repeated uint64 lowBits = 2;
--- End diff --

Here I was aligning with the C++ orc::Int128 implementation to avoid many casts.
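The orc::Int128-style split discussed here (signed high bits, unsigned low bits) can be illustrated directly: in a two's-complement 128-bit value, the sign lives entirely in the upper half, so only the high 64 bits need a signed type. The sketch below is illustrative Python, not the C++ orc::Int128 implementation; the function names are hypothetical.

```python
def int128_split(v):
    """Split a signed 128-bit integer into (high, low) halves that mirror
    a two's-complement layout: high is a signed 64-bit value, low is an
    unsigned 64-bit value."""
    assert -(2 ** 127) <= v < 2 ** 127
    u = v & (2 ** 128 - 1)        # two's-complement bit pattern of v
    low = u & (2 ** 64 - 1)       # unsigned low 64 bits
    high = u >> 64
    if high >= 2 ** 63:           # reinterpret the high half as signed
        high -= 2 ** 64
    return high, low

def int128_join(high, low):
    # high carries the sign, so shifting it back and OR-ing in the
    # unsigned low half reconstructs the original value exactly
    return (high << 64) | low

h, l = int128_split(-1)           # h == -1, l == 2**64 - 1
```

Note that for -1 the low half is all ones (the maximum uint64), which is why declaring `lowBits` as signed would force sign-related casts at every boundary.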
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436017#comment-16436017 ]

ASF GitHub Bot commented on ORC-161:

Github user dain commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181164570

--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was changed to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, the DECIMAL encoding is introduced and the scale stream is
+removed entirely, as all decimal values use the same scale. When precision
+is no greater than 18, decimal values can be fully represented by the DATA
+stream, which stores 64-bit signed integers. When precision is greater
+than 18, a 128-bit signed integer stores the decimal value: the DATA
+stream stores the upper 64 bits and the SECONDARY stream holds the
+lower 64 bits. Both streams use signed integer RLE v2.
--- End diff --

Why split the data across two streams? This means 2 IOs (or one large coalesced IO) to read the values (assuming no nulls). Instead, can't we put all 128 bits in one stream?
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436038#comment-16436038 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181168352

--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 Decimal was introduced in Hive 0.11 with infinite precision (the total
 number of digits). In Hive 0.13, the definition was changed to limit
 the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, the DECIMAL encoding is introduced and the scale stream is
+removed entirely, as all decimal values use the same scale. When precision
+is no greater than 18, decimal values can be fully represented by the DATA
+stream, which stores 64-bit signed integers. When precision is greater
+than 18, a 128-bit signed integer stores the decimal value: the DATA
+stream stores the upper 64 bits and the SECONDARY stream holds the
+lower 64 bits. Both streams use signed integer RLE v2.
--- End diff --

The main problem is that we don't have a 128-bit integer RLE on hand.
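The two-stream layout debated above can be sketched as follows: each precision > 18 value is split into a signed high half (DATA stream) and an unsigned low half (SECONDARY stream), giving two parallel 64-bit columns that an existing 64-bit integer RLE can handle without a 128-bit RLE. This is an illustrative sketch, not ORC reader/writer code; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def write_streams(values):
    """Split each signed 128-bit value into two parallel 64-bit streams:
    DATA holds the signed high halves, SECONDARY the unsigned low halves.
    Each stream can then be run-length encoded independently."""
    data, secondary = [], []
    for v in values:
        u = v & ((1 << 128) - 1)     # two's-complement bit pattern
        high = u >> 64
        if high >= 1 << 63:
            high -= 1 << 64          # the high half is signed
        data.append(high)
        secondary.append(u & MASK64)
    return data, secondary

def read_streams(data, secondary):
    """Reassemble values from the two streams; the high half carries the
    sign, so no further correction is needed."""
    return [(h << 64) | l for h, l in zip(data, secondary)]

vals = [0, -1, 10 ** 25, -(10 ** 25)]
d, s = write_streams(vals)
```

Reading one row group then needs both streams, which is exactly the 2-IO concern raised in the review; whether that costs anything in practice depends on the on-disk layout of the streams, as noted later in the thread.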
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436270#comment-16436270 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on the issue:
https://github.com/apache/orc/pull/245

Will provide them after a comprehensive benchmark.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436216#comment-16436216 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181203617

--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT   | Yes | Boolean RLE
           | DATA      | No  | Unbounded base 128 varints
           | SECONDARY | No  | Unsigned Integer RLE v2
+In ORC 2.0, the DECIMAL and DECIMAL_V2 encodings are introduced and the
+scale stream is removed entirely, as all decimal values use the same
+scale. There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA
+stream using signed integer RLE.
+
+Encoding   | Stream Kind | Optional | Contents
+:--------- | :---------- | :------- | :-------
+DECIMAL    | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v2
--- End diff --

Some part of this discussion is about the new ORC format, and existing reader compatibility is not a requirement until we switch to the new format as the default.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436214#comment-16436214 ]

ASF GitHub Bot commented on ORC-161:

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181202668

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as a signed
    +integer.
    +
    +In ORC 2.0, the DECIMAL encoding is introduced and the scale stream is
    +removed entirely, since all decimal values use the same scale. When the
    +precision is no greater than 18, decimal values can be fully represented
    +by the DATA stream, which stores 64-bit signed integers. When the
    +precision is greater than 18, a 128-bit signed integer stores the decimal
    +value: the DATA stream stores the higher 64 bits and the SECONDARY stream
    +holds the lower 64 bits. Both streams use signed integer RLE v2.
    --- End diff --

    The multiple-stream + row-group stride problems for IO were discussed
    by Owen. The disk layout is what matters for IO, not the logical stream
    separation.
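The precision > 18 layout quoted above (DATA = high 64 bits, SECONDARY = low 64 bits) is an arithmetic split of a 128-bit signed integer. The sketch below (illustrative helpers, not ORC code) shows one way to split and rejoin: the high half carries the sign while the low half is a plain unsigned value, which is the asymmetry behind the sint64-vs-uint64 question raised earlier in the thread.

```python
# Split a 128-bit signed integer into the two 64-bit halves described in
# the diff: the high half is signed, the low half is the unsigned lower
# 64 bits. Helper names are invented for illustration.
MASK64 = (1 << 64) - 1

def split_int128(value: int) -> tuple[int, int]:
    """Return (high, low) for a 128-bit signed value."""
    low = value & MASK64   # unsigned lower 64 bits (always non-negative)
    high = value >> 64     # arithmetic shift, so the sign lives here
    return high, low

def join_int128(high: int, low: int) -> int:
    """Reassemble the original value from the two halves."""
    return (high << 64) | low

# Round-trips for positive, negative, and boundary values.
for v in (0, 1, -1, 10 ** 30, -(10 ** 30)):
    h, l = split_int128(v)
    assert join_int128(h, l) == v

print(split_int128(-1))  # (-1, 18446744073709551615): signed high, unsigned low
```

Python integers are arbitrary-precision, so `>>` is an arithmetic shift and the round-trip holds for negatives; a C++ implementation would do the same with `int64_t` high and `uint64_t` low words.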
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436183#comment-16436183 ]

ASF GitHub Bot commented on ORC-161:

Github user prasanthj commented on the issue:

    https://github.com/apache/orc/pull/245

    "we found RLEv1 + zstd may be the best combination in terms of both
    compression ratio and encoding/decoding speed."

    Do you have experimental numbers for this?
[jira] [Resolved] (ORC-318) Change HadoopShims.KeyProvider to separate createLocalKey and decryptLocalKey
[ https://issues.apache.org/jira/browse/ORC-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved ORC-318.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

> Change HadoopShims.KeyProvider to separate createLocalKey and decryptLocalKey
> -----------------------------------------------------------------------------
>
>                 Key: ORC-318
>                 URL: https://issues.apache.org/jira/browse/ORC-318
>             Project: ORC
>          Issue Type: Sub-task
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Major
>             Fix For: 1.5.0
>
> Looking through the [AWS
> KMS|https://docs.aws.amazon.com/kms/latest/APIReference/Welcome.html] docs,
> to be compatible we should probably separate creating a local key from
> decrypting it.
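The createLocalKey/decryptLocalKey split described above mirrors the envelope-encryption pattern in the linked AWS KMS docs: one call generates a local (data) key and returns it in both plaintext and encrypted form, and a separate call later decrypts the stored encrypted form. A toy sketch of that interface split, with invented names and XOR standing in for real encryption (a real provider would call a KMS):

```python
import os

class ToyKeyProvider:
    """Illustrative only: models the createLocalKey/decryptLocalKey
    separation. XOR with an in-memory master key stands in for the
    KMS-side encryption; do not use this for real key management."""

    def __init__(self) -> None:
        self._master = os.urandom(16)

    def _xor(self, data: bytes) -> bytes:
        # Symmetric toy "encryption": XOR against the master key.
        return bytes(a ^ b for a, b in zip(data, self._master))

    def create_local_key(self) -> tuple[bytes, bytes]:
        """Writer path: returns (plaintext_key, encrypted_key).
        Only the encrypted form would be persisted in the file."""
        plaintext = os.urandom(16)
        return plaintext, self._xor(plaintext)

    def decrypt_local_key(self, encrypted: bytes) -> bytes:
        """Reader path: recover the plaintext key from the stored form."""
        return self._xor(encrypted)

provider = ToyKeyProvider()
plaintext, encrypted = provider.create_local_key()
assert provider.decrypt_local_key(encrypted) == plaintext
```

Keeping the two operations separate means the writer never needs a decrypt permission and the reader never needs a key-generation permission, which matches how KMS-style services scope their APIs.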
[jira] [Commented] (ORC-339) Reorganize ORC specification
[ https://issues.apache.org/jira/browse/ORC-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436363#comment-16436363 ]

ASF GitHub Bot commented on ORC-339:

GitHub user omalley opened a pull request:

    https://github.com/apache/orc/pull/247

    ORC-339. Reorganize the ORC file format specification.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/omalley/orc orc-339

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #247

commit 5c56d74d948a73f5c456e0e80ff0622505d6c1cf
Author: Owen O'Malley
Date:   2018-04-12T22:03:00Z

    ORC-339. Reorganize the ORC file format specification.

> Reorganize ORC specification
> ----------------------------
>
>                 Key: ORC-339
>                 URL: https://issues.apache.org/jira/browse/ORC-339
>             Project: ORC
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Major
>
> Currently we've put the ORC format specification in the documentation. Now
> that we are starting the work to design ORCv2, it will be more convenient
> to have each file format version as a separate page.
[jira] [Assigned] (ORC-339) Reorganize ORC specification
[ https://issues.apache.org/jira/browse/ORC-339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley reassigned ORC-339:
---------------------------------
[jira] [Commented] (ORC-339) Reorganize ORC specification
[ https://issues.apache.org/jira/browse/ORC-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436395#comment-16436395 ]

ASF GitHub Bot commented on ORC-339:

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/247#discussion_r181239251

    --- Diff: site/specification/ORCv2.md ---
    @@ -0,0 +1,1032 @@
    +---
    +layout: page
    +title: Evolving Draft for ORC Specification v2
    +---
    +
    +This specification is rapidly evolving and should only be used for
    +developers on the project.
    +
    +# TO DO items
    +
    +The list of things that we plan to change:
    +
    +* Create a decimal representation with fixed scale using rle.
    +* Create a better float/double encoding that splits mantissa and
    +  exponent.
    +* Create a dictionary encoding for float, double, and decimal.
    +* Create RLEv3:
    +  * 64 and 128 bit variants
    +  * Zero suppression
    +  * Evaluate the rle subformats
    +* Group stripe data into stripelets to enable Async IO for reads.
    +* Reorder stripe data into (stripe metadata, index, dictionary, data).
    +* Stop sorting dictionaries and record the sort order separately in
    +  the index.
    +* Remove use of RLEv1 and RLEv2.
    +* Remove non-utf8 bloom filter.
    +* Use numeric value for decimal bloom filter.
    --- End diff --

    We may also use numeric values for decimal column statistics.
[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals
[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436156#comment-16436156 ]

ASF GitHub Bot commented on ORC-161:

Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/245

    On second thought, I added DECIMAL_V1 back to support RLE v1 in the
    decimal encoding. The reason is that in our testing we found RLEv1 +
    zstd may be the best combination in terms of both compression ratio
    and encoding/decoding speed.
[jira] [Commented] (ORC-318) Change HadoopShims.KeyProvider to separate createLocalKey and decryptLocalKey
[ https://issues.apache.org/jira/browse/ORC-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435737#comment-16435737 ]

ASF GitHub Bot commented on ORC-318:

Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/227
[jira] [Commented] (ORC-338) Workaround C++ compiler bug in newest clang including xcode 9.3
[ https://issues.apache.org/jira/browse/ORC-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435740#comment-16435740 ]

ASF GitHub Bot commented on ORC-338:

Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/246

> Workaround C++ compiler bug in newest clang including xcode 9.3
> ---------------------------------------------------------------
>
>                 Key: ORC-338
>                 URL: https://issues.apache.org/jira/browse/ORC-338
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Major
>
> The ColumnStatistics.intColumnStatistics test fails with Xcode 9.3 if you
> use the release build, but passes in the debug build.
[jira] [Resolved] (ORC-338) Workaround C++ compiler bug in newest clang including xcode 9.3
[ https://issues.apache.org/jira/browse/ORC-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved ORC-338.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0