[jira] [Resolved] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
[ https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved PARQUET-2231.
-------------------------------------
    Resolution: Fixed

Closed by PR https://github.com/apache/parquet-format/pull/189

> [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
> -----------------------------------------------------
>
>                 Key: PARQUET-2231
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2231
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Critical
>             Fix For: format-2.10.0
>
>
> The spec says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY, but in
> parquet-mr it has been allowed for FIXED_LEN_BYTE_ARRAY as well since 2015.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
[ https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678927#comment-17678927 ]

ASF GitHub Bot commented on PARQUET-2231:
-----------------------------------------

pitrou merged PR #189:
URL: https://github.com/apache/parquet-format/pull/189
[GitHub] [parquet-format] pitrou merged pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY
pitrou merged PR #189:
URL: https://github.com/apache/parquet-format/pull/189

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
[ https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678918#comment-17678918 ]

ASF GitHub Bot commented on PARQUET-2231:
-----------------------------------------

pitrou commented on code in PR #189:
URL: https://github.com/apache/parquet-format/pull/189#discussion_r1081911430

## Encodings.md: ##
@@ -299,9 +302,18 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding
 This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).

+For example, if the data was "axis", "axle", "babble", "babyhood":
+
+The encoded data would be comprised of the following segments:

Review Comment:
   ```suggestion
   For example, if the data was "axis", "axle", "babble", "babyhood"

   then the encoded data would be comprised of the following segments:
   ```
[GitHub] [parquet-format] pitrou commented on a diff in pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY
pitrou commented on code in PR #189:
URL: https://github.com/apache/parquet-format/pull/189#discussion_r1081911430

## Encodings.md: ##
@@ -299,9 +302,18 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding
 This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).

+For example, if the data was "axis", "axle", "babble", "babyhood":
+
+The encoded data would be comprised of the following segments:

Review Comment:
   ```suggestion
   For example, if the data was "axis", "axle", "babble", "babyhood"

   then the encoded data would be comprised of the following segments:
   ```
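The "axis", "axle", "babble", "babyhood" example under review can be traced with a minimal Python sketch of the prefix/suffix split that DELTA_BYTE_ARRAY relies on. The function names `front_code` and `front_decode` are hypothetical and not part of any Parquet library, and the sketch stops before the actual DELTA_BINARY_PACKED and DELTA_LENGTH_BYTE_ARRAY bit-level encodings; it only shows which integers and bytes would be handed to them.

```python
def front_code(values):
    """Split each value into (shared-prefix length, suffix) relative to the previous value."""
    prefix_lengths, suffixes = [], []
    previous = b""
    for value in values:
        # Count the leading bytes shared with the previous value.
        n = 0
        limit = min(len(previous), len(value))
        while n < limit and previous[n] == value[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(value[n:])
        previous = value
    return prefix_lengths, suffixes


def front_decode(prefix_lengths, suffixes):
    """Rebuild the original values from prefix lengths and suffixes."""
    values, previous = [], b""
    for n, suffix in zip(prefix_lengths, suffixes):
        previous = previous[:n] + suffix
        values.append(previous)
    return values


data = [b"axis", b"axle", b"babble", b"babyhood"]
prefix_lengths, suffixes = front_code(data)
suffix_lengths = [len(s) for s in suffixes]

print(prefix_lengths)      # [0, 2, 0, 3]
print(suffix_lengths)      # [4, 2, 6, 5]
print(b"".join(suffixes))  # b'axislebabbleyhood'
```

In this sketch the prefix lengths would go to a DELTA_BINARY_PACKED encoder, and the suffixes (their lengths followed by the concatenated bytes) to a DELTA_LENGTH_BYTE_ARRAY encoder. Nothing in the split depends on whether the values have variable or fixed length, which is consistent with the issue's point that the encoding also works for FIXED_LEN_BYTE_ARRAY.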
[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
[ https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1767#comment-1767 ]

ASF GitHub Bot commented on PARQUET-2231:
-----------------------------------------

wjones127 commented on code in PR #189:
URL: https://github.com/apache/parquet-format/pull/189#discussion_r1081899568

## Encodings.md: ##
@@ -280,16 +280,19 @@ concatenated back to back. The expected savings is from the cost of encoding the
 and possibly better compression in the data (it is no longer interleaved with the lengths).

 The data stream looks like:
-
+```
+```

-For example, if the data was "Hello", "World", "Foobar", "ABCDEF":
+For example, if the data was "Hello", "World", "Foobar", "ABCDEF"

-The encoded data would be DeltaEncoding(5, 5, 6, 6) "HelloWorldFoobarABCDEF"
+The encoded data would be comprised of the following segments:

Review Comment:
   ```suggestion
   then the encoded data would be comprised of the following segments:
   ```
[GitHub] [parquet-format] wjones127 commented on a diff in pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY
wjones127 commented on code in PR #189:
URL: https://github.com/apache/parquet-format/pull/189#discussion_r1081899568

## Encodings.md: ##
@@ -280,16 +280,19 @@ concatenated back to back. The expected savings is from the cost of encoding the
 and possibly better compression in the data (it is no longer interleaved with the lengths).

 The data stream looks like:
-
+```
+```

-For example, if the data was "Hello", "World", "Foobar", "ABCDEF":
+For example, if the data was "Hello", "World", "Foobar", "ABCDEF"

-The encoded data would be DeltaEncoding(5, 5, 6, 6) "HelloWorldFoobarABCDEF"
+The encoded data would be comprised of the following segments:

Review Comment:
   ```suggestion
   then the encoded data would be comprised of the following segments:
   ```
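The "Hello", "World", "Foobar", "ABCDEF" example in the hunk above can be reproduced with a short Python sketch of the DELTA_LENGTH_BYTE_ARRAY layout: all lengths first, then the byte array data concatenated back to back. The helper names `split_lengths_and_data` and `join_from_lengths` are hypothetical, and the lengths are left as a plain list here rather than being delta-encoded, so this illustrates the layout only, not a conforming encoder.

```python
def split_lengths_and_data(values):
    """Separate the value lengths from the concatenated value bytes."""
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def join_from_lengths(lengths, data):
    """Slice the concatenated bytes back into values using the lengths."""
    values, offset = [], 0
    for n in lengths:
        values.append(data[offset:offset + n])
        offset += n
    return values


values = [b"Hello", b"World", b"Foobar", b"ABCDEF"]
lengths, data = split_lengths_and_data(values)

print(lengths)  # [5, 5, 6, 6]
print(data)     # b'HelloWorldFoobarABCDEF'
```

This matches the `DeltaEncoding(5, 5, 6, 6) "HelloWorldFoobarABCDEF"` wording removed by the diff. Keeping the lengths apart from the data is what the spec text credits for the possible compression gain, since the data is no longer interleaved with the lengths.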