[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678927#comment-17678927
 ] 

ASF GitHub Bot commented on PARQUET-2231:
-

pitrou merged PR #189:
URL: https://github.com/apache/parquet-format/pull/189




> [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
> -
>
> Key: PARQUET-2231
> URL: https://issues.apache.org/jira/browse/PARQUET-2231
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Antoine Pitrou
> Assignee: Antoine Pitrou
> Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY, but in 
> parquet-mr it has been allowed for FIXED_LEN_BYTE_ARRAY as well since 2015.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678918#comment-17678918
 ] 

ASF GitHub Bot commented on PARQUET-2231:
-

pitrou commented on code in PR #189:
URL: https://github.com/apache/parquet-format/pull/189#discussion_r1081911430


##
Encodings.md:
##
@@ -299,9 +302,18 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding
 This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
 the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
 
+For example, if the data was "axis", "axle", "babble", "babyhood":
+
+The encoded data would be comprised of the following segments:

Review Comment:
   ```suggestion
   For example, if the data was "axis", "axle", "babble", "babyhood"
   
   then the encoded data would be comprised of the following segments:
   ```
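
To make the example in this hunk concrete, here is a minimal Python sketch of how the four values could be split into prefix lengths and suffixes. The function names are hypothetical and not taken from parquet-mr or the spec, and the sketch only computes the logical segments: it does not perform the actual DELTA_BINARY_PACKED or DELTA_LENGTH_BYTE_ARRAY serialization.

```python
def shared_prefix_length(a: bytes, b: bytes) -> int:
    """Count the leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def split_prefixes(values):
    """Return (prefix_lengths, suffixes) for incremental encoding.

    Each value is compared with the value immediately before it;
    the first value is compared with the empty byte string.
    """
    prefix_lengths = []
    suffixes = []
    previous = b""
    for value in values:
        shared = shared_prefix_length(previous, value)
        prefix_lengths.append(shared)
        suffixes.append(value[shared:])
        previous = value
    return prefix_lengths, suffixes


def join_prefixes(prefix_lengths, suffixes):
    """Inverse of split_prefixes: rebuild the original values."""
    values = []
    previous = b""
    for shared, suffix in zip(prefix_lengths, suffixes):
        value = previous[:shared] + suffix
        values.append(value)
        previous = value
    return values


data = [b"axis", b"axle", b"babble", b"babyhood"]
prefix_lengths, suffixes = split_prefixes(data)
print(prefix_lengths)  # [0, 2, 0, 3]
print(suffixes)        # [b'axis', b'le', b'babble', b'yhood']
```

Nothing in the sketch depends on the values having different lengths, so the same logic applies unchanged to fixed-length values.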







[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1767#comment-1767
 ] 

ASF GitHub Bot commented on PARQUET-2231:
-

wjones127 commented on code in PR #189:
URL: https://github.com/apache/parquet-format/pull/189#discussion_r1081899568


##
Encodings.md:
##
@@ -280,16 +280,19 @@ concatenated back to back. The expected savings is from the cost of encoding the
 and possibly better compression in the data (it is no longer interleaved with the lengths).
 
 The data stream looks like:
-
+```
  
+```
 
-For example, if the data was "Hello", "World", "Foobar", "ABCDEF":
+For example, if the data was "Hello", "World", "Foobar", "ABCDEF"
 
-The encoded data would be DeltaEncoding(5, 5, 6, 6) "HelloWorldFoobarABCDEF"
+The encoded data would be comprised of the following segments:

Review Comment:
   ```suggestion
   then the encoded data would be comprised of the following segments:
   ```
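
As an illustration of the example in this hunk, the following Python sketch separates the lengths from the concatenated byte data and then reverses the process. The function names are hypothetical, and the lengths are kept as a plain list rather than being packed with the delta encoding the spec requires.

```python
def split_lengths(values):
    """Return (lengths, data): all lengths first, then the bytes back to back."""
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def join_lengths(lengths, data):
    """Inverse of split_lengths: slice the byte stream using the lengths."""
    values = []
    offset = 0
    for length in lengths:
        values.append(data[offset:offset + length])
        offset += length
    return values


lengths, data = split_lengths([b"Hello", b"World", b"Foobar", b"ABCDEF"])
print(lengths)  # [5, 5, 6, 6]
print(data)     # b'HelloWorldFoobarABCDEF'
```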







[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677313#comment-17677313
 ] 

ASF GitHub Bot commented on PARQUET-2231:
-

pitrou commented on PR #189:
URL: https://github.com/apache/parquet-format/pull/189#issuecomment-1383840418

   Also cc @rok






[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677307#comment-17677307
 ] 

ASF GitHub Bot commented on PARQUET-2231:
-

pitrou commented on PR #189:
URL: https://github.com/apache/parquet-format/pull/189#issuecomment-1383831257

   @emkornfield @gszadovszky @rdblue 






[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677306#comment-17677306
 ] 

ASF GitHub Bot commented on PARQUET-2231:
-

pitrou commented on PR #189:
URL: https://github.com/apache/parquet-format/pull/189#issuecomment-1383830870

   @wjones127 Could you help review the wording?






[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677305#comment-17677305
 ] 

ASF GitHub Bot commented on PARQUET-2231:
-

pitrou opened a new pull request, #189:
URL: https://github.com/apache/parquet-format/pull/189

   DELTA_BYTE_ARRAY has been supported for FIXED_LEN_BYTE_ARRAY by parquet-mr since 2015 (see PARQUET-152). Update the spec accordingly.
   
   Also improve the wording and markup, and add an example.
   
   






[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677300#comment-17677300
 ] 

Antoine Pitrou commented on PARQUET-2231:
-

[~rok] [~shanhuang] [~muthunagappan] [~jinshang] FYI
