Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]

2025-01-13 Thread via GitHub


mapleFU merged PR #67:
URL: https://github.com/apache/parquet-testing/pull/67


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]

2025-01-08 Thread via GitHub


adamreeve commented on PR #67:
URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2579050396

   Thanks! I thought I left a comment earlier but GitHub was having an outage 
so maybe it got lost. I made the suggested changes but kept the data 
uncompressed as enabling zstd compression actually increased the file size 
slightly with such small data.
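
The observation above, that compression can make very small data larger, can be illustrated with a minimal sketch. This is an assumption-laden aside, not the check that was run on the PR: it uses Python's standard-library `zlib` as a stand-in for zstd (a different codec, with different framing overhead), and the payloads `tiny` and `large` are hypothetical rather than taken from the test file.

```python
import hashlib
import zlib


def compressed_size(payload: bytes) -> int:
    """Return the size in bytes of the zlib-compressed payload."""
    return len(zlib.compress(payload))


# Hypothetical tiny, high-entropy payload (32 bytes), standing in for a very
# small data page. There is almost nothing for the codec to exploit here.
tiny = hashlib.sha256(b"parquet-testing").digest()

# Hypothetical larger, highly repetitive payload for contrast.
large = b"\x00" * 4096

# The codec's header and trailer cost a fixed number of bytes, so the tiny
# payload grows while the repetitive one shrinks.
tiny_grew = compressed_size(tiny) > len(tiny)
large_shrank = compressed_size(large) < len(large)

print("tiny: ", len(tiny), "->", compressed_size(tiny))
print("large:", len(large), "->", compressed_size(large))
```

The fixed per-stream overhead dominates when the input is only a few dozen bytes, which is consistent with leaving a deliberately minimal test file uncompressed.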





Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]

2025-01-08 Thread via GitHub


wgtmac commented on PR #67:
URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2579044399

   Thanks for the update! Will merge it tomorrow if no objection.





Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]

2025-01-07 Thread via GitHub


wgtmac commented on PR #67:
URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2576780835

   ```
   File path:  bad_data/ARROW-GH-45185.parquet
   Created by: parquet-cpp-arrow version 19.0.0-SNAPSHOT
   Properties: (none)
   Schema:
   message schema {
     repeated int64 int64_field;
   }

   Row group 0:  count: 50  19.10 B records  start: 4  total(compressed): 955 B total(uncompressed):955 B

                type   encodings  count  avg size  nulls  min / max
   int64_field  INT64  _ _ R      100    9.55 B    0      "0" / "99"

   Column: int64_field

   page  type  enc  count  avg size  size   rows  nulls  min / max
   0-D   dict  _ _  100    8.00 B    800 B
   0-1   data  _ R  100    1.18 B    118 B

   "columnIndexReference" : {
     "offset" : 959,
     "length" : 31
   },
   "offsetIndexReference" : {
     "offset" : 990,
     "length" : 12
   },
   ```
   
   The file size is 1.2K. Could we reduce it as much as possible? For example:
   - leverage compression like zstd
   - disable dictionary encoding
   - disable page index
   - reduce row count
   
   BTW, `repeated int64 int64_field` is a special case of an unannotated list
type, which we should avoid; see
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md?plain=1#L607-L624
Should we replace it with a LIST-annotated type? cc @pitrou @mapleFU
   





Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]

2025-01-07 Thread via GitHub


raulcd commented on PR #67:
URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2575641035

   CC @wgtmac 

