Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]
mapleFU merged PR #67: URL: https://github.com/apache/parquet-testing/pull/67 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]
adamreeve commented on PR #67: URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2579050396 Thanks! I thought I left a comment earlier but GitHub was having an outage so maybe it got lost. I made the suggested changes but kept the data uncompressed as enabling zstd compression actually increased the file size slightly with such small data.
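The observation above, that compression can make a very small payload larger, can be illustrated with a minimal sketch. It uses `zlib` from the Python standard library as a stand-in, not zstd, and it compresses raw bytes rather than a Parquet page, so the numbers do not correspond to the test file. The helper name `compression_overhead` and both sample payloads are hypothetical.

```python
import zlib


def compression_overhead(payload: bytes, level: int = 9) -> int:
    """Return compressed size minus raw size (positive means the data grew)."""
    return len(zlib.compress(payload, level)) - len(payload)


# Hypothetical tiny payload: 100 distinct byte values, nothing to deduplicate.
tiny = bytes(range(100))

# Hypothetical larger payload: the same pattern repeated 1024 times.
large = bytes(range(100)) * 1024

# The fixed framing cost (header, checksum, block overhead) dominates the
# tiny payload, while the repetitive large payload shrinks substantially.
print("tiny payload overhead: ", compression_overhead(tiny))
print("large payload overhead:", compression_overhead(large))
```

The same effect is plausible for any codec with per-stream framing, which is consistent with leaving a file of this size uncompressed.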
Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]
wgtmac commented on PR #67: URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2579044399 Thanks for the update! Will merge it tomorrow if no objection.
Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]
wgtmac commented on PR #67: URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2576780835

```
File path:  bad_data/ARROW-GH-45185.parquet
Created by: parquet-cpp-arrow version 19.0.0-SNAPSHOT
Properties: (none)
Schema:
message schema {
  repeated int64 int64_field;
}

Row group 0:  count: 50  19.10 B records  start: 4  total(compressed): 955 B total(uncompressed):955 B
             type   encodings  count  avg size  nulls  min / max
int64_field  INT64  _ _ R      100    9.55 B    0      "0" / "99"

Column: int64_field
  page  type  enc  count  avg size  size   rows  nulls  min / max
  0-D   dict  _ _  100    8.00 B    800 B
  0-1   data  _ R  100    1.18 B    118 B

"columnIndexReference" : {
  "offset" : 959,
  "length" : 31
},
"offsetIndexReference" : {
  "offset" : 990,
  "length" : 12
},
```

The file size is 1.2K. Could we reduce it as much as possible? For example:

- leverage compression like zstd
- disable dictionary encoding
- disable page index
- reduce row count

BTW, `repeated int64 int64_field` is a special case of unannotated list type which we should avoid: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md?plain=1#L607-L624. Should we replace it with LIST-annotated type?

cc @pitrou @mapleFU
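For readers unfamiliar with how repetition levels relate to the row and value counts in the dump above (50 rows, 100 values), here is a minimal sketch in plain Python. It assumes a single top-level `repeated` column, so the maximum repetition level is 1 and a level of 0 marks the start of a new row. The function name `count_rows` and the sample level sequence are hypothetical; the sketch does not read the test file and does not claim to reproduce the specific kind of invalid data that file contains.

```python
def count_rows(rep_levels, max_rep_level=1):
    """Validate a sequence of repetition levels and return the row count.

    Assumes the sequence covers a whole row group, so it must begin
    at a row boundary (repetition level 0).
    """
    if rep_levels and rep_levels[0] != 0:
        raise ValueError("first repetition level must be 0 (start of a row)")
    rows = 0
    for i, level in enumerate(rep_levels):
        if not 0 <= level <= max_rep_level:
            raise ValueError(
                f"level {level} at index {i} is outside 0..{max_rep_level}"
            )
        if level == 0:
            # Level 0 means this value begins a new row.
            rows += 1
    return rows


# Hypothetical levels: 50 rows of two values each, 100 values in total.
levels = [0, 1] * 50
print(count_rows(levels))  # 50
```

A reader that trusts the levels without checks of this kind could miscount rows or index out of bounds, which is the sort of failure a `bad_data` file is meant to exercise.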
Re: [PR] GH-45185: Add bad_data file with invalid repetition levels [parquet-testing]
raulcd commented on PR #67: URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2575641035 CC @wgtmac