[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated files
[ https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J Y updated PARQUET-2181:
-------------------------
    Summary: parquet-cli fails at supporting parquet-protobuf generated files  (was: parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them)

> parquet-cli fails at supporting parquet-protobuf generated files
> ----------------------------------------------------------------
>
>                 Key: PARQUET-2181
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2181
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cli
>            Reporter: J Y
>            Priority: Critical
>         Attachments: samples.tgz
>
> i generated a parquet file using a protobuf with this proto definition:
> {code:java}
> message IndexPath {
>   // Index of item in path.
>   repeated int32 index = 1;
> }
> message SomeEvent {
>   // truncated/obfuscated wrapper
>   optional IndexPath client_position = 1;
> }
> {code}
> this gets translated to the following parquet schema using the new compliant schema for lists:
> {code:java}
> message SomeEvent {
>   optional group client_position = 1 {
>     optional group index (LIST) = 1 {
>       repeated group list {
>         required int32 element;
>       }
>     }
>   }
> }
> {code}
> this causes parquet-cli cat to barf on a file containing these events:
> {quote}java.lang.RuntimeException: Failed on record 0
> at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
> at org.apache.parquet.cli.Main.run(Main.java:157)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.parquet.cli.Main.main(Main.java:187)
> Caused by: java.lang.ClassCastException: required int32 element is not a group
> at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
> at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
> at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
> at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
> at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:539)
> at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:489)
> at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
> at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
> at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
> at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
> at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:91)
> at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
> at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
> at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
> at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
> at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
> ... 3 more
> {quote}
> using the old parquet-tools binary to cat this file works fine.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
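The mechanism behind the ClassCastException above can be illustrated with a simplified model. This is a hypothetical sketch, not the real parquet-mr classes: the Avro converter, walking what it believes is a nested record, unconditionally asks the matching Parquet type for its group view, and when the file's compliant 3-level LIST holds a primitive element, that view does not exist.

```java
// Simplified stand-ins for parquet-mr's Type/GroupType (hypothetical model).
abstract class Type {
    final String name;
    Type(String name) { this.name = name; }
    GroupType asGroupType() {
        // Mirrors the real behavior: a primitive cannot be viewed as a group.
        throw new ClassCastException(name + " is not a group");
    }
}

class PrimitiveType extends Type {
    PrimitiveType(String name) { super(name); }
}

class GroupType extends Type {
    final Type[] fields;
    GroupType(String name, Type... fields) { super(name); this.fields = fields; }
    @Override GroupType asGroupType() { return this; }
}

public class ListCastDemo {
    // Attempts the unchecked cast the converter performs on a list element.
    public static String tryReadElementAsGroup(Type element) {
        try {
            element.asGroupType();
            return "ok";
        } catch (ClassCastException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // 3-level compliant list: repeated group list { required int32 element; }
        GroupType list = new GroupType("list",
                new PrimitiveType("required int32 element"));
        System.out.println(tryReadElementAsGroup(list.fields[0]));
    }
}
```

With a group element the cast succeeds; with the primitive element from the reported schema it reproduces the exact error text in the trace.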
[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them
[ https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J Y updated PARQUET-2181:
-------------------------
    Priority: Critical  (was: Major)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them
[ https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598196#comment-17598196 ]

J Y commented on PARQUET-2181:
------------------------------
i've attached some parquet files that all read fine using parquet-tools (both the deprecated version from parquet-mr and the [one written in python|https://github.com/ktrueda/parquet-tools]) *but do not read at all using parquet-cli*. parquet-cli's meta command works fine.

it turns out there are other stack traces when trying to use parquet-cli to read these files. in addition to the repeated primitive issue highlighted originally, there are 2 other issues like the following exhibited in these files:
{quote}--- ./raw/delivery-log/dt=2022-08-10/hour=04/part-02a95a0e-bd21-4476-9d0f-d1896687b12a-0
Argument error: Map key type must be binary (UTF8): required int32 key

--- ./raw/user/dt=2022-08-10/hour=04/part-8cac1d0c-fb7f-4a9a-b77e-b3dd59f89333-0
Unknown error
java.lang.RuntimeException: Failed on record 0
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
at org.apache.parquet.cli.Main.run(Main.java:157)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'null_value' not found
at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:221)
{quote}
is using AvroReadSupport and AvroRecordConverter the right way to go for protobufs? it looks like the parquet-tools that was deprecated in 1.12.3+ doesn't use the parquet-avro approach to reading (it uses [its own SimpleReadSupport approach|https://github.com/apache/parquet-mr/tree/apache-parquet-1.12.2/parquet-tools-deprecated/src/main/java/org/apache/parquet/tools/read]), which makes sense to me given that the underlying schema and data written in parquet-protobuf generated files are not avro...
should we move parquet-cli back to SimpleReadSupport instead of relying on what appears to be a broken AvroReadSupport when dealing with proto generated files?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
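The contrast between the two read paths can be sketched in miniature. This is a hypothetical model, not the real SimpleReadSupport or AvroRecordConverter API: a schema-driven reader picks a converter by inspecting the file's own Parquet type, so a primitive list element is dispatched to a primitive converter instead of hitting an unchecked group cast.

```java
// Hypothetical sketch of schema-driven converter dispatch (simplified names;
// not parquet-mr's actual classes or signatures).
public class ConverterDispatchDemo {
    interface SchemaNode { boolean isPrimitive(); String name(); }

    record Primitive(String name) implements SchemaNode {
        public boolean isPrimitive() { return true; }
    }

    record Group(String name) implements SchemaNode {
        public boolean isPrimitive() { return false; }
    }

    // Defensive dispatch: check the node's kind before treating it as a group,
    // rather than assuming every list element is a group as the Avro path does.
    static String newConverter(SchemaNode element) {
        return element.isPrimitive()
            ? "PrimitiveConverter(" + element.name() + ")"
            : "GroupConverter(" + element.name() + ")";
    }

    public static void main(String[] args) {
        System.out.println(newConverter(new Primitive("required int32 element")));
        System.out.println(newConverter(new Group("repeated group list")));
    }
}
```

The primitive element from the reported schema is handled without any cast; whether parquet-cli should adopt this shape is exactly the question the comment raises.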
[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them
[ https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J Y updated PARQUET-2181:
-------------------------
    Attachment: samples.tgz

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598185#comment-17598185 ]

ASF GitHub Bot commented on PARQUET-758:
----------------------------------------
gszadovszky commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1232491783

I've come up with this ordering thing because we specify it for every logical type. (Unfortunately we don't do this for primitive types.) Therefore, I would expect to have the order specified for this new logical type as well, which is not trivial and requires solving [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222). At least we should add a note about this issue.

> It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

I do not have the proper environment to test it, but based on the code we do not handle unknown logical types well in parquet-mr. I think it handles unknown logical types as if they were not there at all, which is fine from the data point of view, but we would blindly use the statistics, which may cause data loss. Created [PARQUET-2182](https://issues.apache.org/jira/browse/PARQUET-2182) to track this.

> [Format] HALF precision FLOAT Logical type
> ------------------------------------------
>
>                 Key: PARQUET-758
>                 URL: https://issues.apache.org/jira/browse/PARQUET-758
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Julien Le Dem
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (PARQUET-2182) Handle unknown logical types
Gabor Szadovszky created PARQUET-2182:
-----------------------------------------

             Summary: Handle unknown logical types
                 Key: PARQUET-2182
                 URL: https://issues.apache.org/jira/browse/PARQUET-2182
             Project: Parquet
          Issue Type: Bug
            Reporter: Gabor Szadovszky

New logical types introduced in parquet-format shall be properly handled in parquet-mr releases that are not aware of this new type. In this case we shall read the data as if only the primitive type were defined (without a logical type), with one exception: we shall not use min/max based statistics (including column indexes), since we don't know the proper ordering of that type.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
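The rule the ticket proposes can be sketched as a small guard. This is a hedged, hypothetical illustration (the names, the set of known types, and the stats shape are all made up for the example, not parquet-mr's API): values of an unknown logical type are still decodable via the primitive type, but min/max statistics must be dropped because the unknown type's sort order may differ from the primitive's.

```java
import java.util.Optional;
import java.util.Set;

public class UnknownLogicalTypeDemo {
    // Hypothetical set of logical types this reader version understands.
    static final Set<String> KNOWN = Set.of("STRING", "DECIMAL", "DATE", "LIST", "MAP");

    record ColumnStats(int min, int max) {}

    // Returns usable min/max statistics only when the logical type's ordering
    // is known; otherwise drops them so filters can't silently skip rows.
    static Optional<ColumnStats> usableStats(String logicalType, ColumnStats fileStats) {
        if (logicalType == null || KNOWN.contains(logicalType)) {
            return Optional.of(fileStats);
        }
        // Unknown logical type: the values remain readable as the primitive,
        // but comparing against these min/max bounds would be unsound.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(usableStats("DATE", new ColumnStats(1, 9)).isPresent());
        System.out.println(usableStats("FLOAT16", new ColumnStats(1, 9)).isPresent());
    }
}
```

Here a new annotation such as a half-float type would fall into the "drop statistics" branch on an older reader, which is the behavior the ticket asks for.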
[jira] [Created] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them
J Y created PARQUET-2181:
Summary: parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them
Key: PARQUET-2181
URL: https://issues.apache.org/jira/browse/PARQUET-2181
Project: Parquet
Issue Type: Bug
Components: parquet-cli
Reporter: J Y

i generated a parquet file using a protobuf with this proto definition:
{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}
message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}{quote}
this gets translated to the following parquet schema using the new compliant schema for lists:
{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}{quote}
this causes parquet-cli cat to barf on a file containing these events:
{quote}java.lang.RuntimeException: Failed on record 0
  at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
  at org.apache.parquet.cli.Main.run(Main.java:157)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
  at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
  at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
  at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
  at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
  at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
  at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:539)
  at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:489)
  at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
  at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
  at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
  at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
  at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:91)
  at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
  at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
  at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
  at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
  at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
  at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
  at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
  at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
  ... 3 more{quote}
using the old parquet-tools binary to cat this file works fine.
--
This message was sent by Atlassian Jira (v8.20.10#820010)
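For context on the ClassCastException above: the Avro read path calls Type.asGroupType() on the child of the repeated group, which is valid for the legacy 2-level list encoding but fails on the compliant 3-level encoding, where that child is a primitive. A minimal standalone sketch of the failing cast (class names are illustrative stand-ins, not the real org.apache.parquet.schema types):

```java
// Standalone model of the cast that fails; only mirrors the shape of
// parquet-mr's Type/GroupType hierarchy, not its actual code.
abstract class SchemaType {
  final String description;
  SchemaType(String description) { this.description = description; }

  // Mirrors Type.asGroupType(): throws when the node is not a group.
  GroupSchemaType asGroupType() {
    if (!(this instanceof GroupSchemaType)) {
      throw new ClassCastException(description + " is not a group");
    }
    return (GroupSchemaType) this;
  }
}

class PrimitiveSchemaType extends SchemaType {
  PrimitiveSchemaType(String d) { super(d); }
}

class GroupSchemaType extends SchemaType {
  GroupSchemaType(String d) { super(d); }
}

public class ListCastDemo {
  // In the 3-level encoding, "repeated group list" wraps a primitive
  // "required int32 element". A converter expecting the legacy 2-level
  // encoding casts that child to a group and fails.
  static String tryCast() {
    SchemaType element = new PrimitiveSchemaType("required int32 element");
    try {
      element.asGroupType();
      return "no error";
    } catch (ClassCastException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(tryCast()); // same message as in the reported trace
  }
}
```

This reproduces the exact exception message from the report ("required int32 element is not a group"), which points at the list-encoding mismatch rather than a corrupt file.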
[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type
[ https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598174#comment-17598174 ] ASF GitHub Bot commented on PARQUET-1711: - jinyius commented on code in PR #988: URL: https://github.com/apache/parquet-mr/pull/988#discussion_r959168794

## parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }

   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-        addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),

Review Comment: i had been working on this issue as well and arrived at a similar solution to this one (however, without skipping/losing data) and linked to the prs in this pr conversation. ptal, and if you folks prefer it, i can submit a merge against head and close out this pr.
> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.10.1
> Reporter: Lawrence He
> Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible:
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
>   map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
> to work around json schema conversion.
> However writing the above messages traps the parquet writer in an infinite loop due
> to the "general type" support in protobuf. The current implementation keeps
> referencing the 6 possible types defined in protobuf (null, bool, number, string,
> struct, list) and enters an infinite loop when referencing "struct".
> {code:java}
> java.lang.StackOverflowError
>   at java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418)
>   at java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410)
>   at java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>   at java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
> {code}
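The cycle-breaking idea in the PR #988 diff (track ancestor type names and stop descending when a type reappears on the current path) can be illustrated with a standalone sketch; the type graph and names below are illustrative, not the real parquet-protobuf converter:

```java
import java.util.*;

public class CycleBreakDemo {
  // Illustrative field graph: google.protobuf.Value references Struct and
  // ListValue, and both reference Value again, so a naive walk never ends.
  static final Map<String, List<String>> FIELDS = Map.of(
      "Value", List.of("Struct", "ListValue"),
      "Struct", List.of("Value"),
      "ListValue", List.of("Value"));

  // Walk the type graph while tracking the ancestors on the current path;
  // when a type reappears, emit a marker and stop descending instead of
  // recursing forever (simplified version of the parentNames check).
  static void convert(String type, Deque<String> path, List<String> out) {
    if (path.contains(type)) {
      out.add("SKIP " + type); // circular dependency: break the cycle
      return;
    }
    path.push(type);
    out.add("GROUP " + type);
    for (String child : FIELDS.getOrDefault(type, List.of())) {
      convert(child, path, out);
    }
    path.pop();
  }

  public static void main(String[] args) {
    List<String> out = new ArrayList<>();
    convert("Value", new ArrayDeque<>(), out);
    System.out.println(out);
  }
}
```

The walk terminates because every recursive edge is cut the first time it closes a loop; this is also why the approach loses data — the skipped fields never make it into the converted schema, which is the trade-off the review comment objects to.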
[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type
[ https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598171#comment-17598171 ] ASF GitHub Bot commented on PARQUET-1711: - jinyius commented on PR #988: URL: https://github.com/apache/parquet-mr/pull/988#issuecomment-1232470935 hmm... what timing. i actually have a pr for what i think is a more robust approach that truncates at an arbitrary recursion depth by putting the remaining recursion levels into a binary blob. this approach lets downstream querying things query the non-truncated parts fine, and allows for udfs to be defined to reinstantiate the truncated recursed fields. i didn't submit the pr for merge quite yet b/c i'm busy trying to finish off the overall project i needed this for at work, so it's just coded against 1.12.3 and not head. ptal, and if everyone likes my proposal, i can spend a few cycles and move it to head: schema converter pr: - https://github.com/promotedai/parquet-mr/pull/1 write support pr: - https://github.com/promotedai/parquet-mr/pull/2 > [parquet-protobuf] stack overflow when work with well known json type > - > > Key: PARQUET-1711 > URL: https://issues.apache.org/jira/browse/PARQUET-1711 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.10.1 >Reporter: Lawrence He >Priority: Major > > Writing following protobuf message as parquet file is not possible: > {code:java} > syntax = "proto3"; > import "google/protobuf/struct.proto"; > package test; > option java_outer_classname = "CustomMessage"; > message TestMessage { > map<string, google.protobuf.ListValue> data = 1; > } {code} > Protobuf introduced "well known json type" such as > [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue] > to work around json schema conversion. > However writing above messages traps parquet writer into an infinite loop due > to the "general type" support in protobuf. 
Current implementation will keep > referencing 6 possible types defined in protobuf (null, bool, number, string, > struct, list) and entering infinite loop when referencing "struct". > {code:java} > java.lang.StackOverflowError at > java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418) at > java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at > java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044) > at > java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043) > at > org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64) > at > org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96) > at > org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66) > at > org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96) > at > org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66) > at > org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96) > at > org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66) > at > org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96) > at > org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66) > at > org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-mr] jinyius commented on pull request #988: PARQUET-1711: Break circular dependencies in proto definitions
jinyius commented on PR #988: URL: https://github.com/apache/parquet-mr/pull/988#issuecomment-1232470935 hmm... what timing. i actually have a pr for what i think is a more robust approach that truncates at an arbitrary recursion depth by putting the remaining recursion levels into a binary blob. this approach lets downstream querying things query the non-truncated parts fine, and allows for udfs to be defined to reinstantiate the truncated recursed fields. i didn't submit the pr for merge quite yet b/c i'm busy trying to finish off the overall project i needed this for at work, so it's just coded against 1.12.3 and not head. ptal, and if everyone likes my proposal, i can spend a few cycles and move it to head: schema converter pr: - https://github.com/promotedai/parquet-mr/pull/1 write support pr: - https://github.com/promotedai/parquet-mr/pull/2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
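The truncate-at-depth idea described above can be sketched with a toy recursive value. Everything here (`DepthTruncate`, `MAX_DEPTH`, the `BLOB:` marker) is illustrative and not taken from the linked PRs, which serialize the remaining proto levels to actual bytes rather than a string:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DepthTruncate {
    static final int MAX_DEPTH = 3; // arbitrary cut-off, as in the proposal

    // Recursively copies a nested-list value, but once MAX_DEPTH is reached
    // the remaining levels are kept as one opaque blob instead of recursing.
    // A reader can still query everything above the cut, and a UDF could
    // decode the blob to recover the truncated tail.
    static Object truncate(Object v, int depth) {
        if (!(v instanceof List)) {
            return v; // leaf: copy through unchanged
        }
        if (depth >= MAX_DEPTH) {
            // Stand-in for "serialize the rest to a binary blob".
            return "BLOB:" + v;
        }
        List<Object> out = new ArrayList<>();
        for (Object child : (List<?>) v) {
            out.add(truncate(child, depth + 1));
        }
        return out;
    }

    // Demo: four levels of nesting; the innermost level ends up in the blob.
    static String demo() {
        Object nested = Arrays.asList(Arrays.asList(Arrays.asList(Arrays.asList("x"))));
        return truncate(nested, 0).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The design point versus the skip-on-cycle approach: no data is dropped, only pushed below a fixed horizon, at the cost of needing a decoder on the read side.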
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598151#comment-17598151 ] ASF GitHub Bot commented on PARQUET-758: emkornfield commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1232420353 > It is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) for details. I think these are orthogonal. I might be missing something but it seems like it would not be too hard to cast float16 to float in java/cpp and do the comparison in that space and cast it back down. This might not be the most efficient implementation but would be straightforward? I am probably missing something. It would be nice to resolve [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) so the same semantics would apply to all floating point numbers. > The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today? > [Format] HALF precision FLOAT Logical type > -- > > Key: PARQUET-758 > URL: https://issues.apache.org/jira/browse/PARQUET-758 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Julien Le Dem >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-format] emkornfield commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type
emkornfield commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1232420353 > It is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) for details. I think these are orthogonal. I might be missing something but it seems like it would not be too hard to cast float16 to float in java/cpp and do the comparison in that space and cast it back down. This might not be the most efficient implementation but would be straightforward? I am probably missing something. It would be nice to resolve [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) so the same semantics would apply to all floating point numbers. > The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
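The widen-then-compare idea discussed above is easy to prototype in Java. The decoder below is a hand-rolled IEEE 754 binary16-to-float conversion (`java.lang.Float.float16ToFloat` only exists from Java 20); class and method names are illustrative. Note that `Float.compare` already imposes a total order (NaN above +Inf, -0.0 below 0.0), which touches the PARQUET-1222 question of which total ordering to standardize:

```java
public class HalfCompare {
    // Decodes an IEEE 754 binary16 value, stored in the low 16 bits of an
    // int, into a java float. Covers zero/subnormal, normal, infinity, NaN.
    static float halfToFloat(int h) {
        int sign = (h >> 15) & 0x1;
        int exp = (h >> 10) & 0x1f;
        int mant = h & 0x3ff;
        float magnitude;
        if (exp == 0x1f) {
            magnitude = (mant == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        } else if (exp == 0) {
            magnitude = mant * 0x1p-24f;  // subnormal (or zero)
        } else {
            magnitude = (1f + mant / 1024f) * (float) Math.pow(2, exp - 15);
        }
        return sign == 1 ? -magnitude : magnitude;
    }

    // Compares two FLOAT16 bit patterns by widening to float first; every
    // binary16 value is exactly representable as a binary32, so the widening
    // itself is lossless.
    static int compareHalf(int a, int b) {
        return Float.compare(halfToFloat(a), halfToFloat(b));
    }

    public static void main(String[] args) {
        System.out.println(halfToFloat(0x3C00)); // 1.0
        System.out.println(halfToFloat(0xC000)); // -2.0
    }
}
```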
[jira] [Updated] (PARQUET-2180) make the default behavior for proto writing not-backwards compatible
[ https://issues.apache.org/jira/browse/PARQUET-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] J Y updated PARQUET-2180: - Description: https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps and lists in a spec compliant way. however, to not break existing libraries, a flag was introduced and defaulted the write behavior to NOT use the specs compliant writes. it's been over 5 years, and people should be really off of it. so much so, that trying to use the new parquet-cli tool to read parquet files generated by flink doesn't work b/c it's hard coded to never allow the old style. the deprecated parquet-tools reads these files fine b/c it's the older style. i started coding up a workaround in flink-parquet and parquet-cli, but stopped. we really should just move on at this point, imho. protobufs often have repeated primitives and maps, so it's more pressing to get proper specs compliant support for it now. we should keep the flag around and let people override it back to being backwards compatible though. i have the code written and can submit a PR if you'd like. i'm not an expert in parquet though, so i'm unclear as to the deep downstream ramifications of this change, so i would love to get feedback in this area. was: https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps and lists in a spec compliant way. however, to not break existing libraries, a flag was introduced and defaulted the write behavior to NOT use the specs compliant writes. it's been over 5 years, and people should be really off of it. so much so, that trying to use the new parquet-cli tool to read parquet files generated by flink doesn't work b/c it's hard coded to never allow the old style. the deprecated parquet-tools reads these files fine b/c it's the older style. i started coding up a workaround in flink-parquet and parquet-cli, but stopped. we really should just move on at this point, imho. 
protobufs often have repeated primitives and maps now, so it just makes sense to move on at this point. we should keep the flag around and let people override it back to being backwards compatible though. i have the code written and can submit a PR if you'd like. i'm not an expert in parquet though, so i'm unclear as to the deep downstream ramifications of this change, so i would love to get feedback in this area. > make the default behavior for proto writing not-backwards compatible > > > Key: PARQUET-2180 > URL: https://issues.apache.org/jira/browse/PARQUET-2180 > Project: Parquet > Issue Type: Improvement > Components: parquet-protobuf >Reporter: J Y >Priority: Minor > > https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps > and lists in a spec compliant way. however, to not break existing libraries, > a flag was introduced and defaulted the write behavior to NOT use the specs > compliant writes. > it's been over 5 years, and people should be really off of it. so much so, > that trying to use the new parquet-cli tool to read parquet files generated > by flink doesn't work b/c it's hard coded to never allow the old style. the > deprecated parquet-tools reads these files fine b/c it's the older style. > i started coding up a workaround in flink-parquet and parquet-cli, but > stopped. we really should just move on at this point, imho. protobufs often > have repeated primitives and maps, so it's more pressing to get proper specs > compliant support for it now. we should keep the flag around and let people > override it back to being backwards compatible though. > i have the code written and can submit a PR if you'd like. > i'm not an expert in parquet though, so i'm unclear as to the deep downstream > ramifications of this change, so i would love to get feedback in this area. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (PARQUET-2180) make the default behavior for proto writing not-backwards compatible
[ https://issues.apache.org/jira/browse/PARQUET-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] J Y updated PARQUET-2180: - Description: https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps and lists in a spec compliant way. however, to not break existing libraries, a flag was introduced and defaulted the write behavior to NOT use the specs compliant writes. it's been over 5 years, and people should be really off of it. so much so, that trying to use the new parquet-cli tool to read parquet files generated by flink doesn't work b/c it's hard coded to never allow the old style. the deprecated parquet-tools reads these files fine b/c it's the older style. i started coding up a workaround in flink-parquet and parquet-cli, but stopped. we really should just move on at this point, imho. protobufs often have repeated primitives and maps now, so it just makes sense to move on at this point. we should keep the flag around and let people override it back to being backwards compatible though. i have the code written and can submit a PR if you'd like. i'm not an expert in parquet though, so i'm unclear as to the deep downstream ramifications of this change, so i would love to get feedback in this area. was: https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps and lists in a spec compliant way. however, to not break existing libraries, a flag was introduced and defaulted the write behavior to NOT use the specs compliant writes. it's been over 5 years, and people should be really off of it. so much so, that trying to use the new parquet-cli tool to read parquet files generated by flink using doesn't work b/c it's hard coded to never allow the old style. the deprecated parquet-tools reads these files fine b/c it's the older style. i started coding up a workaround in flink-parquet and parquet-cli, but stopped. we really should just move on at this point, imho. 
protobufs often have repeated primitives and maps now, so it just makes sense to move on at this point. we should keep the flag around and let people override it back to being backwards compatible though. i have the code written and can submit a PR if you'd like. i'm not an expert in parquet though, so i'm unclear as to the deep downstream ramifications of this change, so i would love to get feedback in this area. > make the default behavior for proto writing not-backwards compatible > > > Key: PARQUET-2180 > URL: https://issues.apache.org/jira/browse/PARQUET-2180 > Project: Parquet > Issue Type: Improvement > Components: parquet-protobuf >Reporter: J Y >Priority: Minor > > https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps > and lists in a spec compliant way. however, to not break existing libraries, > a flag was introduced and defaulted the write behavior to NOT use the specs > compliant writes. > it's been over 5 years, and people should be really off of it. so much so, > that trying to use the new parquet-cli tool to read parquet files generated > by flink doesn't work b/c it's hard coded to never allow the old style. the > deprecated parquet-tools reads these files fine b/c it's the older style. > i started coding up a workaround in flink-parquet and parquet-cli, but > stopped. we really should just move on at this point, imho. protobufs often > have repeated primitives and maps now, so it just makes sense to move on at > this point. we should keep the flag around and let people override it back > to being backwards compatible though. > i have the code written and can submit a PR if you'd like. > i'm not an expert in parquet though, so i'm unclear as to the deep downstream > ramifications of this change, so i would love to get feedback in this area. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (PARQUET-2180) make the default behavior for proto writing not-backwards compatible
J Y created PARQUET-2180: Summary: make the default behavior for proto writing not-backwards compatible Key: PARQUET-2180 URL: https://issues.apache.org/jira/browse/PARQUET-2180 Project: Parquet Issue Type: Improvement Components: parquet-protobuf Reporter: J Y https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps and lists in a spec compliant way. however, to not break existing libraries, a flag was introduced and defaulted the write behavior to NOT use the specs compliant writes. it's been over 5 years, and people should be really off of it. so much so, that trying to use the new parquet-cli tool to read parquet files generated by flink doesn't work b/c it's hard coded to never allow the old style. the deprecated parquet-tools reads these files fine b/c it's the older style. i started coding up a workaround in flink-parquet and parquet-cli, but stopped. we really should just move on at this point, imho. protobufs often have repeated primitives and maps now, so it just makes sense to move on at this point. we should keep the flag around and let people override it back to being backwards compatible though. i have the code written and can submit a PR if you'd like. i'm not an expert in parquet though, so i'm unclear as to the deep downstream ramifications of this change, so i would love to get feedback in this area. -- This message was sent by Atlassian Jira (v8.20.10#820010)
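For a concrete picture of what the flag changes, here is the repeated-primitive case from PARQUET-2181 written both ways. The spec-compliant shape matches that ticket; the legacy shape is reconstructed from how parquet-protobuf writes repeated fields with the compliant-schema flag off, so treat it as illustrative:

```
// proto: repeated int32 index = 1;

// legacy (flag off): the repeated primitive is written directly
message SomeEvent {
  optional group client_position {
    repeated int32 index;
  }
}

// spec-compliant LIST (flag on): 3-level structure
message SomeEvent {
  optional group client_position {
    optional group index (LIST) {
      repeated group list {
        required int32 element;
      }
    }
  }
}
```

The 3-level form is what readers like parquet-cli's Avro-based cat path expect, which is why the legacy form trips them up.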
[jira] [Created] (PARQUET-2179) Add a test for skipping repeated fields
fatemah created PARQUET-2179: Summary: Add a test for skipping repeated fields Key: PARQUET-2179 URL: https://issues.apache.org/jira/browse/PARQUET-2179 Project: Parquet Issue Type: Improvement Components: parquet-cpp Reporter: fatemah The existing test only tests non-repeated fields. Adding a test for repeated fields to make it clear that it is skipping values and not records. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597667#comment-17597667 ] ASF GitHub Bot commented on PARQUET-758: gszadovszky commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231323733 > > It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. > > While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190 (`npy_half` is just an unsigned 16-bit integer in this context) It is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) for details. > [Format] HALF precision FLOAT Logical type > -- > > Key: PARQUET-758 > URL: https://issues.apache.org/jira/browse/PARQUET-758 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Julien Le Dem >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-format] gszadovszky commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type
gszadovszky commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231323733 > > It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. > > While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190 (`npy_half` is just an unsigned 16-bit integer in this context) It is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) for details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597650#comment-17597650 ] ASF GitHub Bot commented on PARQUET-758: pitrou commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231300374 > It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/main/numpy/core/src/npymath/halffloat.c#L169-L190 (`npy_half` is just an unsigned 16-bit integer in this context) > [Format] HALF precision FLOAT Logical type > -- > > Key: PARQUET-758 > URL: https://issues.apache.org/jira/browse/PARQUET-758 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Julien Le Dem >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-format] pitrou commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type
pitrou commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231300374 > It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/main/numpy/core/src/npymath/halffloat.c#L169-L190 (`npy_half` is just an unsigned 16-bit integer in this context) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597640#comment-17597640 ] ASF GitHub Bot commented on PARQUET-758: gszadovszky commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231284535 > It isn't clear to me if this should be a logical type or a physical type. We would need to understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts? I think the basic idea behind having physical and logical types is to support forward compatibility since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical value (a `Binary` with two bytes in this case). So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type. The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. (BTW the sorting order of floating point numbers is still an open issue: [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222)) > [Format] HALF precision FLOAT Logical type > -- > > Key: PARQUET-758 > URL: https://issues.apache.org/jira/browse/PARQUET-758 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Julien Le Dem >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-format] gszadovszky commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type
gszadovszky commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231284535 > It isn't clear to me if this should be a logical type or a physical type. We would need to understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts? I think the basic idea behind having physical and logical types is to support forward compatibility since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical value (a `Binary` with two bytes in this case). So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type. The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. (BTW the sorting order of floating point numbers is still an open issue: [PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222)) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org