[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated files

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Summary: parquet-cli fails at supporting parquet-protobuf generated files  
(was: parquet-cli fails at supporting parquet-protobuf generated schemas that 
have repeated primitives in them)

> parquet-cli fails at supporting parquet-protobuf generated files
> 
>
> Key: PARQUET-2181
> URL: https://issues.apache.org/jira/browse/PARQUET-2181
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Reporter: J Y
>Priority: Critical
> Attachments: samples.tgz
>
>
> I generated a Parquet file using a protobuf with this proto definition:
> {code:java}
> message IndexPath {
>   // Index of item in path.
>   repeated int32 index = 1;
> }
> message SomeEvent {
>   // truncated/obfuscated wrapper
>   optional IndexPath client_position = 1;
> }
> {code}
> This gets translated to the following Parquet schema using the new spec-compliant 
> schema for lists:
> {code:java}
> message SomeEvent {
>   optional group client_position = 1 {
>     optional group index (LIST) = 1 {
>       repeated group list {
>         required int32 element;
>       }
>     }
>   }
> }
> {code}
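> For context, a minimal sketch of how such a file can be produced with parquet-protobuf 
> (assuming the protoc-generated SomeEvent/IndexPath classes for the proto above and the 
> simple ProtoParquetWriter constructor; the spec-compliant LIST layout is controlled 
> separately by the parquet.proto.writeSpecsCompliant setting, assumed enabled here):
> {code:java}
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.proto.ProtoParquetWriter;
>
> public class WriteSomeEvent {
>   public static void main(String[] args) throws Exception {
>     // SomeEvent and IndexPath are the protoc-generated classes for the messages above.
>     SomeEvent event = SomeEvent.newBuilder()
>         .setClientPosition(IndexPath.newBuilder().addIndex(1).addIndex(2))
>         .build();
>
>     // Writes one record; the repeated int32 becomes the 3-level LIST group shown above
>     // when spec-compliant list writing is enabled in the writer configuration.
>     try (ProtoParquetWriter<SomeEvent> writer =
>         new ProtoParquetWriter<>(new Path("some_event.parquet"), SomeEvent.class)) {
>       writer.write(event);
>     }
>   }
> }
> {code}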
> This causes parquet-cli cat to fail on a file containing these events:
> {quote}java.lang.RuntimeException: Failed on record 0
>         at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
>         at org.apache.parquet.cli.Main.run(Main.java:157)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.parquet.cli.Main.main(Main.java:187)
> Caused by: java.lang.ClassCastException: required int32 element is not a group
>         at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
>         at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
>         at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
>         at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
>         at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:539)
>         at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:489)
>         at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
>         at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
>         at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
>         at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
>         at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:91)
>         at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
>         at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
>         at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
>         at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
>         at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
>         at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
>         at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
>         at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
>         ... 3 more
> {quote}
> Using the old parquet-tools binary to cat this file works fine.
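> For reference, the same failure is reproducible outside the CLI by going through the 
> Avro read path directly, which is essentially what parquet-cli cat does via 
> AvroReadSupport (a minimal sketch using parquet-avro's AvroParquetReader; the file 
> name is a placeholder):
> {code:java}
> import org.apache.avro.generic.GenericRecord;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.avro.AvroParquetReader;
> import org.apache.parquet.hadoop.ParquetReader;
>
> public class ReadSomeEventWithAvro {
>   public static void main(String[] args) throws Exception {
>     try (ParquetReader<GenericRecord> reader =
>         AvroParquetReader.<GenericRecord>builder(new Path("some_event.parquet")).build()) {
>       GenericRecord record;
>       // The Avro converters are set up lazily on the first read(), which is where
>       // "required int32 element is not a group" is thrown for this schema.
>       while ((record = reader.read()) != null) {
>         System.out.println(record);
>       }
>     }
>   }
> }
> {code}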



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Priority: Critical  (was: Major)



[jira] [Commented] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598196#comment-17598196
 ] 

J Y commented on PARQUET-2181:
--

I've attached some parquet files that all read fine using parquet-tools (both 
the deprecated version from parquet-mr and the [one written in 
python|https://github.com/ktrueda/parquet-tools]) *but do not read at all using 
parquet-cli*. parquet-cli's meta command works fine.

It turns out there are other stack traces when trying to use parquet-cli to read 
these files. In addition to the repeated-primitive issue highlighted originally, 
these files exhibit two other issues, like the following:

{quote}--- ./raw/delivery-log/dt=2022-08-10/hour=04/part-02a95a0e-bd21-4476-9d0f-d1896687b12a-0
Argument error: Map key type must be binary (UTF8): required int32 key

--- ./raw/user/dt=2022-08-10/hour=04/part-8cac1d0c-fb7f-4a9a-b77e-b3dd59f89333-0
Unknown error
java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'null_value' not found
        at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:221)
{quote}
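The map-key error is consistent with Avro's map model: Avro maps always use string 
keys, so a Parquet MAP whose key is int32 (as parquet-protobuf can write for a proto 
map with integer keys; the actual field in the attached file is not shown here) has no 
Avro equivalent. A small illustration with the Avro Schema API, not parquet-mr code:

{code:java}
import org.apache.avro.Schema;

public class AvroMapKeys {
  public static void main(String[] args) {
    // Avro's map schema only takes a value type; keys are implicitly strings.
    // That is why converting an int32-keyed Parquet MAP to an Avro schema fails
    // with "Map key type must be binary (UTF8)".
    Schema map = Schema.createMap(Schema.create(Schema.Type.INT));
    System.out.println(map); // {"type":"map","values":"int"}
  }
}
{code}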

Is using AvroReadSupport and AvroRecordConverter the right way to go for 
protobufs? It looks like the parquet-tools that was deprecated in 1.12.3+ 
doesn't use the parquet-avro approach to reading (it uses [its own 
SimpleReadSupport 
approach|https://github.com/apache/parquet-mr/tree/apache-parquet-1.12.2/parquet-tools-deprecated/src/main/java/org/apache/parquet/tools/read]), 
which makes sense to me given that the underlying schema and data written in 
parquet-protobuf generated files are not Avro...

Should we move parquet-cli back to SimpleReadSupport instead of relying on what 
appears to be a broken AvroReadSupport when dealing with proto-generated files?
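
For comparison, a non-Avro read of the same files can be sketched with the example 
GroupReadSupport from parquet-hadoop (used here only as a stand-in for a 
SimpleReadSupport-style approach; it is not the deprecated parquet-tools code itself): 
it decodes records against the file's own Parquet schema, with no Avro schema 
conversion involved.

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class CatWithoutAvro {
  public static void main(String[] args) throws Exception {
    // Reads proto-generated files using the file's own schema, bypassing parquet-avro entirely.
    try (ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path(args[0])).build()) {
      Group group;
      while ((group = reader.read()) != null) {
        System.out.println(group); // Group.toString() prints nested fields line by line
      }
    }
  }
}
{code}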


[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Attachment: samples.tgz



[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2022-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598185#comment-17598185
 ] 

ASF GitHub Bot commented on PARQUET-758:


gszadovszky commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1232491783

   I've come up with this ordering requirement because we specify it for every 
logical type. (Unfortunately we don't do this for primitive types.) Therefore, 
I would expect the ordering to be specified for this new logical type as well, 
which is not trivial and requires solving 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222). At least we 
should add a note about this issue.
   
   > It seems this would require parquet implementations to null out statistics 
for logical types that they don't support, does parquet-mr do that today?
   
   I do not have the proper environment to test it, but based on the code we do 
not handle unknown logical types well in parquet-mr. I think it handles unknown 
logical types as if they were not there at all, which is fine from the data 
point of view, but we would blindly use the statistics, which may cause data 
loss. Created 
[PARQUET-2182](https://issues.apache.org/jira/browse/PARQUET-2182) to track 
this.




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>







[jira] [Created] (PARQUET-2182) Handle unknown logical types

2022-08-30 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2182:
-

 Summary: Handle unknown logical types
 Key: PARQUET-2182
 URL: https://issues.apache.org/jira/browse/PARQUET-2182
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky


New logical types introduced in parquet-format shall be properly handled by 
parquet-mr releases that are not aware of them. In this case we shall read the 
data as if only the primitive type were defined (without a logical type), with 
one exception: we shall not use min/max based statistics (including column 
indexes), since we don't know the proper ordering of that type.
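
A rough sketch of that rule, using only hypothetical names (nothing below is existing 
parquet-mr API): values are still decoded from the primitive type, but min/max based 
pruning is skipped whenever the logical type annotation is not recognized.

{code:java}
import java.util.Set;

// Hypothetical illustration of the intended behaviour; not parquet-mr code.
final class UnknownLogicalTypePolicy {
  private final Set<String> knownLogicalTypes;

  UnknownLogicalTypePolicy(Set<String> knownLogicalTypes) {
    this.knownLogicalTypes = knownLogicalTypes;
  }

  /** Data is always readable via the primitive type, even if the annotation is unknown. */
  boolean canReadValues(String logicalTypeAnnotation) {
    return true;
  }

  /** Min/max statistics and column indexes depend on the type's ordering, so skip them otherwise. */
  boolean canUseMinMaxStats(String logicalTypeAnnotation) {
    return logicalTypeAnnotation == null || knownLogicalTypes.contains(logicalTypeAnnotation);
  }
}
{code}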






[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Description: 
i generated a parquet file using a protobuf with this proto definition:

{code:java}
message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{code}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{code:java}
message SomeEvent {
  optional group client_position = 1 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}
{code}

this causes parquet-cli cat to barf on a file containing these events:
{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more
{quote}
using the old parquet-tools binary to cat this file works fine.

  was:
i generated a parquet file using a protobuf with this proto definition:

{code:java}
message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{code}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{code:java}
message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}
{code}

this causes parquet-cli cat to barf on a file containing these events:
{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        a

[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Description: 
i generated a parquet file using a protobuf with this proto definition:

{code:java}
message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
optional IndexPath client_position = 1;
}
{code}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{code:java}
message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}
{code}

this causes parquet-cli cat to barf on a file containing these events:
{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more
{quote}
using the old parquet-tools binary to cat this file works fine.

  was:
i generated a parquet file using a protobuf with this proto definition:
{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
optional IndexPath client_position = 1;
}
{quote}
this gets translated to the following parquet schema using the new compliant 
schema for lists:
{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}{quote}
this causes parquet-cli cat to barf on a file containing these events:
{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.pa

[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Description: 
i generated a parquet file using a protobuf with this proto definition:

{code:java}
message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{code}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{code:java}
message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}
{code}

this causes parquet-cli cat to barf on a file containing these events:
{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more
{quote}
using the old parquet-tools binary to cat this file works fine.

  was:
i generated a parquet file using a protobuf with this proto definition:

{code:java}
message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
optional IndexPath client_position = 1;
}
{code}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{code:java}
message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}
{code}

this causes parquet-cli cat to barf on a file containing these events:
{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at

[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Description: 
i generated a parquet file using a protobuf with this proto definition:
{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
optional IndexPath client_position = 1;
}
{quote}
this gets translated to the following parquet schema using the new compliant 
schema for lists:
{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}{quote}
this causes parquet-cli cat to barf on a file containing these events:
{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more
{quote}
using the old parquet-tools binary to cat this file works fine.

  was:
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
  }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.

[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Description: 
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
  }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more{quote}

using the old parquet-tools binary to cat this file works fine.

  was:
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {{}}
        required int32 element;
  }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apach

[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Description: 
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list }
        required int32 element;
  }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more{quote}

using the old parquet-tools binary to cat this file works fine.

  was:
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
  }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.p

[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Description: 
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
  }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more{quote}

using the old parquet-tools binary to cat this file works fine.

  was:
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list }
        required int32 element;
  }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.p

[jira] [Updated] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2181:
-
Description: 
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {{}}
        required int32 element;
  }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more{quote}

using the old parquet-tools binary to cat this file works fine.

  was:
i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apach

[jira] [Created] (PARQUET-2181) parquet-cli fails at supporting parquet-protobuf generated schemas that have repeated primitives in them

2022-08-30 Thread J Y (Jira)
J Y created PARQUET-2181:


 Summary: parquet-cli fails at supporting parquet-protobuf 
generated schemas that have repeated primitives in them
 Key: PARQUET-2181
 URL: https://issues.apache.org/jira/browse/PARQUET-2181
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cli
Reporter: J Y


i generated a parquet file using a protobuf with this proto definition:

{quote}message IndexPath {
  // Index of item in path.
  repeated int32 index = 1;
}

message SomeEvent {
  // truncated/obfuscated wrapper
  optional IndexPath client_position = 1;
}
{quote}

this gets translated to the following parquet schema using the new compliant 
schema for lists:

{quote}message SomeEvent {
  optional group client_position = 24 {
    optional group index (LIST) = 1 {
      repeated group list {
        required int32 element;
      }
    }
  }
}{quote}

this causes parquet-cli cat to barf on a file containing these events:

{quote}java.lang.RuntimeException: Failed on record 0
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
        at org.apache.parquet.cli.Main.run(Main.java:157)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:187)
Caused by: java.lang.ClassCastException: required int32 element is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
        at 
org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:539)
        at 
org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:489)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:137)
        at 
org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:91)
        at 
org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
        at 
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:190)
        at 
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:166)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
        at org.apache.parquet.cli.BaseCommand$1$1.(BaseCommand.java:344)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
        ... 3 more{quote}

using the old parquet-tools binary to cat this file works fine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598174#comment-17598174
 ] 

ASF GitHub Bot commented on PARQUET-1711:
-

jinyius commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r959168794


##
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),

Review Comment:
   I had been working on this issue as well and arrived at a similar solution 
to this one (however, without skipping/losing data), and I linked to those PRs 
in this PR conversation. Please take a look, and if you folks prefer that 
approach, I can submit a merge against head and close out this PR.





> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing following protobuf message as parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map data = 1;
> } {code}
> Protobuf introduced "well known json type" such like 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However writing above messages traps parquet writer into an infinite loop due 
> to the "general type" support in protobuf. Current implementation will keep 
> referencing 6 possible types defined in protobuf (null, bool, number, string, 
> struct, list) and entering infinite loop when referencing "struct".
> {code:java}
> java.lang.StackOverflowErrorjava.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-08-30 Thread GitBox


jinyius commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r959168794


##
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),

Review Comment:
   I had been working on this issue as well and arrived at a similar solution 
to this one (however, without skipping/losing data), and I linked to those PRs 
in this PR conversation. Please take a look, and if you folks prefer that 
approach, I can submit a merge against head and close out this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598171#comment-17598171
 ] 

ASF GitHub Bot commented on PARQUET-1711:
-

jinyius commented on PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#issuecomment-1232470935

   Hmm... what timing. I actually have a PR for what I think is a more robust 
approach: it truncates at an arbitrary recursion depth by putting the remaining 
recursion levels into a binary blob. This lets downstream query engines read 
the non-truncated parts as usual, and UDFs can be defined to reinstantiate the 
truncated recursive fields.
   
   I didn't submit the PR for merge quite yet because I'm busy trying to finish 
off the overall project I needed this for at work, so it's only coded against 
1.12.3 and not head.
   
   Please take a look, and if everyone likes the proposal, I can spend a few 
cycles and move it to head:
   
   Schema converter PR:
   - https://github.com/promotedai/parquet-mr/pull/1
   
   Write support PR:
   - https://github.com/promotedai/parquet-mr/pull/2
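
A rough sketch of the truncation idea described in that comment, assuming a 
configurable maximum depth; the helper name and hook point are hypothetical and 
are not the code in the linked PRs:

{code:java}
// Hypothetical illustration: once the configured depth is reached, the remaining
// nested message is stored as a single serialized-bytes (BINARY) value instead of
// recursing further, so a UDF can later parse the blob back into the cut-off levels.
import com.google.protobuf.Message;

final class RecursionTruncation {
  private static final int MAX_DEPTH = 3;  // assumed; configurable in the proposal

  // Returns the opaque blob to write when the recursion limit is hit,
  // or null to keep converting the nested message field normally.
  static byte[] truncateIfTooDeep(Message nested, int depth) {
    return depth >= MAX_DEPTH ? nested.toByteArray() : null;
  }
}
{code}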




> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing following protobuf message as parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map data = 1;
> } {code}
> Protobuf introduced "well known json type" such like 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However writing above messages traps parquet writer into an infinite loop due 
> to the "general type" support in protobuf. Current implementation will keep 
> referencing 6 possible types defined in protobuf (null, bool, number, string, 
> struct, list) and entering infinite loop when referencing "struct".
> {code:java}
> java.lang.StackOverflowErrorjava.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jinyius commented on pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-08-30 Thread GitBox


jinyius commented on PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#issuecomment-1232470935

   Hmm... what timing. I actually have a PR for what I think is a more robust 
approach: it truncates at an arbitrary recursion depth by putting the remaining 
recursion levels into a binary blob. This lets downstream query engines read 
the non-truncated parts as usual, and UDFs can be defined to reinstantiate the 
truncated recursive fields.
   
   I didn't submit the PR for merge quite yet because I'm busy trying to finish 
off the overall project I needed this for at work, so it's only coded against 
1.12.3 and not head.
   
   Please take a look, and if everyone likes the proposal, I can spend a few 
cycles and move it to head:
   
   Schema converter PR:
   - https://github.com/promotedai/parquet-mr/pull/1
   
   Write support PR:
   - https://github.com/promotedai/parquet-mr/pull/2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2022-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598151#comment-17598151
 ] 

ASF GitHub Bot commented on PARQUET-758:


emkornfield commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1232420353

   > It is not that trivial. For the half-precision floating point numbers we 
do not have native support in either cpp or java, so we can define the total 
ordering as we want. But we should do the same for the existing floating point 
numbers for which most languages have native support. Even though they follow 
the same standard, the total ordering either does not exist or has different 
implementations. See 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) for details.
   
   I think these are orthogonal. I might be missing something, but it seems 
like it would not be too hard to cast float16 to float in java/cpp, do the 
comparison in that space, and cast the result back down. This might not be the 
most efficient implementation, but it would be straightforward. It would be 
nice to resolve 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) so the same 
semantics would apply to all floating point numbers.
   
   > The tricky thing will be the implementations. Even though parquet-mr does 
not really care about converting the values according to their logical types, 
we still need to care about the logical types for ordering (min/max values in 
the statistics).
   
   It seems this would require parquet implementations to null out statistics 
for logical types that they don't support; does parquet-mr do that today?
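
For what it's worth, a sketch of the "widen to float and compare there" idea, 
assuming the raw 16-bit IEEE 754 binary16 payload; this is not a parquet-mr 
API, and Float.compare's total order (-0.0 < 0.0, NaN last) is exactly the kind 
of detail PARQUET-1222 would need to pin down:

{code:java}
// Sketch only (no parquet-mr API): widen an IEEE 754 binary16 value to float
// and compare in float space, as suggested above.
public final class HalfFloatCompare {

  /** Widen a raw 16-bit half-float bit pattern to a 32-bit float. */
  static float toFloat(short half) {
    int h = half & 0xFFFF;
    int sign = (h >>> 15) & 0x1;
    int exp = (h >>> 10) & 0x1F;
    int mant = h & 0x3FF;
    if (exp == 0x1F) {                                // infinity or NaN
      return Float.intBitsToFloat((sign << 31) | 0x7F800000 | (mant << 13));
    }
    if (exp == 0) {                                   // zero or subnormal
      float magnitude = mant * (1.0f / (1 << 24));    // mantissa units of 2^-24
      return sign == 0 ? magnitude : -magnitude;
    }
    // Normal number: rebias the exponent from 15 to 127, widen the mantissa.
    return Float.intBitsToFloat((sign << 31) | ((exp + 112) << 23) | (mant << 13));
  }

  /** Total order on half-float bit patterns via Float.compare (NaN sorts last). */
  static int compare(short a, short b) {
    return Float.compare(toFloat(a), toFloat(b));
  }
}
{code}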
   
   
   




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] emkornfield commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2022-08-30 Thread GitBox


emkornfield commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1232420353

   > It is not that trivial. For the half-precision floating point numbers we 
do not have native support in either cpp or java, so we can define the total 
ordering as we want. But we should do the same for the existing floating point 
numbers for which most languages have native support. Even though they follow 
the same standard, the total ordering either does not exist or has different 
implementations. See 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) for details.
   
   I think these are orthogonal. I might be missing something, but it seems 
like it would not be too hard to cast float16 to float in java/cpp, do the 
comparison in that space, and cast the result back down. This might not be the 
most efficient implementation, but it would be straightforward. It would be 
nice to resolve 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) so the same 
semantics would apply to all floating point numbers.
   
   > The tricky thing will be the implementations. Even though parquet-mr does 
not really care about converting the values according to their logical types, 
we still need to care about the logical types for ordering (min/max values in 
the statistics).
   
   It seems this would require parquet implementations to null out statistics 
for logical types that they don't support; does parquet-mr do that today?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (PARQUET-2180) make the default behavior for proto writing not-backwards compatible

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2180:
-
Description: 
https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps 
and lists in a spec compliant way.  however, to not break existing libraries, a 
flag was introduced and defaulted the write behavior to NOT use the specs 
compliant writes.

it's been over 5 years, and people should be really off of it.  so much so, 
that trying to use the new parquet-cli tool to read parquet files generated by 
flink doesn't work b/c it's hard coded to never allow the old style.  the 
deprecated parquet-tools reads these files fine b/c it's the older style.

i started coding up a workaround in flink-parquet and parquet-cli, but stopped. 
 we really should just move on at this point, imho.  protobufs often have 
repeated primitives and maps, so it's more pressing to get proper specs 
compliant support for it now.  we should keep the flag around and let people 
override it back to being backwards compatible though.

i have the code written and can submit a PR if you'd like.

i'm not an expert in parquet though, so i'm unclear as to the deep downstream 
ramifications of this change, so i would love to get feedback in this area.

  was:
https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps 
and lists in a spec compliant way.  however, to not break existing libraries, a 
flag was introduced and defaulted the write behavior to NOT use the specs 
compliant writes.

it's been over 5 years, and people should be really off of it.  so much so, 
that trying to use the new parquet-cli tool to read parquet files generated by 
flink doesn't work b/c it's hard coded to never allow the old style.  the 
deprecated parquet-tools reads these files fine b/c it's the older style.

i started coding up a workaround in flink-parquet and parquet-cli, but stopped. 
 we really should just move on at this point, imho.  protobufs often have 
repeated primitives and maps now, so it just makes sense to move on at this 
point.  we should keep the flag around and let people override it back to being 
backwards compatible though.

i have the code written and can submit a PR if you'd like.

i'm not an expert in parquet though, so i'm unclear as to the deep downstream 
ramifications of this change, so i would love to get feedback in this area.


> make the default behavior for proto writing not-backwards compatible
> 
>
> Key: PARQUET-2180
> URL: https://issues.apache.org/jira/browse/PARQUET-2180
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Reporter: J Y
>Priority: Minor
>
> https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps 
> and lists in a spec compliant way.  however, to not break existing libraries, 
> a flag was introduced and defaulted the write behavior to NOT use the specs 
> compliant writes.
> it's been over 5 years, and people should be really off of it.  so much so, 
> that trying to use the new parquet-cli tool to read parquet files generated 
> by flink doesn't work b/c it's hard coded to never allow the old style.  the 
> deprecated parquet-tools reads these files fine b/c it's the older style.
> i started coding up a workaround in flink-parquet and parquet-cli, but 
> stopped.  we really should just move on at this point, imho.  protobufs often 
> have repeated primitives and maps, so it's more pressing to get proper specs 
> compliant support for it now.  we should keep the flag around and let people 
> override it back to being backwards compatible though.
> i have the code written and can submit a PR if you'd like.
> i'm not an expert in parquet though, so i'm unclear as to the deep downstream 
> ramifications of this change, so i would love to get feedback in this area.
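
For context, a sketch of how the PARQUET-968 flag is flipped today on the write 
side, assuming ProtoWriteSupport.setWriteSpecsCompliant from parquet-protobuf 
and that the Configuration reaches the proto write support (e.g. via 
ParquetOutputFormat in an MR or Flink job); the issue above essentially 
proposes making this the default:

{code:java}
// Sketch under the stated assumptions; surrounding job wiring is omitted.
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.proto.ProtoWriteSupport;

public class EnableSpecCompliantProtoWrites {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Write LIST/MAP groups in the spec-compliant (3-level) layout instead of
    // the legacy layout; PARQUET-2180 proposes flipping this default.
    ProtoWriteSupport.setWriteSpecsCompliant(conf, true);
    return conf;
  }
}
{code}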



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2180) make the default behavior for proto writing not-backwards compatible

2022-08-30 Thread J Y (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J Y updated PARQUET-2180:
-
Description: 
https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps 
and lists in a spec compliant way.  however, to not break existing libraries, a 
flag was introduced and defaulted the write behavior to NOT use the specs 
compliant writes.

it's been over 5 years, and people should be really off of it.  so much so, 
that trying to use the new parquet-cli tool to read parquet files generated by 
flink doesn't work b/c it's hard coded to never allow the old style.  the 
deprecated parquet-tools reads these files fine b/c it's the older style.

i started coding up a workaround in flink-parquet and parquet-cli, but stopped. 
 we really should just move on at this point, imho.  protobufs often have 
repeated primitives and maps now, so it just makes sense to move on at this 
point.  we should keep the flag around and let people override it back to being 
backwards compatible though.

i have the code written and can submit a PR if you'd like.

i'm not an expert in parquet though, so i'm unclear as to the deep downstream 
ramifications of this change, so i would love to get feedback in this area.

  was:
https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps 
and lists in a spec compliant way.  however, to not break existing libraries, a 
flag was introduced and defaulted the write behavior to NOT use the specs 
compliant writes.

it's been over 5 years, and people should be really off of it.  so much so, 
that trying to use the new parquet-cli tool to read parquet files generated by 
flink using doesn't work b/c it's hard coded to never allow the old style.  the 
deprecated parquet-tools reads these files fine b/c it's the older style.

i started coding up a workaround in flink-parquet and parquet-cli, but stopped. 
 we really should just move on at this point, imho.  protobufs often have 
repeated primitives and maps now, so it just makes sense to move on at this 
point.  we should keep the flag around and let people override it back to being 
backwards compatible though.

i have the code written and can submit a PR if you'd like.

i'm not an expert in parquet though, so i'm unclear as to the deep downstream 
ramifications of this change, so i would love to get feedback in this area.


> make the default behavior for proto writing not-backwards compatible
> 
>
> Key: PARQUET-2180
> URL: https://issues.apache.org/jira/browse/PARQUET-2180
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Reporter: J Y
>Priority: Minor
>
> https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps 
> and lists in a spec compliant way.  however, to not break existing libraries, 
> a flag was introduced and defaulted the write behavior to NOT use the specs 
> compliant writes.
> it's been over 5 years, and people should be really off of it.  so much so, 
> that trying to use the new parquet-cli tool to read parquet files generated 
> by flink doesn't work b/c it's hard coded to never allow the old style.  the 
> deprecated parquet-tools reads these files fine b/c it's the older style.
> i started coding up a workaround in flink-parquet and parquet-cli, but 
> stopped.  we really should just move on at this point, imho.  protobufs often 
> have repeated primitives and maps now, so it just makes sense to move on at 
> this point.  we should keep the flag around and let people override it back 
> to being backwards compatible though.
> i have the code written and can submit a PR if you'd like.
> i'm not an expert in parquet though, so i'm unclear as to the deep downstream 
> ramifications of this change, so i would love to get feedback in this area.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2180) make the default behavior for proto writing not-backwards compatible

2022-08-30 Thread J Y (Jira)
J Y created PARQUET-2180:


 Summary: make the default behavior for proto writing not-backwards 
compatible
 Key: PARQUET-2180
 URL: https://issues.apache.org/jira/browse/PARQUET-2180
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-protobuf
Reporter: J Y


https://issues.apache.org/jira/browse/PARQUET-968 introduced supporting maps 
and lists in a spec compliant way.  however, to not break existing libraries, a 
flag was introduced and defaulted the write behavior to NOT use the specs 
compliant writes.

it's been over 5 years, and people should be really off of it.  so much so, 
that trying to use the new parquet-cli tool to read parquet files generated by 
flink using doesn't work b/c it's hard coded to never allow the old style.  the 
deprecated parquet-tools reads these files fine b/c it's the older style.

i started coding up a workaround in flink-parquet and parquet-cli, but stopped. 
 we really should just move on at this point, imho.  protobufs often have 
repeated primitives and maps now, so it just makes sense to move on at this 
point.  we should keep the flag around and let people override it back to being 
backwards compatible though.

i have the code written and can submit a PR if you'd like.

i'm not an expert in parquet though, so i'm unclear as to the deep downstream 
ramifications of this change, so i would love to get feedback in this area.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2179) Add a test for skipping repeated fields

2022-08-30 Thread fatemah (Jira)
fatemah created PARQUET-2179:


 Summary: Add a test for skipping repeated fields
 Key: PARQUET-2179
 URL: https://issues.apache.org/jira/browse/PARQUET-2179
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: fatemah


The existing test only covers non-repeated fields. Add a test for repeated 
fields to make it clear that skipping operates on values, not records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2022-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597667#comment-17597667
 ] 

ASF GitHub Bot commented on PARQUET-758:


gszadovszky commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231323733

   > > It would not be too easy to implement the half-precision floating point 
comparison logic since java does not have such a primitive type.
   > 
   > While not effortless, it should be relatively easy to adapt one of the 
routines that's available from other open source projects, such as Numpy: 
https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190
 (`npy_half` is just an unsigned 16-bit integer in this context)
   
   It is not that trivial. For the half-precision floating point numbers we do 
not have native support in either cpp or java, so we can define the total 
ordering as we want. But we should do the same for the existing floating point 
numbers for which most languages have native support. Even though they follow 
the same standard, the total ordering either does not exist or has different 
implementations. See 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) for details.




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] gszadovszky commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2022-08-30 Thread GitBox


gszadovszky commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231323733

   > > It would not be too easy to implement the half-precision floating point 
comparison logic since java does not have such a primitive type.
   > 
   > While not effortless, it should be relatively easy to adapt one of the 
routines that's available from other open source projects, such as Numpy: 
https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190
 (`npy_half` is just an unsigned 16-bit integer in this context)
   
   It is not that trivial. For the half-precision floating point numbers we do 
not have native support in either cpp or java, so we can define the total 
ordering as we want. But we should do the same for the existing floating point 
numbers for which most languages have native support. Even though they follow 
the same standard, the total ordering either does not exist or has different 
implementations. See 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222) for details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2022-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597650#comment-17597650
 ] 

ASF GitHub Bot commented on PARQUET-758:


pitrou commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231300374

   > It would not be too easy to implement the half-precision floating point 
comparison logic since java does not have such a primitive type.
   
   While not effortless, it should be relatively easy to adapt one of the 
routines that's available from other open source projects, such as Numpy:
   
https://github.com/numpy/numpy/blob/main/numpy/core/src/npymath/halffloat.c#L169-L190
   (`npy_half` is just an unsigned 16-bit integer in this context)
   




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] pitrou commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2022-08-30 Thread GitBox


pitrou commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231300374

   > It would not be too easy to implement the half-precision floating point 
comparison logic since java does not have such a primitive type.
   
   While not effortless, it should be relatively easy to adapt one of the 
routines that's available from other open source projects, such as Numpy:
   
https://github.com/numpy/numpy/blob/main/numpy/core/src/npymath/halffloat.c#L169-L190
   (`npy_half` is just an unsigned 16-bit integer in this context)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2022-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597640#comment-17597640
 ] 

ASF GitHub Bot commented on PARQUET-758:


gszadovszky commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231284535

   > It isn't clear to me if this should be a logical type or a physical type. 
We would need to understand if there is different handling for forward 
compatibility purposes (what do we want the desired behavior to be). I think 
C++ might be lenient here, but I don't know about parquet-mr. @gszadovszky 
thoughts?
   
   I think the basic idea behind having physical and logical types is to 
support forward compatibility, since we can always represent (somehow) a 
long-existing physical type while logical types are getting extended. 
Parquet-mr should work fine with "unknown" logical types by reading them back 
as an un-annotated physical value (a `Binary` with two bytes in this case).
   So, if the community supports having a half-precision floating point type, I 
would vote for specifying it as a logical type.
   
   The tricky thing will be the implementations. Even though parquet-mr does 
not really care about converting the values according to their logical types, 
we still need to care about the logical types for ordering (min/max values in 
the statistics). It would not be too easy to implement the half-precision 
floating point comparison logic since java does not have such a primitive type. 
(BTW the sorting order of floating point numbers is still an open issue: 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222))
   




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] gszadovszky commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2022-08-30 Thread GitBox


gszadovszky commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1231284535

   > It isn't clear to me if this should be a logical type or a physical type. 
We would need to understand if there is different handling for forward 
compatibility purposes (what do we want the desired behavior to be). I think 
C++ might be lenient here, but I don't know about parquet-mr. @gszadovszky 
thoughts?
   
   I think the basic idea behind having physical and logical types is to 
support forward compatibility, since we can always represent (somehow) a 
long-existing physical type while logical types are getting extended. 
Parquet-mr should work fine with "unknown" logical types by reading them back 
as an un-annotated physical value (a `Binary` with two bytes in this case).
   So, if the community supports having a half-precision floating point type, I 
would vote for specifying it as a logical type.
   
   The tricky thing will be the implementations. Even though parquet-mr does 
not really care about converting the values according to their logical types, 
we still need to care about the logical types for ordering (min/max values in 
the statistics). It would not be too easy to implement the half-precision 
floating point comparison logic since java does not have such a primitive type. 
(BTW the sorting order of floating point numbers is still an open issue: 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222))
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org