[GitHub] [parquet-mr] maccamlc commented on pull request #798: PARQUET-1879 MapKeyValue is not a valid Logical Type

2020-06-29 Thread GitBox


maccamlc commented on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-651009716


   > @maccamlc,
   > 
   > The main problem I think is that the spec does not say anything about how 
the thrift objects shall be used. The specification is about the semantics of 
the schema and it is described using the parquet schema _language_. But, in the 
file there is no such _language_, we only have [thrift 
objects](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
   > When the specification says something about the _logical types_ (e.g. 
`MAP`) it does not say anything about which thrift structure should be used 
(the converted type 
[`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L53)
 or the logical type 
[`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L324)).
   > We added the new logical type structures in the thrift to support enhanced 
ways to specify _logical types_ (e.g. 
[`TimeStampType`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L272)).
 The idea for backward compatibility was to write the old converted types 
wherever it makes sense (where the semantics of the actual _logical type_ are 
the same as before) along with the new logical type structures. So, regarding 
`MAP_KEY_VALUE`, I think we should write it in the correct place if it was 
written before (prior to `1.11.0`) and it helps other readers, but we should 
not expect it to be there.
   > 
   > Cheers,
   > Gabor
   
   Sounds good, @gszadovszky. Thanks for the clarification.
   
   Therefore, depending on any other comments from other reviewers, it seems 
this PR is still ready to merge as-is :)
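
   As a rough illustration of the dual-annotation idea quoted above, here is a 
minimal sketch. The enums are simplified stand-ins, not the classes generated 
from `parquet.thrift`, and `DualAnnotationSketch` and its method names are 
hypothetical. It only shows that some logical types have an old converted-type 
equivalent to write alongside them, while `MAP_KEY_VALUE` exists on the 
converted-type side only.

```java
import java.util.Optional;

final class DualAnnotationSketch {
    private DualAnnotationSketch() {}

    /** Simplified subset of the old converted types (stand-in, not the thrift enum). */
    enum Converted { UTF8, MAP, MAP_KEY_VALUE, LIST, ENUM }

    /** Simplified subset of the new logical types (stand-in, not the thrift union). */
    enum Logical { STRING, MAP, LIST, ENUM, UUID }

    /** Converted type a writer could emit next to the logical type, if an equivalent exists. */
    static Optional<Converted> legacyEquivalent(Logical type) {
        switch (type) {
            case STRING: return Optional.of(Converted.UTF8);
            case MAP:    return Optional.of(Converted.MAP);
            case LIST:   return Optional.of(Converted.LIST);
            case ENUM:   return Optional.of(Converted.ENUM);
            default:     return Optional.empty(); // e.g. UUID has no converted-type counterpart
        }
    }

    /** True when the converted type has a counterpart among the logical types. */
    static boolean hasLogicalCounterpart(Converted type) {
        // MAP_KEY_VALUE is the one value in this subset without a logical-type twin.
        return type != Converted.MAP_KEY_VALUE;
    }
}
```

   In this model a reader treats `MAP_KEY_VALUE` as an optional hint: it may be 
present in older files, but nothing depends on finding it.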



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] maccamlc commented on pull request #798: PARQUET-1879 MapKeyValue is not a valid Logical Type

2020-06-27 Thread GitBox


maccamlc commented on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-650527774


   @gszadovszky before this gets merged, I just wanted to clarify something 
myself after looking more into the format spec, that might tidy this issue up 
further.
   
   * Is MAP_KEY_VALUE required to still be written as the Converted Type when 
creating new files?
   
   From what I could see from some older issues and the 
[backwards-compatibility 
rules](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1)
 it seems to have always been an optional type, and also used incorrectly in 
the past.
   
   It appears that older versions of Parquet would be able to read the Map type 
in the schema without MAP_KEY_VALUE.
   
   If that is true, I would probably suggest pushing this [additional 
commit](https://github.com/maccamlc/parquet-mr/commit/3f774d123997a4c63631185ca409550ca03b960d)
 that I tested, onto this PR.
   
   It would mean that any unexpected uses of LogicalType.MAP_KEY_VALUE would 
result in UNKNOWN being written to the file. But it is removed from the 
ConversionPatterns path, meaning that my case of this occurring when converting 
an Avro schema is still fixed, and tested.
   
   Let me know if you believe this might be the preferred fix, or if what I 
have already done is better.
   
   Thanks
   Matt







[GitHub] [parquet-mr] maccamlc commented on pull request #798: PARQUET-1879 MapKeyValue is not a valid Logical Type

2020-06-25 Thread GitBox


maccamlc commented on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-649482253


   > Thank you for creating the backward compatibility test for Map. It should 
have existed already.
   > Unfortunately, this way you do not properly test backward compatibility. 
The problem is you cannot generate an "old" file with the "new" library. To be 
more precise, the message parser is more for convenience and is not used while 
reading/writing a parquet file. When you say you are testing the converted 
type, it is not really true because the parser tries to read logical types in 
the first place. Also, the parquet writer writes both logical types and 
converted types, so you cannot validate old files that have only converted types.
   > I would suggest adding tests that cover the examples in the 
[spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1)
 by creating the thrift-generated format objects and converting them with 
`ParquetMetadataConverter`, just like you did in 
`TestParquetMetadataConverter.testMapLogicalType`. Maybe these tests would fit 
better in that class as well.
   > I should have described this before you implemented this test. I 
am sorry about that.
   > 
   > Please don't force push your changes because it makes it harder to track 
the review. The committer will squash the PR before merging it anyway.
   
   
   
   Apologies for the force push. Good to know that the PR gets squashed on merge.
   
   And thanks for the detailed reply. 
   
   I think I got it this time :) 
   
   The tests were moved into TestParquetMetadataConverter, and the old-format 
test now builds the metadata from Thrift SchemaElements.
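
   To illustrate what building the schema from Thrift elements involves, here 
is a self-contained sketch. It does not use the real thrift-generated 
`SchemaElement` or `ParquetMetadataConverter`; `Element`, `Node`, and 
`FlatSchemaSketch` are hypothetical stand-ins. It relies on one property of the 
file format: the schema is stored as a flat, depth-first list in which each 
element carries its number of children. The example list is an assumed 
"old writer" map layout with converted types only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class FlatSchemaSketch {
    private FlatSchemaSketch() {}

    /** Hypothetical stand-in for a thrift schema element. */
    static final class Element {
        final String name;
        final int numChildren;
        final String convertedType; // null when the writer set no annotation

        Element(String name, int numChildren, String convertedType) {
            this.name = name;
            this.numChildren = numChildren;
            this.convertedType = convertedType;
        }
    }

    /** Tree node rebuilt from the flat list. */
    static final class Node {
        final Element element;
        final List<Node> children = new ArrayList<>();

        Node(Element element) {
            this.element = element;
        }
    }

    /** Rebuilds the schema tree from a depth-first (pre-order) element list. */
    static Node toTree(List<Element> flat) {
        Iterator<Element> it = flat.iterator();
        Node root = read(it);
        if (it.hasNext()) {
            throw new IllegalArgumentException("trailing schema elements");
        }
        return root;
    }

    private static Node read(Iterator<Element> it) {
        Node node = new Node(it.next());
        for (int i = 0; i < node.element.numChildren; i++) {
            node.children.add(read(it)); // each child consumes its own subtree
        }
        return node;
    }

    /** Assumed old-style map column: converted types only, repeated group named "map". */
    static List<Element> oldStyleMapSchema() {
        List<Element> flat = new ArrayList<>();
        flat.add(new Element("schema", 1, null));
        flat.add(new Element("my_map", 1, "MAP"));
        flat.add(new Element("map", 2, "MAP_KEY_VALUE"));
        flat.add(new Element("key", 0, "UTF8"));
        flat.add(new Element("value", 0, null));
        return flat;
    }
}
```

   Because the elements are constructed directly, the test input can carry only 
converted types, which the schema parser and the new writer cannot produce.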
   
   Regards,
   Matt







[GitHub] [parquet-mr] maccamlc commented on pull request #798: PARQUET-1879 MapKeyValue is not a valid Logical Type

2020-06-24 Thread GitBox


maccamlc commented on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-649202469


   > Thanks for working on this.
   > 
   > You have changed every naming from `"map"` to `"key_value"` in the tests. 
This is good for the expected data but we should keep testing `"map"` at the 
read path as well. Based on the 
[spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1)
 it is still acceptable.
   > 
   > I am not an expert in this topic so I would be happy if someone else also 
could review this.
   
   @gszadovszky no problem. I have tried to add a test to verify the 
backwards-compatible reading. Added TestReadWriteMapKeyValue to the commit. 
   
   Not sure if this is the correct way, but it parses one schema that takes the 
logical type path, with key_value and no MAP_KEY_VALUE type, then another with 
map and with the MAP_KEY_VALUE type. 
   
   From what I can tell the name is not actually verified anywhere (I tried 
with random name value too :) ), but both test paths are successful.
   
   Hopefully it's ok, but let me know if I might need to go a bit deeper 
somewhere else.
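
   The observation that the inner group's name is not verified can be pictured 
with a small structural check. This is only an illustrative sketch, not 
parquet-mr's actual read path: `MapShapeSketch`, `Field`, and `looksLikeMap` 
are hypothetical names, and the rule encoded here (a MAP-annotated group with 
exactly one repeated child holding a key and an optional value) is a simplified 
reading of the spec.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class MapShapeSketch {
    private MapShapeSketch() {}

    /** Hypothetical schema field: a leaf when it has no children. */
    static final class Field {
        final String name;
        final boolean repeated;
        final String annotation; // null when unannotated
        final List<Field> fields;

        Field(String name, boolean repeated, String annotation, List<Field> fields) {
            this.name = name;
            this.repeated = repeated;
            this.annotation = annotation;
            this.fields = fields;
        }
    }

    static Field leaf(String name) {
        return new Field(name, false, null, Collections.<Field>emptyList());
    }

    /** Builds a MAP-annotated group whose repeated inner group has the given name. */
    static Field mapWithInnerName(String innerName) {
        Field inner = new Field(innerName, true, null,
                Arrays.asList(leaf("key"), leaf("value")));
        return new Field("my_map", false, "MAP", Collections.singletonList(inner));
    }

    /** Accepts the map by shape alone; the inner group's name is never inspected. */
    static boolean looksLikeMap(Field group) {
        if (!"MAP".equals(group.annotation) || group.fields.size() != 1) {
            return false;
        }
        Field inner = group.fields.get(0);
        return inner.repeated && (inner.fields.size() == 1 || inner.fields.size() == 2);
    }
}
```

   Under this check `key_value`, `map`, and an arbitrary name all pass, which 
matches what the two test paths showed.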


