[jira] [Updated] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinli Shang updated PARQUET-1792:
---------------------------------
    Fix Version/s: 1.12.0

> Add 'mask' command to parquet-tools/parquet-cli
> -----------------------------------------------
>
>                 Key: PARQUET-1792
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1792
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being
> pruned (PARQUET-1791). We need a tool to replace the raw data columns with
> masked values. The masked value could be a hash, null, a redacted value, etc.
> The unchanged columns should be moved as a whole, as the 'merge' and 'prune'
> commands in parquet-tools do.
>
> Implementing this feature at the file-format level is 10X faster than
> rewriting the table data in the query engine.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
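The masking options named in the description (hash, null, redact) can be illustrated with a small sketch. This is not the parquet-tools or parquet-cli implementation: the names MaskMode, ColumnMasking, maskValue and maskRow are hypothetical, SHA-256 is an assumed hash choice (the issue does not name one), and the sketch works on in-memory rows rather than copying untouched column chunks at the file level as the issue proposes.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical mask modes, named after the options listed in the issue.
sealed trait MaskMode
case object Hash extends MaskMode
case object Nullify extends MaskMode
case object Redact extends MaskMode

object ColumnMasking {
  // Hex-encoded SHA-256 digest. The hash algorithm is an assumption.
  def sha256Hex(value: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Apply one mask mode to a single non-null value.
  def maskValue(value: String, mode: MaskMode): Option[String] = mode match {
    case Hash    => Some(sha256Hex(value))
    case Nullify => None
    case Redact  => Some("*" * value.length)
  }

  // Mask only the columns named in `masks`; every other column passes
  // through unchanged. Cells that are already null (None) stay null.
  def maskRow(
      row: Map[String, Option[String]],
      masks: Map[String, MaskMode]): Map[String, Option[String]] =
    row.map { case (column, cell) =>
      masks.get(column) match {
        case Some(mode) => column -> cell.flatMap(maskValue(_, mode))
        case None       => column -> cell
      }
    }
}
```

For example, ColumnMasking.maskRow(row, Map("email" -> Hash, "ssn" -> Redact)) would hash the "email" cell, replace each character of the "ssn" cell with "*", and leave all other columns as they were.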
[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373000#comment-17373000 ]

Sudhakar J Pyndi commented on PARQUET-1792:
-------------------------------------------

Could you please confirm the target version for this feature? Thank you

> Add 'mask' command to parquet-tools/parquet-cli
> -----------------------------------------------
>
>                 Key: PARQUET-1792
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1792
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>
> Some personal data columns need to be masked instead of being
> pruned (PARQUET-1791). We need a tool to replace the raw data columns with
> masked values. The masked value could be a hash, null, a redacted value, etc.
> The unchanged columns should be moved as a whole, as the 'merge' and 'prune'
> commands in parquet-tools do.
>
> Implementing this feature at the file-format level is 10X faster than
> rewriting the table data in the query engine.
[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372862#comment-17372862 ]

Xinli Shang commented on PARQUET-1681:
--------------------------------------

We chose to revert the behavior back to that of 1.8.1. It has run fine for a year or so. We will port the changes soon.

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1681
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1681
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-avro
>    Affects Versions: 1.10.0, 1.9.1, 1.11.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Critical
>
> When the Avro schema below is used to write a Parquet (1.8.1) file and the
> file is then read back with Parquet 1.10.1 without passing any schema, the
> read throws the exception "XXX is not a group". Reading with Parquet 1.8.1
> is fine.
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           {
>             "name": "phone_number",
>             "type": [ "null", "string" ],
>             "default": null
>           }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code used to read the file is as follows:
> import org.apache.hadoop.conf.Configuration
> import org.apache.parquet.avro.AvroParquetReader
> val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
> reader.read()
> PARQUET-651 changed isElementType() to rely on Avro's
> checkReaderWriterCompatibility() for the compatibility check. However,
> checkReaderWriterCompatibility() considers the Parquet schema and the
> Avro schema (converted from the file schema) incompatible: the record name
> is ‘phones_items’ in the Avro schema but ‘array’ in the Parquet schema.
> isElementType() therefore returns false, which causes the “phone_number”
> field in the schema above to be treated as a group type, which it is not.
> The exception is then thrown at .asGroupType().
> I have not tried whether writing with Parquet 1.10.1 reproduces the same
> problem. It could, because the translation from Avro schema to Parquet
> schema has not changed (not verified yet).
> I hesitate to revert PARQUET-651 because it solved several problems. I would
> like to hear the community's thoughts on it.
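The name mismatch described above can be modelled in a few lines. The sketch below is a simplified illustration, not Avro's checkReaderWriterCompatibility() and not parquet-avro's isElementType(): SchemaNode, Primitive, Record and ElementTypeSketch are hypothetical stand-ins, and the nullable union around "phone_number" is reduced to a plain string. The shape-only check is included purely for contrast and is not a proposed fix.

```scala
// Simplified stand-ins for schema nodes. These are NOT Avro or Parquet classes.
sealed trait SchemaNode
final case class Primitive(typeName: String) extends SchemaNode
final case class Record(name: String, fields: Map[String, SchemaNode]) extends SchemaNode

object ElementTypeSketch {
  // Name-sensitive check: two records match only if their names are equal
  // and every reader field has a compatible writer field of the same name.
  def compatibleByName(reader: SchemaNode, writer: SchemaNode): Boolean =
    (reader, writer) match {
      case (Primitive(r), Primitive(w)) => r == w
      case (Record(rName, rFields), Record(wName, wFields)) =>
        rName == wName && fieldsMatch(rFields, wFields, compatibleByName)
      case _ => false
    }

  // Shape-only check, for contrast: the same rule with record names ignored.
  def compatibleByShape(reader: SchemaNode, writer: SchemaNode): Boolean =
    (reader, writer) match {
      case (Primitive(r), Primitive(w)) => r == w
      case (Record(_, rFields), Record(_, wFields)) =>
        fieldsMatch(rFields, wFields, compatibleByShape)
      case _ => false
    }

  // Every reader field must exist in the writer and pass the given check.
  private def fieldsMatch(
      readerFields: Map[String, SchemaNode],
      writerFields: Map[String, SchemaNode],
      check: (SchemaNode, SchemaNode) => Boolean): Boolean =
    readerFields.forall { case (name, readerType) =>
      writerFields.get(name).exists(writerType => check(readerType, writerType))
    }

  // The two element records from the issue: same field, different names.
  val phoneFields: Map[String, SchemaNode] =
    Map("phone_number" -> Primitive("string"))
  val fromAvroSchema: SchemaNode = Record("phones_items", phoneFields)
  val fromFileSchema: SchemaNode = Record("array", phoneFields)
}
```

In this model, compatibleByName(fromAvroSchema, fromFileSchema) is false only because the record names differ, which mirrors the reason the issue gives for isElementType() returning false. The shape-only variant accepts the same pair.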