[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-10-22 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-714828022


   I just pushed updates addressing more comments and squashed the commits into one. This is ready for review now.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-10-21 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-714205196


   I did a push prematurely; that one is not ready for review yet. I will rework it soon and let you know when it is ready for review.







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-10-20 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-713009156


   @gszadovszky @ggershinsky Can you have another look? I just switched to using Configuration.







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-12 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-672999165


   Thanks @ggershinsky @gszadovszky for spending time on the discussion! If both of you think adding a 'metadata' field is not a good idea, I will just use Configuration and see how it goes (majority wins :)). I understand that which way is better is subjective, because both solutions have pros and cons. I am fine with not adding the field now; we can revisit it when there is a need for the Parquet schema to have a metadata property to better align with other schemas' definitions.
   
   @ggershinsky, regarding the RPC-call configuration, I didn't mean it is used in my scenario. It could happen when users build their CryptoPropertiesFactory implementation: they can call some service to get those configurations directly. What I was trying to point out is that there could be many ways to get the crypto settings, and it should be the user's choice.
   








[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-11 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-672148979


   @gszadovszky Thanks for the correction of PARQUET-1784! Regarding serialization/deserialization, it is not done yet; I was aware of that when I used ExtType. But it is something we will need to add later. Actually, it is needed: the use case is that we need to translate the schema inside Parquet files created upstream (e.g. raw data) into downstream Hive ETL metastores like HMS. The lineage of the crypto properties would be broken otherwise. This is actually a reason to add metadata (along with serialization/deserialization) instead of using Configuration.
   
   Creating helper functions helps, but the problem remains that we need a long namespace going all the way down to the column level (which can be nested). Sometimes one job needs to deal with more than one metastore, which requires adding a prefix to the namespace. So to locate a column, we need something like metastore.db.table.column_outerlayer.column_innerlayer.crypto_key. This is not user friendly. Again, other schemas like Avro and Spark already have metadata; I think it would be better to align with them.
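   To illustrate the namespace problem, here is a rough sketch in plain Java. Everything in it is invented for illustration (the `configKey` helper, the example path, the `crypto_key_id` metadata key); none of it is actual parquet-mr API. It contrasts a flat Configuration-style key, where every nesting level must be encoded into one dotted string, with metadata that simply travels with the column:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: flat Configuration-style keys vs. column-attached metadata.
class CryptoKeyAddressing {
    // Flat namespace: metastore prefix + db + table + nested column path + property.
    static String configKey(String metastore, String db, String table, String... columnPath) {
        return metastore + "." + db + "." + table + "."
                + String.join(".", columnPath) + ".crypto_key";
    }

    public static void main(String[] args) {
        // Locating one nested column requires assembling the whole dotted path:
        System.out.println(configKey("hms1", "sales", "orders", "address", "zipcode"));
        // -> hms1.sales.orders.address.zipcode.crypto_key

        // Schema-attached metadata: the map travels with the column itself,
        // so no global namespace or string parsing is needed.
        Map<String, String> columnMetadata = new HashMap<>();
        columnMetadata.put("crypto_key_id", "key-123");
        System.out.println(columnMetadata.get("crypto_key_id"));
    }
}
```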
   







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-11 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-672093542


   @ggershinsky, from what we have discussed above, I think the differing opinions now narrow down to how to transport the settings from the extended ParquetWriteSupport to the CryptoPropertiesFactory implementation: either through 'Configuration' or through the schema. I don't see a difference between these two approaches with regard to the stated goal.
   
   Regarding which way is better, I think it depends. To some extent, adding the encryption properties to the schema is easier and less error-prone, because they sit right next to the schema elements; it seems Gabor got that point. We should let the user choose the option that fits their use case better. Some users could even choose RPC calls instead of the two we talked about; that is already a third option. There could be more, for example loading from configuration files, etc. Again, it should be the user's choice. Column-level properties being part of the column metadata should not be a problem, because Avro and Spark already have that. If the change to add the metadata field were risky in terms of breaking existing code, that would be a concern, but that seems not to be the case. So I don't quite see why we wouldn't want to do it.
   
   We have had several rounds of discussion here. If you still have questions or concerns, I would like to set up a meeting to go through them. Please let me know.







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-10 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-671465375


   
   > But you have other options in the same WriteSupport init function. One of 
them is to use the Hadoop configuration. It is not global anymore, but rather a 
local copy, an object created for the given worker. It still carries the global 
properties, and you can add your crypto properties here; they will be carried 
to the crypto factory.
   
   @ggershinsky, thanks for understanding! As discussed in the third paragraph of my last reply, we do have the option to use 'Configuration', but we prefer to keep the column-level settings within the schema itself; the reasons are explained in the first paragraph. Although it is a local copy, the problems I mentioned are still there. I tried that solution earlier, before arriving at the current one. Using the 'extraMetadata' field is slightly better, but using the schema is better still.
   
   Since we have already defined the EncryptionPropertiesFactory interface, I think we can let users decide how they construct it and through which channel they get the needed information, because in reality there could be many other use cases we don't know about at this point. They should choose the one that best fits their use case; I don't think we should (or can) limit them to just one. As you mentioned, they can use 'Configuration' and 'extraMetadata' today; those are already two. People could also use RPC calls to get the settings, etc.
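   As a rough sketch of this "user's choice of channel" point: the `CryptoSettingsSource` interface and both implementations below are invented for illustration (only EncryptionPropertiesFactory exists in the PR), but they show how different channels can sit behind one contract:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: each deployment picks its own channel for crypto settings.
interface CryptoSettingsSource {
    // Returns e.g. {"crypto_key_id": "..."} for a dotted column path, or empty.
    Map<String, String> settingsFor(String columnPath);
}

// Channel 1: settings carried in a flat Configuration-like string map.
class ConfigurationSource implements CryptoSettingsSource {
    private final Map<String, String> conf;
    ConfigurationSource(Map<String, String> conf) { this.conf = conf; }
    public Map<String, String> settingsFor(String columnPath) {
        Map<String, String> out = new HashMap<>();
        String keyId = conf.get("parquet.crypto." + columnPath + ".key_id");
        if (keyId != null) out.put("crypto_key_id", keyId);
        return out;
    }
}

// Channel 2: settings carried as metadata already attached to the schema column.
class SchemaMetadataSource implements CryptoSettingsSource {
    private final Map<String, Map<String, String>> perColumn;
    SchemaMetadataSource(Map<String, Map<String, String>> perColumn) { this.perColumn = perColumn; }
    public Map<String, String> settingsFor(String columnPath) {
        return perColumn.getOrDefault(columnPath, new HashMap<>());
    }
}
```

A third implementation could make an RPC call instead; the factory consuming the settings would not need to change.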
   
   Having a metadata field at the column level in the schema should, in my opinion, be supported in Parquet; as I have mentioned several times, many other systems' schemas, like Avro and Spark, already have one. If we have column-wise configurations, that would be a good place to set them. I think Gabor already has a PR for column-wise configuration.
   
   I think we are clear on the use cases, the needs, and the changes themselves. Can you share your concern about adding the metadata field? I feel it should be a safe change (the earlier comments with Gabor discussed that). Please let me know what you think.







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-07 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670547099


   > in the meantime, could you send an example of such extended 
ParquetWriteSupport?
   > If CryptoHoodieAvroWriteSupport is the example, could you fix the link 
above, I get a 404 error.
   
   Can you check the link again?







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-06 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670162874


   Thanks Gidon! No problem. We have deployed completely, so there is no hurry for
   reviewing. As long as it is merged upstream, we have a path forward.
   
   For the Configuration-setting approach, that was our initial thinking, but we
   soon found out it requires Parquet applications (we have all three query
   engines, Spark/Hive/Presto, plus in-house Parquet tools) to be changed to set
   the Configuration. So we switched to the extended ParquetWriteSupport in
   plugin mode. That is how we got here today.
   
   On Thu, Aug 6, 2020 at 12:48 PM ggershinsky wrote:
   
   > Thanks @shangxinli  for the detailed
   > response, much appreciated! I want to allocate ample time to get to the
   > bottom of this subject, and with the weekend starting here, it might be a
   > problem in the next few days; but I'll get back on this early next week.
   
   -- 
   Xinli Shang
   







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-06 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670094258


   Hi @ggershinsky, Thanks for your comments! 
   
   I think the current difference is how to transport the crypto settings from the source to the destination (the EncryptionPropertiesFactory implementation). What you suggest is via 'Configuration', while our approach is hybrid: 'Configuration' for global settings and the schema for db/table/column settings. The concern with using 'Configuration' for all the settings is that it is global, so we need a well-designed namespace and protocol (string format) to locate db/table/column (possibly nested), along with other table-wide settings, as the map key; that requires a lot of string operations and is problematic. The hybrid solution is more compact because the metadata is attached to the table/column as a structured field. Consider encrypting the footer for table1 but not for table2, and using AES-GCM for table3 but AES-CTR for table4: with 'Configuration' it is doable, but see how complex the string format becomes as the protocol to convey these settings. And sometimes a job needs to deal with different metastores, which would require a prefix before 'db'.
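   To make the table1..table4 example concrete, here is a rough sketch of what such a flat string protocol might look like. The key names are invented for illustration (they are not actual parquet-mr properties); only the algorithm identifiers AES_GCM_V1 and AES_GCM_CTR_V1 come from the Parquet encryption format:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: per-table footer/algorithm choices forced into flat string keys.
class HybridVsFlatConfig {
    static String lookup(Map<String, String> conf, String tablePath, String prop) {
        // Every lookup re-assembles a dotted path by string concatenation.
        return conf.get("crypto." + tablePath + "." + prop);
    }

    static String demo() {
        Map<String, String> conf = new HashMap<>();
        conf.put("crypto.hms1.db1.table1.encrypt_footer", "true");
        conf.put("crypto.hms1.db1.table2.encrypt_footer", "false");
        conf.put("crypto.hms1.db1.table3.algorithm", "AES_GCM_V1");
        conf.put("crypto.hms1.db1.table4.algorithm", "AES_GCM_CTR_V1");
        return lookup(conf, "hms1.db1.table3", "algorithm");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // AES_GCM_V1
    }
}
```

With schema-attached metadata, each table/column carries its own small map instead, and none of this path-assembly is needed.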
   
   Regarding changes to existing code, I understand that once the settings are in Configuration, no further code changes are needed. But somewhere the settings need to be put into Configuration, and that means code changes in some cases. Suppose somebody (who might not be an engineer) marks a table/column to be encrypted in HMS or another centralized metastore. That setting then needs to be converted into Configuration settings, so a code change is unavoidable, correct? We prefer that those changes happen in a plugin owned by a dedicated team, instead of in the many feature teams' codebases. That is why we extend ParquetWriteSupport and build it as a plugin.
   
   There is an option to put the settings into Configuration, instead of the Parquet schema, in the extended ParquetWriteSupport. As mentioned above, we think schema settings in this case are more compact and less error-prone than Configuration settings.
   
   In my opinion, your solution works best when users create a table and want some columns encrypted from the start: they only need to set it in Configuration. But it is difficult to use if you already have a lot of tables that are not encrypted today and, due to some initiative, need to be. Somebody (perhaps not the table owner; it could be the security team, the legal team, etc.) changes the settings in the metastore for a table/column, and the framework just starts encrypting that table/column. That is why we call it schema-controlled: when the schema defined in the metastore changes, it controls the encryption downstream.
 
   I slightly disagree with your statement that it is not schema-activated. I think it depends on the definition of 'schema'. In our definition, the settings for a table or a column, including name, type, nullability, and crypto settings, are all considered part of the schema. Of course, you can define it another way, but I think the name or definition is not the important part; we should focus on whether or not it makes sense for users.
   
   Essentially this change mainly introduces a map field in the Parquet schema, while all other changes are in tests (or the plugin). As mentioned earlier, if we look at the schema definitions in other systems like Avro and Spark, they all have associated metadata. I think adding that metadata field to the Parquet schema is the right thing to do, to align with other systems' schemas.
   







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-05 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-669333155


   The way it works is that SchemaCryptoPropertiesFactory makes an RPC call to the KMS to get the column key for the keyId/metadata defined in the metastore, and builds up the encryption properties from the column key, keyId/metadata, etc. We don't want to release helper functions like setKey(), because that would require code changes in existing pipelines. Instead, we extend ParquetWriteSupport, which converts the crypto settings in the schema (it could be an Avro schema or another schema) into the Parquet schema's metadata. Then SchemaCryptoPropertiesFactory just consumes that metadata and builds the encryption properties required by PARQUET-1178. This way, SchemaCryptoPropertiesFactory and the extended ParquetWriteSupport are released as a library plugin and can be enabled at the cluster level, so pipelines only need to change the setting to enable it and set the classpath.
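   That flow could be sketched roughly like this. Everything here is a hedged stand-in (the map-based "schema metadata", the in-memory "KMS", the class and method names); it is not the PR's actual code, only the shape of it: the factory reads per-column metadata that an extended WriteSupport set earlier, and resolves keys for encrypted columns.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustration only: a factory-style class that consumes per-column metadata
// (set earlier by an extended WriteSupport) and resolves keys from a "KMS".
class SketchCryptoPropertiesFactory {
    // Stand-in for a KMS RPC: maps keyId -> key bytes (16 bytes for AES-128).
    private final Map<String, byte[]> kms = new HashMap<>();

    SketchCryptoPropertiesFactory() {
        kms.put("key-123", "0123456789012345".getBytes(StandardCharsets.UTF_8));
    }

    // Consume the column's metadata and return the resolved column key.
    byte[] columnKey(Map<String, String> columnMetadata) {
        String keyId = columnMetadata.get("crypto_key_id");
        if (keyId == null) return null;   // column not marked for encryption
        return kms.get(keyId);            // would be an RPC call in reality
    }
}
```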
   
   In the test file SchemaCryptoPropertiesFactory, I hardcoded the key/key metadata because we don't have a real KMS to use. CryptoGroupWriteSupport in the test is an example of extending WriteSupport to set the metadata. In real usage, the metadata can be obtained from the schema, for example the Avro schema for Hudi, or table properties in HMS, etc.; it depends on the Parquet application. Here in the test I hardcoded it for explanatory purposes only. As I mentioned earlier, I have CryptoHoodieAvroWriteSupport here: https://github.com/shangxinli/parquet-write-supports/blob/master/src/main/java/com/uber/hoodie/avro/CryptoHoodieAvroWriteSupport.java, but here I just used CryptoGroupWriteSupport as a demo.
   
   I know it is a little confusing, because in the Parquet code we don't really let the user set a key or key metadata. What we actually need from Parquet is just an extra field to transport the crypto settings, without Parquet knowing the format or content of those settings. Everything else is implemented in the plugin (shown in the test).
   
   I haven't addressed your other comments yet. Let me know if you are OK with this, and then I can start looking at those.







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-04 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-668832445


   I am fine with adding metadata to 'Type' directly. Hopefully there are no tools invoking it via reflection. Since this will be in a major release, it should be fine. Just updated it.







[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-07-31 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-667344642


   > I think, directly using `ExtType` is not really user friendly. As the 
"metadata" for a type is currently used only for encryption I would not expose 
it directly to the user. There are a couple of ways for creating a parquet 
schema (e.g. converting from an avro schema, using the type builders, parsing 
it from a string etc.). So the user needs an easy way to mark the columns to be 
encrypted and set the related encryption key.
   > I would suggest having a utility method that allows to set only the 
required data for an addressed field in a schema. For example: 
`XXXUtility.encryptColumn(MessageType schema, ColumnPath column, String 
encKey)`. The logic behind this utility can be hidden from the user.
   > 
   > I don't really like the decorator pattern. First, it requires a lot of 
boilerplate code to have all the methods being delegated. Second, if the super 
class is modified by having a new method it will not generate any 
errors/warnings if the decorator would not be updated but may cause serious 
issues at runtime.
   > Why do you not want to extend `Type` (or more `PrimitiveType`) directly? 
If you not want to bother the users with this metadata that is not tightly 
connected to the schema you might define it _package private_ and let only the 
mentioned utility reach it.
   > BTW, do we need this _metadata_ for the primitive types or it is enough to 
have a specific object that contains only the required values for encryption?
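   The utility suggested in the quote above could look roughly like this. This is a hedged sketch: a plain map stands in for the real MessageType/ColumnPath types, the class name and the `crypto_key_id` metadata key are invented, and `encryptColumn` is just the quoted hypothetical signature, not an existing parquet-mr API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hide the metadata plumbing behind one call, as the
// reviewer suggests, so users never touch ExtType directly.
class EncryptionUtility {
    // Stand-in for a schema: dotted column path -> per-column metadata map.
    static void encryptColumn(Map<String, Map<String, String>> schema,
                              String columnPath, String encKeyId) {
        schema.computeIfAbsent(columnPath, k -> new HashMap<>())
              .put("crypto_key_id", encKeyId);
    }

    static String keyIdOf(Map<String, Map<String, String>> schema, String columnPath) {
        Map<String, String> md = schema.get(columnPath);
        return md == null ? null : md.get("crypto_key_id");
    }
}
```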
   
   Thanks for the comments! The idea is to piggyback on the existing WriteSupport's schema conversion from another type's schema (e.g. Avro) to the Parquet schema, to pass down the crypto settings. This way, when we build the FileEncryptionProperties object, we don't need RPC calls to fetch it from somewhere. Here is an example with Apache Hudi's WriteSupport: https://github.com/shangxinli/parquet-write-supports/blob/master/src/main/java/com/uber/hoodie/avro/CryptoHoodieAvroWriteSupport.java. It extends HoodieAvroWriteSupport to translate the crypto settings from the Avro schema's metadata to the Parquet schema's metadata. The CryptoPropertiesFactory objects can use it like this: https://github.com/apache/parquet-mr/pull/808/commits/9adb75a2356147a204d76c82eb39f43e9ee72b58#diff-886dda84b09a5aaf0bf7c1e718c1be02R61
   
   We have built the extended WriteSupport classes for each system (Hudi, Spark, Hive), along with the CryptoPropertiesFactory implementation classes, into a library. That library is deployed cluster-wide, and all jobs get it without developers changing their code. The table owner only needs to change the schema settings in a metastore like HMS (or another centralized metastore) by tagging a column to be encrypted. For a company that has a lot of existing tables to onboard to this feature, this is much easier. We also reached an agreement with ORC on defining the format of the encryption settings: https://issues.apache.org/jira/browse/HIVE-21848.
   
   What we need from the Parquet schema is a place to keep the metadata. A schema definition usually has such a place (e.g. the Avro schema, or Spark's StructField), but I don't find that the Parquet schema has one. It could also be used for purposes beyond encryption.
   
   I agree with you about the decorator pattern. But I think adding metadata to the Type class directly is risky; I am afraid of breaking existing functionality, since it is so widely used. Your concern about changing the Type class and forgetting to change the ExtType class is valid, but in this particular case ExtType only adds new members and references existing methods without changing them. If the behavior of Type's methods changes, ExtType shouldn't need to change; and if methods are added to or deleted from Type, there will be a build error, so people will know to fix it.
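   As a rough sketch of the extend-vs-decorate point ('BaseType' below is an invented stand-in for parquet's real Type class, which is not reproduced here): the subclass only adds a metadata map and inherits everything else, so there is no delegation boilerplate, and a removed or renamed superclass method surfaces as a compile error rather than a silent runtime gap.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustration only: extend the type class instead of decorating it.
class BaseType {                          // stand-in for parquet's Type
    private final String name;
    BaseType(String name) { this.name = name; }
    String getName() { return name; }
}

class ExtTypeSketch extends BaseType {
    // The only addition: an opaque metadata map for crypto (or other) settings.
    private Map<String, String> metadata = Collections.emptyMap();
    ExtTypeSketch(String name) { super(name); }
    void setMetadata(Map<String, String> metadata) { this.metadata = metadata; }
    Map<String, String> getMetadata() { return metadata; }
}
```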


