[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables
[ https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130892#comment-14130892 ] Sergio Peña commented on HIVE-5207: ---

I see this is a full encryption solution, which looks pretty good, but since HDFS encryption is coming soon (HDFS-6134) I'd like to make some improvements so that Hive works with it. This is meant to be compatible with HDFS encryption only. It still lacks the ability to encrypt tables on the fly through Hive statements. See HIVE-8065.

Support data encryption for Hive tables
---
Key: HIVE-5207
URL: https://issues.apache.org/jira/browse/HIVE-5207
Project: Hive
Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
Labels: Rhino
Attachments: HIVE-5207.patch, HIVE-5207.patch
Original Estimate: 504h
Remaining Estimate: 504h

For sensitive and legally protected data such as personal information, it is common practice to store the data encrypted in the file system. Enabling Hive to store and query encrypted data is crucial for Hive data analysis in the enterprise.

When creating a table, the user can specify whether it is encrypted by setting a property in TBLPROPERTIES. Once an encrypted table is created, querying it is transparent as long as the corresponding key management facilities are set up in the query's running environment. We can use the Hadoop crypto support provided by HADOOP-9331 for the underlying data encryption and decryption.

As to key management, we would support several common key management use cases. First, the table key (data key) can be stored in the Hive metastore, associated with the table through its properties. The table key can be explicitly specified or auto-generated, and will be encrypted with a master key. In cases where the data being processed is generated by other applications, we need to support externally managed or imported table keys. Also, the data generated by Hive may be consumed by other applications in the system, so we need a tool or command for exporting the table key to a Java keystore for external use.

To handle versions of Hadoop that do not have crypto support, we can avoid compilation problems by segregating crypto API usage into separate files (shims) to be included only if a flag is defined on the Ant command line (something like -Dcrypto=true).

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
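The two-tier scheme in the description - a per-table key that is wrapped by a master key before being stored in the metastore - can be sketched with standard JCE key wrapping. This is an illustrative sketch, not code from the patch; the class and method names are invented here:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class TwoTieredKeySketch {
    // Wrap (encrypt) a per-table data key under an externally managed master key.
    static byte[] wrapTableKey(SecretKey masterKey, SecretKey tableKey) throws Exception {
        Cipher cipher = Cipher.getInstance("AESWrap"); // RFC 3394 key wrapping
        cipher.init(Cipher.WRAP_MODE, masterKey);
        return cipher.wrap(tableKey);
    }

    // Recover the table key at query time, given the same master key.
    static SecretKey unwrapTableKey(SecretKey masterKey, byte[] wrapped) throws Exception {
        Cipher cipher = Cipher.getInstance("AESWrap");
        cipher.init(Cipher.UNWRAP_MODE, masterKey);
        return (SecretKey) cipher.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        SecretKey masterKey = gen.generateKey(); // held outside the metastore
        SecretKey tableKey = gen.generateKey();  // stored only in wrapped form

        byte[] wrapped = wrapTableKey(masterKey, tableKey);
        SecretKey recovered = unwrapTableKey(masterKey, wrapped);
        System.out.println(java.util.Arrays.equals(
                tableKey.getEncoded(), recovered.getEncoded()));
    }
}
```

Only the wrapped bytes would ever reach the metastore; without the master key, they cannot be turned back into a usable table key.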
[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables
[ https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794816#comment-13794816 ] Jerry Chen commented on HIVE-5207: --

{quote}This patch won't compile, because Hive has to work when used with Hadoop 1.x. The shims are used to support multiple versions of Hadoop (Hadoop 0.20, Hadoop 1.x, Hadoop 0.23, Hadoop 2.x) depending on what is installed on the host system.{quote}

This patch depends on the crypto feature added by HADOOP-9331 and others. For this patch to compile with Hadoop versions that do not include the crypto feature, we rely on the -Dcrypto=true flag: when it is not set, the crypto code is excluded. I do understand that this approach still doesn't align with the goal of supporting multiple Hadoop versions from a single compile of Hive.

{quote}Furthermore, this seems like the wrong direction. What is the advantage of this rather large patch over using the cfs work? If the user defines a table in cfs, all of the table's data will be encrypted.{quote}

I agree that the cfs work has value in its API transparency, and it is good stuff. We are working on CFS, but it is not yet available. Meanwhile, the work here is already in use by users who rely on it to protect sensitive data on their clusters while transparently decrypting that data in the jobs that process it. On the other hand, we see that compression codecs are already widely used with the various file formats Hive supports. The issue is that the current approach depends on changes to specific file formats to handle encryption key contexts. One possible direction is to make the encryption codec behave exactly like a compression codec, so that Hive can use a codec for encryption or decryption without any changes to file formats or to Hive itself. If we can do that, it adds the same kind of value as a compression codec.
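The codec idea above - encryption that wraps streams the way a compression codec does, so file formats never need to know about it - could look roughly like the following. This is a hypothetical sketch using plain JCE streams, not the actual Hadoop codec interfaces:

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.InputStream;
import java.io.OutputStream;

public class EncryptionCodecSketch {
    // Wrap an output stream with encryption, the way a compression codec
    // wraps a stream, so the file format code is unchanged.
    static OutputStream encrypting(OutputStream raw, SecretKey key, byte[] iv) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherOutputStream(raw, c);
    }

    // Matching input-side wrapper for reads.
    static InputStream decrypting(InputStream raw, SecretKey key, byte[] iv) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherInputStream(raw, c);
    }

    public static void main(String[] args) throws Exception {
        javax.crypto.KeyGenerator g = javax.crypto.KeyGenerator.getInstance("AES");
        g.init(128);
        SecretKey key = g.generateKey();
        byte[] iv = new byte[16]; // a real codec would generate and persist a fresh IV

        java.io.ByteArrayOutputStream buf = new java.io.ByteArrayOutputStream();
        try (OutputStream out = encrypting(buf, key, iv)) {
            out.write("row1,alice\n".getBytes("UTF-8"));
        }
        try (InputStream in = decrypting(
                new java.io.ByteArrayInputStream(buf.toByteArray()), key, iv)) {
            System.out.print(new String(in.readAllBytes(), "UTF-8"));
        }
    }
}
```

The appeal of this shape is exactly what the comment argues: the writer and reader only see `OutputStream`/`InputStream`, so any file format that already supports codec-wrapped streams gets encryption for free.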
[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables
[ https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793897#comment-13793897 ] Jerry Chen commented on HIVE-5207: --

Hi Larry, thanks for pointing out the docs. Yes, we will add more javadocs and documentation as our next piece of work.

{quote}1. TwoTieredKey - exactly the purpose, how it's used what the tiers are, etc{quote}

TwoTieredKey is used for the case where the table key is stored in the Hive metastore. The table key is encrypted with the master key, which is provided externally. In this case the user maintains and manages only the master key externally, rather than managing all the table keys externally. This is useful when no full-fledged key management system is available.

{quote}2. External KeyManagement integration - where and what is the expected contract for this integration{quote}

To integrate with an external key management system, we use the KeyProvider interface from HADOOP-9331. An implementation of the KeyProvider interface for a given key management system can be configured as the KeyProvider used to retrieve keys.

{quote}3. A specific usecase description for exporting keys into an external keystore and who has the authority to initiate the export and where the password comes from{quote}

Exporting the internal keys is done through the Hive command line. As the internal table keys are encrypted with the master key, the master key must be available in the environment when performing the export, and that environment is controlled by the user. If the master key is not available, the encrypted table keys cannot be decrypted and thus cannot be exported. The KeyProvider implementation that retrieves the master key can supply its own authentication and authorization to decide whether the current user has access to a specific key.

{quote}4. An explanation as to why we should ever store the key with the data which seems like a bad idea.
I understand that it is encrypted with the master secret - which takes me to the next question.{quote}

Strictly speaking, the key is not stored with the data; the table key is stored in the Hive metastore. I see your point on this question. As mentioned, it is useful for use cases where no full-fledged, ready-to-use key management system is available, and we provide several alternatives for managing keys. When creating an encrypted table, the user can specify whether the key is managed externally or internally. For externally managed keys, only the key name (alias) is stored in the Hive metastore, and the key is retrieved through the KeyProvider set in the configuration.

{quote}5. Where is the master secret established and stored and how is it protected{quote}

Currently, we assume that the user manages the master key. For example, for simple use cases, the user can store the master key in a Java keystore protected by a password, kept in a folder readable only by specific users or groups. The user can also store the master key in another key management system, as the master key is retrieved through the KeyProvider. Really appreciate your time reviewing this. Thanks
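The simple setup described above - a master key kept in a password-protected Java keystore and looked up at query time - can be sketched as follows. The alias `hive.master.key` and the file name are made up for illustration:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.security.KeyStore;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class MasterKeyStoreSketch {
    // Store a master key in a password-protected JCEKS keystore file.
    static void storeKey(String path, String alias, SecretKey key, char[] pass) throws Exception {
        KeyStore ks = KeyStore.getInstance("JCEKS"); // plain JKS cannot hold secret keys
        ks.load(null, pass);
        ks.setEntry(alias, new KeyStore.SecretKeyEntry(key),
                new KeyStore.PasswordProtection(pass));
        try (FileOutputStream out = new FileOutputStream(path)) {
            ks.store(out, pass);
        }
    }

    // Retrieve it later, as a query-time KeyProvider-style lookup might.
    static SecretKey loadKey(String path, String alias, char[] pass) throws Exception {
        KeyStore ks = KeyStore.getInstance("JCEKS");
        try (FileInputStream in = new FileInputStream(path)) {
            ks.load(in, pass);
        }
        return (SecretKey) ks.getKey(alias, pass);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        SecretKey masterKey = gen.generateKey();
        char[] pass = "changeit".toCharArray();

        storeKey("master.jceks", "hive.master.key", masterKey, pass);
        SecretKey recovered = loadKey("master.jceks", "hive.master.key", pass);
        System.out.println(java.util.Arrays.equals(
                masterKey.getEncoded(), recovered.getEncoded()));
    }
}
```

File-system permissions on the keystore file, plus the keystore password, are what restrict the master key to specific users or groups in this scheme.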
[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables
[ https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789356#comment-13789356 ] Owen O'Malley commented on HIVE-5207: -

This patch won't compile, because Hive has to work when used with Hadoop 1.x. The shims are used to support multiple versions of Hadoop (Hadoop 0.20, Hadoop 1.x, Hadoop 0.23, Hadoop 2.x) depending on what is installed on the host system.

Furthermore, this seems like the wrong direction. What is the advantage of this rather large patch over using the cfs work? If the user defines a table in cfs, all of the table's data will be encrypted.
[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables
[ https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777416#comment-13777416 ] Larry McCay commented on HIVE-5207: ---

Hi Jerry - I have taken a high-level look through the patch. Lots of good stuff there - good work! A couple of things I would like to see more javadocs on, and perhaps a document that describes the use cases:

1. TwoTieredKey - exactly the purpose, how it's used, what the tiers are, etc.
2. External KeyManagement integration - where and what is the expected contract for this integration
3. A specific use case description for exporting keys into an external keystore, who has the authority to initiate the export, and where the password comes from
4. An explanation as to why we should ever store the key with the data, which seems like a bad idea. I understand that it is encrypted with the master secret - which takes me to the next question. :)
5. Where is the master secret established and stored, and how is it protected

There is a minor typo/spelling error that you probably want to fix now rather than later:

+public interface HiveKeyResolver {
+  void init(Configuration conf) throws CryptoException;
+
+  /**
+   * Resolve the key meta information of a table
+   * @param tableDesc The table descriptor
+   */
+  KeyMeta resovleKey(TableDesc tableDesc);
+}

Change resovleKey to resolveKey here, and in the interface implementation and the consumer of the method - I think there were 3 instances. Again, nice work here! Let's get some higher-level descriptions into code javadocs and/or separate documents. Thanks!
[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables
[ https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770520#comment-13770520 ] Jerry Chen commented on HIVE-5207: --

HIVE-4227 is specifically about adding column-level encryption for ORC files. As we all know, Hive tables support various other formats, such as text files, sequence files, RC files, and Avro files. HIVE-5207 targets these file formats, treating encryption support as a common problem. As to key management, from the user's perspective one rational approach is one table key per encrypted table. In this model, it is natural to associate the key with TBLPROPERTIES.
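The "one table key per encrypted table, associated via TBLPROPERTIES" idea might look like the sketch below. The property names are hypothetical; the actual patch may use different ones:

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class TablePropertiesKeySketch {
    // Build the table-properties entries for an encrypted table. The wrapped
    // key bytes are assumed to already be encrypted under the master key.
    static Map<String, String> buildProperties(String keyName, byte[] wrappedTableKey) {
        Map<String, String> tblProperties = new HashMap<>();
        // Hypothetical property names, invented for this illustration.
        tblProperties.put("hive.encrypt.enable", "true");
        tblProperties.put("hive.encrypt.keyname", keyName);
        tblProperties.put("hive.encrypt.wrappedkey",
                Base64.getEncoder().encodeToString(wrappedTableKey));
        return tblProperties;
    }

    public static void main(String[] args) {
        byte[] wrapped = new byte[16]; // stand-in for a master-key-wrapped table key
        Map<String, String> props = buildProperties("sales_table_key", wrapped);
        System.out.println(props.get("hive.encrypt.keyname"));
    }
}
```

For an externally managed key, only the `keyname` entry would be present, and the key bytes would come from the configured KeyProvider instead of the metastore.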
[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables
[ https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757433#comment-13757433 ] Larry McCay commented on HIVE-5207: ---

This seems to be a duplicate of HIVE-4227. I am actually in the process of working on that functionality and plan to leverage HADOOP-9331 as appropriate. We will need to rationalize these Jiras. Maybe you are calling out the difference between the Jiras as the entire table being encrypted here, rather than the individual columns in 4227? I think that if we need both levels of granularity, they need to be based on the same solution.

The key management aspect is one that we will need to sync on. The patch in HADOOP-9534 (CMF) is being refactored to support our API needs for acquiring keys for Hive encryption, and presumably for CryptoFS. Generally speaking, the nonce/IV, alias, and version indicator will be stored within the colstore in Hive for decryption - that is the current thinking, anyway. Support for multiple key revisions per alias will allow for rotation and rolling of keys within the datastores. CMF will provide pluggability for talking to key management/data protection providers: initially a JCEKS keystore, and eventually a central key management/data protection service for Hadoop. The central service will also provide pluggability for integrating third-party providers/solutions.

TableProperties is one way to indicate the need for data protection - we are looking at others as well - but of course I am currently looking at column-level indicators too. Let's figure out how to combine or consolidate these Jiras so that we can hopefully get a coherent set of patches to collaborate on in a branch.
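The per-ciphertext metadata described here - nonce/IV, alias, and a version indicator, enabling key rotation by storing multiple revisions per alias - could be modeled as a small record. A hedged sketch; the field names are invented and need not match the patch's `KeyMeta`:

```java
public class KeyMetaSketch {
    // Metadata stored alongside encrypted data; the key bytes themselves
    // live in the key management provider, never here.
    static final class KeyMeta {
        final String alias;  // which key protects this data
        final int version;   // which revision of that key (supports rotation)
        final byte[] iv;     // nonce/IV used for this particular ciphertext

        KeyMeta(String alias, int version, byte[] iv) {
            this.alias = alias;
            this.version = version;
            this.iv = iv;
        }
    }

    public static void main(String[] args) {
        // Old data keeps decrypting with version 2 even after the alias
        // has been rolled to version 3 for new writes.
        KeyMeta meta = new KeyMeta("sales_table_key", 2, new byte[16]);
        System.out.println(meta.alias + "@v" + meta.version);
    }
}
```

Because the metadata names the key rather than containing it, rolling a key means adding a new revision under the same alias and re-encrypting lazily, not rewriting metadata everywhere.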