[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904317#comment-15904317 ]

Lefty Leverenz commented on HIVE-8065:
--------------------------------------

The encryption branch was merged to trunk for release 1.1.0 (formerly known as 0.15). See HIVE-9264.

> Support HDFS encryption functionality on Hive
> ---------------------------------------------
>
>                 Key: HIVE-8065
>                 URL: https://issues.apache.org/jira/browse/HIVE-8065
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.13.1
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>              Labels: TODOC15
>             Fix For: 1.1.0
>
>
> The new encryption support on HDFS makes Hive incompatible and unusable when
> this feature is used.
> HDFS encryption is designed so that a user can configure different
> encryption zones (or directories) for multi-tenant environments. An
> encryption zone has an exclusive encryption key, such as AES-128 or AES-256.
> For security compliance, HDFS does not allow files to be moved or renamed
> between encryption zones. Renames are allowed only inside the same encryption
> zone. A copy is allowed between encryption zones.
> See HDFS-6134 for more details about HDFS encryption design.
> Hive currently uses a scratch directory (like /tmp/$user/$random). This
> scratch directory is used for the output of intermediate data (between MR
> jobs) and for the final output of the Hive query, which is later moved to the
> table directory location.
> If Hive tables are in different encryption zones than the scratch directory,
> then Hive won't be able to rename those files/directories, and this will make
> Hive unusable.
> To handle this problem, we can change the scratch directory of the
> query/statement to be inside the same encryption zone as the table directory
> location. This way, the renaming process will be successful.
> Also, for statements that move files between encryption zones (e.g. LOAD
> DATA), a copy may be executed instead of a rename. This will cause an
> overhead when copying large data files, but it won't break the encryption on
> Hive.
> Another security issue to consider is selects that use joins. If Hive joins
> different tables with different encryption key strengths, then the results of
> the select might break the security compliance of the tables. Let's say two
> tables with 128-bit and 256-bit encryption are joined; then the temporary
> results might be stored in the 128-bit encryption zone. This will temporarily
> conflict with the table encrypted with 256 bits.
> To fix this, Hive should be able to select the scratch directory that is more
> secured/encrypted in order to save the intermediate data temporarily with no
> compliance issues.
> For instance:
> {noformat}
> SELECT * FROM table-aes128 t1 JOIN table-aes256 t2 WHERE t1.id == t2.id;
> {noformat}
> - This should use a scratch directory (or staging directory) inside the
> table-aes256 table location.
> {noformat}
> INSERT OVERWRITE TABLE table-unencrypted SELECT * FROM table-aes1;
> {noformat}
> - This should use a scratch directory inside the table-aes1 location.
> {noformat}
> FROM table-unencrypted
> INSERT OVERWRITE TABLE table-aes128 SELECT id, name
> INSERT OVERWRITE TABLE table-aes256 SELECT id, name
> {noformat}
> - This should use a scratch directory on each of the tables' locations.
> - The first SELECT will have its scratch directory on the table-aes128 directory.
> - The second SELECT will have its scratch directory on the table-aes256 directory.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
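The scratch-directory rule proposed in the issue description can be sketched roughly as follows. This is only a model in plain Java and not Hive's implementation: the names StagingDirSketch, TableInfo and chooseStagingDir, the /tmp/hive default, and the .hive-staging suffix are assumptions made up for illustration, and key strength is reduced to an integer number of bits (0 meaning unencrypted).

```java
import java.util.Arrays;
import java.util.List;

final class StagingDirSketch {
    // Illustrative default scratch directory, used when no table is encrypted (assumption).
    static final String DEFAULT_SCRATCH = "/tmp/hive";

    /** Hypothetical table model: HDFS location plus key length in bits (0 = unencrypted). */
    static final class TableInfo {
        final String location;
        final int keyBits;

        TableInfo(String location, int keyBits) {
            this.location = location;
            this.keyBits = keyBits;
        }
    }

    /**
     * Picks a staging directory inside the location of the table with the
     * strongest key, or the default scratch directory if none is encrypted.
     */
    static String chooseStagingDir(List<TableInfo> tables) {
        TableInfo strongest = null;
        for (TableInfo t : tables) {
            if (t.keyBits > 0 && (strongest == null || t.keyBits > strongest.keyBits)) {
                strongest = t;
            }
        }
        if (strongest == null) {
            return DEFAULT_SCRATCH;
        }
        return strongest.location + "/.hive-staging";
    }

    public static void main(String[] args) {
        // Models the join of a 128-bit table with a 256-bit table.
        List<TableInfo> join = Arrays.asList(
            new TableInfo("/warehouse/table_aes128", 128),
            new TableInfo("/warehouse/table_aes256", 256));
        System.out.println(chooseStagingDir(join));
    }
}
```

For the multi-insert example, the same function would be called once per INSERT with only that INSERT's target table, so each branch stages inside its own zone.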
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15481156#comment-15481156 ]

Lefty Leverenz commented on HIVE-8065:
--------------------------------------

Nudge: [~spena], could you please document HDFS encryption in the wiki? Or if you don't have time, could you suggest someone else to do it? Thanks.
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272449#comment-15272449 ]

Sergio Peña commented on HIVE-8065:
-----------------------------------

[~lushuai] Hive does not have a DDL to encrypt a table. The patch here was only to support encrypted files & directories handled by HDFS.

The way you create an encrypted directory is:

1. Create an encryption key
$ hadoop key create <keyname>
2. Create an empty directory
$ hdfs dfs -mkdir <path>
3. Create an encryption zone (the directory must exist and be empty)
$ hdfs crypto -createZone -keyName <keyname> -path <path>
4. Create a table located in an encryption zone
hive> create table <tablename> (<columns>) location '<path>'

That's it.

The HDFS limitation is that you cannot rename or move files from an encryption zone to another encryption zone or to an unencrypted directory (you can only copy them). Hive supports that: it creates staging directories in the same encryption zones, and makes sure not to leak sensitive information out of the encryption zone.
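The rename restriction described in the comment above can be sketched as a small decision function. This is a simplified model, not Hive or HDFS code: MoveSketch, addZone, zoneOf and planMove are hypothetical names, zones are plain path prefixes, and an unencrypted path is treated as belonging to no zone.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class MoveSketch {
    enum Action { RENAME, COPY }

    // Hypothetical registry of encryption zone root directories.
    private final Set<String> zoneRoots = new LinkedHashSet<>();

    void addZone(String root) {
        zoneRoots.add(root);
    }

    /** Returns the zone root containing the path, or null if the path is unencrypted. */
    String zoneOf(String path) {
        for (String root : zoneRoots) {
            if (path.equals(root) || path.startsWith(root + "/")) {
                return root;
            }
        }
        return null;
    }

    /** A rename is possible only when both paths are in the same zone (or both in none). */
    Action planMove(String src, String dst) {
        String srcZone = zoneOf(src);
        String dstZone = zoneOf(dst);
        boolean sameZone = (srcZone == null) ? (dstZone == null) : srcZone.equals(dstZone);
        return sameZone ? Action.RENAME : Action.COPY;
    }

    public static void main(String[] args) {
        MoveSketch m = new MoveSketch();
        m.addZone("/zones/a");
        // Staging directory inside the table's zone: a rename is enough.
        System.out.println(m.planMove("/zones/a/.staging/part-0", "/zones/a/t/part-0"));
        // Scratch directory outside the zone: the data has to be copied.
        System.out.println(m.planMove("/tmp/scratch/part-0", "/zones/a/t/part-0"));
    }
}
```

The design point is that placing the staging directory inside the target zone turns the final move into the cheap RENAME case, while LOAD DATA from outside the zone falls into the COPY case.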
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272067#comment-15272067 ]

Lefty Leverenz commented on HIVE-8065:
--------------------------------------

Doc note: Added a TODOC15 label (for release 1.1.0). See doc note on HIVE-9264.

* [HIVE-9264 doc note | https://issues.apache.org/jira/browse/HIVE-9264?focusedCommentId=14283636&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14283636]
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271998#comment-15271998 ]

lushuai commented on HIVE-8065:
-------------------------------

How can I create an encrypted table, for example with DDL that specifies table encryption attributes and binds the table to an encryption zone? Would it be OK to do this by implementing a MetaStoreEventListener (onCreateTable, onDropTable, onAlterTable, etc.) in combination with HDFS transparent encryption?
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990507#comment-14990507 ]

Sergio Peña commented on HIVE-8065:
-----------------------------------

We can close this as it is already merged. We only need to move the sub-tasks that are not resolved out of this umbrella ticket.
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989592#comment-14989592 ]

Lars Francke commented on HIVE-8065:
------------------------------------

What's the status of this? Think we can mark it as resolved? The encryption-branch has been merged for version 1.1.0 and I think it's confusing that this is still listed as unresolved.
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543952#comment-14543952 ]

Sergio Peña commented on HIVE-8065:
-----------------------------------

Hi [~mmokhtar]

I did not test encryption on an HA environment. This error might be a configuration issue, or an issue in this code in {{Hadoop23Shims.java}}:

{code}
public boolean isPathEncrypted(Path path) throws IOException {
  Path fullPath;
  if (path.isAbsolute()) {
    fullPath = path;
  } else {
    fullPath = path.getFileSystem(conf).makeQualified(path);
  }
  return (hdfsAdmin.getEncryptionZoneForPath(fullPath) != null);
}
{code}

Maybe {{makeQualified}} is returning the full path using {{namenode:8020}}, and this is causing the error on {{getEncryptionZoneForPath}}. Can you edit your configuration file to remove the port number, and see if that works?

Regarding your note: the staging directory is not checked. It is the directory that is created and returned by the {{SemanticAnalyzer.getStagingDirectoryPathname}} method when a table location is found in an encryption zone. If not, then it just returns a normal scratch directory. Is there a place where you think the staging directory is checked for encryption?
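To make the hypothesis in the comment above concrete, the sketch below shows what a fully qualified URI with an explicit port looks like next to one without it. It only uses java.net.URI and is not a reproduction of the reported error or a fix for it; QualifiedPathSketch, hasExplicitPort and pathOnly are hypothetical helper names, and the host and paths are taken from the thread or invented for illustration.

```java
import java.net.URI;

final class QualifiedPathSketch {
    /** True when the URI authority carries an explicit port, e.g. namenode:8020. */
    static boolean hasExplicitPort(URI uri) {
        return uri.getPort() != -1;
    }

    /** Drops scheme and authority, keeping only the path component. */
    static String pathOnly(URI qualified) {
        return qualified.getPath();
    }

    public static void main(String[] args) {
        URI withPort = URI.create("hdfs://namenode:8020/warehouse/t1");
        URI withoutPort = URI.create("hdfs://namenode/warehouse/t1");

        // The two URIs name the same path but differ in their authority part.
        System.out.println(withPort.getAuthority());
        System.out.println(withoutPort.getAuthority());
        System.out.println(withPort.equals(withoutPort));

        // Both reduce to the same path once scheme and authority are removed.
        System.out.println(pathOnly(withPort).equals(pathOnly(withoutPort)));
    }
}
```

Whether the authority difference is what actually triggers the failure in an HA setup is exactly the open question in the thread; the sketch only shows where such a difference would come from.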
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544633#comment-14544633 ] Mostafa Mokhtar commented on HIVE-8065: --- Hi [~spena] These are configs I have, which look correct to me. {code} property namefs.defaultFS/name valuehdfs://namenode:8020/value /property {code} and {code} property namedfs.namenode.rpc-address.namenode.nn1/name valuenode1:8020/value /property property namedfs.namenode.rpc-address.namenode.nn2/name valuenode2:8020/value /property {code} Support HDFS encryption functionality on Hive - Key: HIVE-8065 URL: https://issues.apache.org/jira/browse/HIVE-8065 Project: Hive Issue Type: Improvement Affects Versions: 0.13.1 Reporter: Sergio Peña Assignee: Sergio Peña The new encryption support on HDFS makes Hive incompatible and unusable when this feature is used. HDFS encryption is designed so that an user can configure different encryption zones (or directories) for multi-tenant environments. An encryption zone has an exclusive encryption key, such as AES-128 or AES-256. Because of security compliance, the HDFS does not allow to move/rename files between encryption zones. Renames are allowed only inside the same encryption zone. A copy is allowed between encryption zones. See HDFS-6134 for more details about HDFS encryption design. Hive currently uses a scratch directory (like /tmp/$user/$random). This scratch directory is used for the output of intermediate data (between MR jobs) and for the final output of the hive query which is later moved to the table directory location. If Hive tables are in different encryption zones than the scratch directory, then Hive won't be able to renames those files/directories, and it will make Hive unusable. To handle this problem, we can change the scratch directory of the query/statement to be inside the same encryption zone of the table directory location. This way, the renaming process will be successful. 
Also, for statements that move files between encryption zones (e.g. LOAD DATA), a copy may be executed instead of a rename. This will cause overhead when copying large data files, but it won't break encryption on Hive.

Another security consideration arises with joins in SELECT statements. If Hive joins tables with different encryption key strengths, then the results of the select might break the security compliance of the tables. Say two tables with 128-bit and 256-bit encryption are joined; the temporary results might then be stored in the 128-bit encryption zone. This conflicts with the table encrypted with 256 bits, even if only temporarily.

To fix this, Hive should be able to select the scratch directory that is more secured/encrypted in order to save the intermediate data temporarily with no compliance issues.

For instance:
{noformat}
SELECT * FROM table-aes128 t1 JOIN table-aes256 t2 WHERE t1.id == t2.id;
{noformat}
- This should use a scratch directory (or staging directory) inside the table-aes256 table location.
{noformat}
INSERT OVERWRITE TABLE table-unencrypted SELECT * FROM table-aes1;
{noformat}
- This should use a scratch directory inside the table-aes1 location.
{noformat}
FROM table-unencrypted
INSERT OVERWRITE TABLE table-aes128 SELECT id, name
INSERT OVERWRITE TABLE table-aes256 SELECT id, name
{noformat}
- This should use a scratch directory in each of the tables' locations.
- The first SELECT will have its scratch directory in the table-aes128 directory.
- The second SELECT will have its scratch directory in the table-aes256 directory.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
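The rename-versus-copy rule from the issue description can be illustrated with a small, self-contained sketch. This is a toy model and not Hive or HDFS code: the class `ZoneMoveSketch`, its methods, and the example paths are all hypothetical, and encryption zones are modeled as plain path prefixes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: an encryption zone is represented by its root directory.
final class ZoneMoveSketch {
    // Hypothetical list of zone root directories (zones are assumed not to nest).
    private final List<String> zoneRoots = new ArrayList<>();

    void addZone(String rootDir) {
        zoneRoots.add(rootDir);
    }

    // Returns the root of the zone containing the path, or null if the path is unencrypted.
    String zoneOf(String path) {
        for (String root : zoneRoots) {
            if (path.equals(root) || path.startsWith(root + "/")) {
                return root;
            }
        }
        return null;
    }

    // A rename is possible only when source and destination share a zone
    // (or are both outside any zone); otherwise the data has to be copied.
    String moveStrategy(String src, String dst) {
        String srcZone = zoneOf(src);
        String dstZone = zoneOf(dst);
        boolean sameZone = (srcZone == null) ? dstZone == null : srcZone.equals(dstZone);
        return sameZone ? "RENAME" : "COPY";
    }

    public static void main(String[] args) {
        ZoneMoveSketch fs = new ZoneMoveSketch();
        fs.addZone("/warehouse/table_aes128");
        fs.addZone("/warehouse/table_aes256");
        // Scratch directory outside the table's zone: the final move becomes a copy.
        System.out.println(fs.moveStrategy("/tmp/hive/user/job1/part-0",
                                           "/warehouse/table_aes128/part-0"));
        // Staging directory inside the table's zone: a plain rename works.
        System.out.println(fs.moveStrategy("/warehouse/table_aes128/.staging/part-0",
                                           "/warehouse/table_aes128/part-0"));
    }
}
```

Placing the staging directory under the table location is what turns the final move into a rename in this model, which is the motivation given in the description.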
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544906#comment-14544906 ]

Brock Noland commented on HIVE-8065:
------------------------------------

I've not seen an HA configuration specify the port in the main config. I believe the testing for HA would not have had the port when in HA mode.
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538548#comment-14538548 ]

Mostafa Mokhtar commented on HIVE-8065:
---------------------------------------

[~spena]
I am running on a cluster which has Namenode HA enabled and I can't run queries. I am wondering whether Namenode HA was tested?

This is the call stack I am getting:
{code}
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:1866)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:1943)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStagingDirectoryPathname(SemanticAnalyzer.java:1975)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1788)
    ... 25 more
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://namenode:8020/apps/hive/warehouse/test_table, expected: hdfs://namenode
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1906)
    at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
    at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1245)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:1862)
    ... 28 more
{code}

I stepped through the debugger and found that the authorities mismatch:
    thisAuthority   namenode (id=86)
    thatAuthority   namenode:8020 (id=99)

Then, in FileSystem.checkPath, when the authorities mismatch, the code falls through and throws the exception:
{code}
        thatAuthority = uri.getAuthority();
        if (thisAuthority == thatAuthority ||       // authorities match
            (thisAuthority != null &&
             thisAuthority.equalsIgnoreCase(thatAuthority)))
          return;
      }
    }
    throw new IllegalArgumentException("Wrong FS: " + path + ", expected: " + this.getUri());
{code}

Is this an issue with HDFS encryption on Hive, or do you think this is a configuration issue?

On a related note, I don't think Hive should be checking whether the staging directory is encrypted if none of the Hive managed tables are encrypted.

Support HDFS encryption functionality on Hive
---------------------------------------------

Key: HIVE-8065
URL: https://issues.apache.org/jira/browse/HIVE-8065
Project: Hive
Issue Type: Improvement
Affects Versions: 0.13.1
Reporter: Sergio Peña
Assignee: Sergio Peña
Labels: Hive-Scrum
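A minimal sketch of the authority comparison reported in this comment, using only `java.net.URI` from the JDK. The class and method names (`AuthorityCheckSketch`, `sameAuthority`) are illustrative assumptions. Only the comparison quoted from `FileSystem.checkPath` is mirrored here; the real method performs further checks that are not reproduced.

```java
import java.net.URI;

// Illustration only: reproduces the quoted authority comparison, nothing else.
final class AuthorityCheckSketch {
    // True when both URIs carry the same authority, ignoring case.
    static boolean sameAuthority(URI fsUri, URI pathUri) {
        String thisAuthority = fsUri.getAuthority();
        String thatAuthority = pathUri.getAuthority();
        if (thisAuthority == null) {
            return thatAuthority == null;
        }
        return thisAuthority.equalsIgnoreCase(thatAuthority);
    }

    public static void main(String[] args) {
        URI fs = URI.create("hdfs://namenode");
        URI withPort = URI.create("hdfs://namenode:8020/apps/hive/warehouse/test_table");
        URI withoutPort = URI.create("hdfs://namenode/apps/hive/warehouse/test_table");

        // "namenode" versus "namenode:8020": the strings differ, so the check fails.
        System.out.println(sameAuthority(fs, withPort));
        // "namenode" versus "namenode": the check passes.
        System.out.println(sameAuthority(fs, withoutPort));
    }
}
```

Under this comparison the port is part of the authority string, so a table location stored with `:8020` does not match a file system URI that omits it.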
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529998#comment-14529998 ]

Brock Noland commented on HIVE-8065:
------------------------------------

In that case the results of the query are staged in ez1.
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531443#comment-14531443 ]

Sergio Peña commented on HIVE-8065:
-----------------------------------

Hey [~thejas]
Here are some answers about the issues:

1. If the encryption zone where the results will be written is read-only, then Hive will try to use the directory set by {{hive.exec.scratchdir}}, but only if that scratch directory is encrypted as well (see HIVE-8945). This might create a performance issue if the encrypted scratch directory is in a different encryption zone. The user may change that directory to a writable directory inside the same encryption zone to make the move faster. This might be a little tedious for users, but it is the only way to protect their data.

2. This is a little tricky. Currently, Hive selects the encryption zone that has the strongest cipher (aes128 vs aes256), and uses that location to store all final and intermediate results. This avoids writing intermediate data from aes256 to aes128 and then writing the final result back to aes256. Here we have another performance issue: final result files would be copied (not renamed) to the destination table, as the encryption zones might be different.

We did not do any work to deny access to stored results in another encryption zone. The solution only prevents encrypted data from touching non-encrypted zones or more weakly encrypted zones. Maybe other solutions, like Sentry, may work on this access control. But without an access control mechanism, this issue exists on the scratch directory, doesn't it?
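The "strongest cipher" selection described in point 2 can be sketched as follows. This is only an illustration under stated assumptions, not the code in `SemanticAnalyzer`: the class `StagingZoneSketch`, the method `strongestPath`, and the convention that an unencrypted table has a key length of 0 are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: picks the table location whose zone uses the longest key.
final class StagingZoneSketch {
    // Input maps each table path to its zone's key length in bits (0 = unencrypted, by assumption).
    // Returns the path with the largest key length, or null for an empty map.
    // On a tie, the first entry in iteration order wins.
    static String strongestPath(Map<String, Integer> keyBitsByTablePath) {
        String best = null;
        int bestBits = -1;
        for (Map.Entry<String, Integer> entry : keyBitsByTablePath.entrySet()) {
            if (entry.getValue() > bestBits) {
                bestBits = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> tables = new LinkedHashMap<>();
        tables.put("/warehouse/table_unencrypted", 0);
        tables.put("/warehouse/table_aes128", 128);
        tables.put("/warehouse/table_aes256", 256);
        // The staging directory would be created under the returned location.
        System.out.println(strongestPath(tables));
    }
}
```

Choosing by key length alone keeps intermediate data out of weaker zones, at the cost mentioned above: results destined for a different zone must be copied instead of renamed.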
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531401#comment-14531401 ]

Brock Noland commented on HIVE-8065:
------------------------------------

bq. Write permissions are now required to read from these tables.
Sergio can comment on how read-only tables are handled. We did think of this case.

bq. Sensitive data from one zone will be stored in another.
Note that file permissions are still enforced and zones are not meant to be an access control mechanism. For example, a user with appropriate permissions could copy data from one ez to another ez1. Nothing in this change alters that fact.
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531038#comment-14531038 ]

Thejas M Nair commented on HIVE-8065:
-------------------------------------

Yes, there seems to be a performance vs data protection tradeoff here.
Let's say EZ1 is supposed to have only aggregated information, and users expect people given access to EZ1 to have access only to aggregated information. But the temporary data (pre-aggregation) can contain more sensitive information from the inputs EZ2 and EZ3, which EZ1 users could potentially access.
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531055#comment-14531055 ]

Thejas M Nair commented on HIVE-8065:
-------------------------------------

Thinking more about the above insert case, the performance tradeoff is not necessary. The files written to HDFS before the move can still be written to EZ1 without any reduction in data protection, as they would contain data matching the final results. It is the intermediate data before that which can contain sensitive data (in the case of MR mode).

In the case of select * from tableEZ2 inner join tableEZ3, my understanding is that it uses one of EZ2 or EZ3 for the scratch dir. This creates two issues:
# Write permissions are now required to read from these tables.
# Sensitive data from one zone will be stored in another.
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529892#comment-14529892 ]

Eugene Koifman commented on HIVE-8065:
--------------------------------------

How come the move restriction is not an issue for something like
Insert Overwrite tableEZ1 select * from tableEZ2 inner join tableEZ3?
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529461#comment-14529461 ]

Brock Noland commented on HIVE-8065:
------------------------------------

bq. have you considered creating a single encrypted staging dir for all queries to use instead of creating new ones under the table namespace? (this could be owned by Hive and encrypted with Hive's key). If so, why did you choose the current design?

This approach does not work since you cannot move files across encryption zones.
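
The restriction behind this answer (renames succeed only within one encryption zone, while copies are allowed across zones) can be illustrated with a minimal sketch. It is not Hive or HDFS code: `zone_of`, `move_strategy`, and the zone paths are hypothetical names chosen for illustration, and the zone list is assumed to be known up front rather than queried from HDFS.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def zone_of(path: str, zone_roots: Iterable[str]) -> Optional[str]:
    """Return the zone root that contains `path`, or None if the path is unencrypted."""
    p = PurePosixPath(path)
    best: Optional[str] = None
    for root in zone_roots:
        r = PurePosixPath(root)
        # Compare whole path components so /warehouse/ez1 does not match /warehouse/ez10.
        if p == r or r in p.parents:
            if best is None or len(r.parts) > len(PurePosixPath(best).parts):
                best = root
    return best


def move_strategy(src: str, dst: str, zone_roots: Iterable[str]) -> str:
    """Return 'rename' when src and dst share a zone (or are both unencrypted), else 'copy'."""
    roots = list(zone_roots)
    return "rename" if zone_of(src, roots) == zone_of(dst, roots) else "copy"


# Hypothetical layout: two table zones plus one shared staging zone owned by Hive.
zones = ["/warehouse/ez1", "/warehouse/ez2", "/hive-staging"]
print(move_strategy("/hive-staging/q1/out", "/warehouse/ez1/t/part-0", zones))
print(move_strategy("/warehouse/ez1/t/.staging/out", "/warehouse/ez1/t/part-0", zones))
```

Under these assumptions, a single shared staging zone always sits in a different zone from the destination table, so the final move can never be a plain rename; a staging directory under the table location keeps it inside one zone.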
[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive
[ https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529378#comment-14529378 ]

Eugene Koifman commented on HIVE-8065:
--------------------------------------

[~spena], when implementing this, have you considered creating a single encrypted staging dir for all queries to use instead of creating new ones under the table namespace? (this could be owned by Hive and encrypted with Hive's key). If so, why did you choose the current design?

Some possible issues with the current design:
- Requires write permission on the table dir.
- delete-on-exit (on stagingdir) is not completely reliable as far as I know. This may leave files around.
- In a query like SELECT * FROM table-aes128 t1 JOIN table-aes256 t2 WHERE t1.id == t2.id; when the staging dir is created under table-aes256, someone who has a key for this EZ may read data (in theory at least) that came from table-aes128 even if they don't have a key for the EZ which contains table-aes128.

thanks
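
For reference, the selection rule from the issue description (store intermediate data under the most strongly encrypted table involved in the statement) can be sketched as follows. This is a minimal illustration, not the Hive implementation: the `Table` record, the `key_bits` field (0 meaning unencrypted), the `/warehouse/...` locations, and the `.staging` directory name are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Table:
    name: str
    location: str   # table directory (hypothetical paths in this sketch)
    key_bits: int   # encryption key strength; 0 means unencrypted (assumption)


def pick_staging_dir(tables: Sequence[Table], staging_name: str = ".staging") -> str:
    """Return a staging path under the most strongly encrypted table's location."""
    if not tables:
        raise ValueError("at least one table is required")
    # max() keeps the first table on ties, so the choice is deterministic.
    strongest = max(tables, key=lambda t: t.key_bits)
    return f"{strongest.location.rstrip('/')}/{staging_name}"


aes128 = Table("table-aes128", "/warehouse/table-aes128", 128)
aes256 = Table("table-aes256", "/warehouse/table-aes256", 256)

# The join from the description: intermediate data goes under table-aes256.
print(pick_staging_dir([aes128, aes256]))
```

The sketch only covers where the staging directory is placed. It does not address the concerns raised in the comment above, namely write permission on the table directory, cleanup of leftover staging files, and readers of the stronger zone seeing data that originated in the other zone.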