[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2017-03-09 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904317#comment-15904317
 ] 

Lefty Leverenz commented on HIVE-8065:
--

The encryption branch was merged to trunk for release 1.1.0 (formerly known as 
0.15).  See HIVE-9264.

> Support HDFS encryption functionality on Hive
> -
>
> Key: HIVE-8065
> URL: https://issues.apache.org/jira/browse/HIVE-8065
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 0.13.1
> Reporter: Sergio Peña
> Assignee: Sergio Peña
> Labels: TODOC15
> Fix For: 1.1.0
>
>
> The new encryption support in HDFS makes Hive incompatible and unusable when 
> this feature is used.
> HDFS encryption is designed so that a user can configure different 
> encryption zones (or directories) for multi-tenant environments. Each 
> encryption zone has its own encryption key, such as AES-128 or AES-256. 
> For security compliance, HDFS does not allow files to be moved or renamed 
> between encryption zones. Renames are allowed only inside the same encryption 
> zone. A copy is allowed between encryption zones.
> See HDFS-6134 for more details about the HDFS encryption design.
> Hive currently uses a scratch directory (like /tmp/$user/$random). This 
> scratch directory is used for the output of intermediate data (between MR 
> jobs) and for the final output of the Hive query, which is later moved to the 
> table directory location.
> If Hive tables are in different encryption zones than the scratch directory, 
> then Hive won't be able to rename those files/directories, which makes 
> Hive unusable.
> To handle this problem, we can change the scratch directory of the 
> query/statement to be inside the same encryption zone as the table directory 
> location. This way, the rename will succeed. 
> Also, for statements that move files between encryption zones (e.g. LOAD 
> DATA), a copy may be executed instead of a rename. This adds overhead 
> when copying large data files, but it won't break encryption in 
> Hive.
> Another security consideration arises with joins. If Hive joins 
> tables with different encryption key strengths, then the results of 
> the select might break the security compliance of the tables. For example, if 
> two tables with 128-bit and 256-bit encryption are joined, the temporary 
> results might be stored in the 128-bit encryption zone. This conflicts 
> with the compliance requirements of the table encrypted with 256 bits.
> To fix this, Hive should select the scratch directory that is more strongly 
> encrypted, so that intermediate data is stored temporarily with no 
> compliance issues.
> For instance:
> {noformat}
> SELECT * FROM table-aes128 t1 JOIN table-aes256 t2 WHERE t1.id == t2.id;
> {noformat}
> - This should use a scratch directory (or staging directory) inside the 
> table-aes256 table location.
> {noformat}
> INSERT OVERWRITE TABLE table-unencrypted SELECT * FROM table-aes1;
> {noformat}
> - This should use a scratch directory inside the table-aes1 location.
> {noformat}
> FROM table-unencrypted
> INSERT OVERWRITE TABLE table-aes128 SELECT id, name
> INSERT OVERWRITE TABLE table-aes256 SELECT id, name
> {noformat}
> - This should use a scratch directory in each of the tables' locations.
> - The first SELECT will have its scratch directory in the table-aes128 directory.
> - The second SELECT will have its scratch directory in the table-aes256 directory.
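
The selection rule proposed in the description (put intermediate data under the most strongly encrypted table involved in the query) can be sketched in Java as follows. This is only an illustration under stated assumptions: `ScratchDirChooser`, `TableInfo`, the key-size field, and the example paths are hypothetical names, not Hive classes, and Hive's actual implementation may differ.

```java
import java.util.Comparator;
import java.util.List;

final class ScratchDirChooser {

    /** Hypothetical table model: location plus key size in bits (0 = unencrypted). */
    static final class TableInfo {
        final String location;
        final int keyBits;

        TableInfo(String location, int keyBits) {
            this.location = location;
            this.keyBits = keyBits;
        }
    }

    /**
     * Returns the location of the most strongly encrypted table, or
     * defaultScratch when none of the tables is encrypted.
     */
    static String chooseScratchParent(List<TableInfo> tables, String defaultScratch) {
        return tables.stream()
                .filter(t -> t.keyBits > 0)                                // encrypted tables only
                .max(Comparator.comparingInt((TableInfo t) -> t.keyBits)) // strongest key wins
                .map(t -> t.location)
                .orElse(defaultScratch);
    }
}
```

For the join example above, passing a 128-bit table and a 256-bit table would select the location of the 256-bit table; with only unencrypted tables, the default scratch directory is returned.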



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2016-09-11 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15481156#comment-15481156
 ] 

Lefty Leverenz commented on HIVE-8065:
--

Nudge:  [~spena], could you please document HDFS encryption in the wiki?  Or if 
you don't have time, could you suggest someone else to do it?  Thanks.



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2016-05-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272449#comment-15272449
 ] 

Sergio Peña commented on HIVE-8065:
---

[~lushuai] Hive does not have a DDL to encrypt a table. The patch here was only 
to support encrypted files & directories handled by HDFS.

The way you create an encrypted directory is:

1. Create an encryption key
$ hadoop key create <key-name>

2. Create an empty directory
$ hdfs dfs -mkdir <path>

3. Create an encryption zone (the directory must exist and be empty)
$ hdfs crypto -createZone -keyName <key-name> -path <path>

4. Create a table located in the encryption zone
hive> create table <table> (<columns>) location '<path>'

That's it. The HDFS limitation is that you cannot rename or move files from an 
encryption zone to another encryption zone or to an unencrypted directory (you 
can only copy them). Hive handles that: it creates staging directories in the 
same encryption zone as the table and makes sure not to leak sensitive 
information out of the encryption zone.
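
The rename-or-copy rule described in this comment can be sketched in plain Java. This is a simplified model, not Hive or HDFS code: `MoveStrategy`, `zoneOf`, and `decide` are hypothetical names, and encryption zones are represented only as a list of zone root paths supplied by the caller.

```java
import java.util.List;
import java.util.Objects;

final class MoveStrategy {

    enum Action { RENAME, COPY }

    /** Returns the zone root that contains path, or null if the path is unencrypted. */
    static String zoneOf(String path, List<String> zoneRoots) {
        for (String root : zoneRoots) {
            // Match the root itself or anything below it; "/zones/ab" is not inside "/zones/a".
            if (path.equals(root) || path.startsWith(root + "/")) {
                return root;
            }
        }
        return null;
    }

    /** A rename is only possible when source and destination share the same zone (or have none). */
    static Action decide(String src, String dst, List<String> zoneRoots) {
        String srcZone = zoneOf(src, zoneRoots);
        String dstZone = zoneOf(dst, zoneRoots);
        return Objects.equals(srcZone, dstZone) ? Action.RENAME : Action.COPY;
    }
}
```

Under this model, a staging directory created inside the table's own zone always yields `RENAME`, which is why placing the staging directory next to the table avoids the copy overhead.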



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2016-05-05 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272067#comment-15272067
 ] 

Lefty Leverenz commented on HIVE-8065:
--

Doc note:  Added a TODOC15 label (for release 1.1.0).  See doc note on 
HIVE-9264.

* [HIVE-9264 doc note | 
https://issues.apache.org/jira/browse/HIVE-9264?focusedCommentId=14283636&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14283636]



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2016-05-05 Thread lushuai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271998#comment-15271998
 ] 

lushuai commented on HIVE-8065:
---

How can I create an encrypted table, for example with DDL that specifies the 
table's encryption attributes and binds the table to an encryption zone?
Would it be OK to implement a MetaStoreEventListener (onCreateTable, 
onDropTable, onAlterTable, etc.) in combination with transparent encryption?




[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-11-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990507#comment-14990507
 ] 

Sergio Peña commented on HIVE-8065:
---

We can close this as it is already merged. We only need to move the sub-tasks 
that are not resolved out of this umbrella ticket.



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-11-04 Thread Lars Francke (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989592#comment-14989592
 ] 

Lars Francke commented on HIVE-8065:


What's the status of this? Think we can mark it as resolved? The 
encryption-branch has been merged for version 1.1.0 and I think it's confusing 
that this is still listed as unresolved.



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543952#comment-14543952
 ] 

Sergio Peña commented on HIVE-8065:
---

Hi [~mmokhtar]
I did not test encryption on an HA environment.

This error might be a configuration issue, or an issue in this code in 
{{Hadoop23Shims.java}}:
{code}
public boolean isPathEncrypted(Path path) throws IOException {
  Path fullPath;
  if (path.isAbsolute()) {
    fullPath = path;
  } else {
    fullPath = path.getFileSystem(conf).makeQualified(path);
  }
  return (hdfsAdmin.getEncryptionZoneForPath(fullPath) != null);
}
{code}

Maybe, {{makeQualified}} is returning the full path using {{namenode:8020}}, 
and this is causing the error on {{getEncryptionZoneForPath}}. 
Can you edit your configuration file to remove the port number, and see if that 
works?
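
To make the difference between the two path forms concrete, here is a small sketch using only `java.net.URI` from the JDK. It does not confirm that the port is the cause of the error; it only shows what a qualified URI looks like with and without the port. `UriPortDemo`, `withoutPort`, and the example path are hypothetical names for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriPortDemo {

    /** Rebuilds the URI with the same components but no explicit port. */
    static URI withoutPort(URI uri) {
        try {
            // A port of -1 means "undefined" for java.net.URI.
            return new URI(uri.getScheme(), uri.getUserInfo(), uri.getHost(), -1,
                           uri.getPath(), uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rebuild URI: " + uri, e);
        }
    }
}
```

For `hdfs://namenode:8020/warehouse/t`, `withoutPort` yields `hdfs://namenode/warehouse/t`, while `getPath()` is `/warehouse/t` in both cases.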

Regarding your note: the staging directory is not checked. This is the 
directory that is created and returned by the 
{{SemanticAnalyzer.getStagingDirectoryPathname}} method when a table location 
is found in an encryption zone. Otherwise, the method just returns a normal 
scratch directory. Is there a place where you think the staging directory is 
checked for encryption?



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-14 Thread Mostafa Mokhtar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544633#comment-14544633
 ] 

Mostafa Mokhtar commented on HIVE-8065:
---

Hi [~spena]

These are configs I have, which look correct to me.
{code}
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode:8020</value>
</property>
{code}

and

{code}
<property>
  <name>dfs.namenode.rpc-address.namenode.nn1</name>
  <value>node1:8020</value>
</property>

<property>
  <name>dfs.namenode.rpc-address.namenode.nn2</name>
  <value>node2:8020</value>
</property>
{code}



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-14 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544906#comment-14544906
 ] 

Brock Noland commented on HIVE-8065:


I've not seen an HA configuration specify the port in the main config. I 
believe the HA testing did not include the port in the main config. 



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-11 Thread Mostafa Mokhtar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538548#comment-14538548
 ] 

Mostafa Mokhtar commented on HIVE-8065:
---

[~spena]

I am running on a cluster that has NameNode HA enabled and I can't run 
queries. Was NameNode HA tested?

This is the call stack I am getting:
{code}
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:1866)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:1943)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStagingDirectoryPathname(SemanticAnalyzer.java:1975)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1788)
... 25 more
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://namenode:8020/apps/hive/warehouse/test_table, expected: hdfs://namenode
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1906)
at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1245)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:1862)
... 28 more
{code}

I stepped through with the debugger and found that the authorities do not match:
thisAuthority namenode (id=86)
thatAuthority namenode:8020 (id=99)

Then, in FileSystem.checkPath, when the authorities do not match, the code falls 
through and throws the exception:
{code}
      thatAuthority = uri.getAuthority();
      if (thisAuthority == thatAuthority ||   // authorities match
          (thisAuthority != null &&
           thisAuthority.equalsIgnoreCase(thatAuthority)))
        return;
    }
  }
  throw new IllegalArgumentException("Wrong FS: " + path +
                                     ", expected: " + this.getUri());
{code}

Is this an issue with HDFS encryption on Hive, or do you think this is a 
configuration issue?

On a related note, I don't think Hive should check whether the staging 
directory is encrypted if none of the Hive-managed tables are encrypted.
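
The mismatch can be reproduced outside Hadoop with plain java.net.URI. The sketch below is a simplified stand-in for the authority comparison quoted above, not the full FileSystem.checkPath logic (which has further branches not shown in the excerpt); the class and method names are made up for this illustration.

```java
import java.net.URI;

public class AuthorityCheckSketch {
    /** Simplified version of the authority comparison from the quoted excerpt. */
    public static boolean sameAuthority(URI fsUri, URI pathUri) {
        String thisAuthority = fsUri.getAuthority();
        String thatAuthority = pathUri.getAuthority();
        // The reference comparison also covers the case where both are null.
        return thisAuthority == thatAuthority
                || (thisAuthority != null && thisAuthority.equalsIgnoreCase(thatAuthority));
    }

    public static void main(String[] args) {
        URI fs = URI.create("hdfs://namenode");
        URI withPort = URI.create("hdfs://namenode:8020/apps/hive/warehouse/test_table");
        URI withoutPort = URI.create("hdfs://namenode/apps/hive/warehouse/test_table");

        // "namenode" vs "namenode:8020": the strings differ, so the check fails.
        System.out.println(sameAuthority(fs, withPort));     // false
        // "namenode" vs "namenode": the check passes.
        System.out.println(sameAuthority(fs, withoutPort));  // true
    }
}
```

The authority string includes the port when one is present, so a path carrying ":8020" never equals a file system URI that omits it under this comparison.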


[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-06 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529998#comment-14529998
 ] 

Brock Noland commented on HIVE-8065:


In that case the results of the query are staged in ez1. 



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-06 Thread Sergio Peña (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531443#comment-14531443
 ] 

Sergio Peña commented on HIVE-8065:
---

Hey [~thejas]

Here are some answers about the issues:

1. If the encrypted zone where the results will be written is read-only, then 
Hive will try to use the directory set by {{hive.exec.scratchdir}}, but only if 
that scratch directory is encrypted as well (see HIVE-8945). This might create a 
performance issue if the encrypted scratch directory is in a different 
encryption zone. The user may change that directory to a writable directory 
inside the same encryption zone to make the move faster. This might be a little 
tedious for users, but it is the only way to protect their data.

2. This is a little tricky. Currently, Hive selects the encryption zone that 
has the strongest cipher (aes128 vs aes256), and uses that location to 
store all final and intermediate results. This avoids writing intermediate data 
to a weaker zone (aes256 to aes128) and then writing the final result back to 
aes256. Here we have another performance issue: final result files would be 
copied (not renamed) to the destination table, as the encryption zones might be 
different.

We did not do any work to deny access to results stored in another encryption 
zone. The solution only prevents encrypted data from touching non-encrypted 
zones or more weakly encrypted zones. Maybe other solutions, like Sentry, could 
handle this access control. But without an access control mechanism, this issue 
exists on the scratch directory, doesn't it?
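
The copy-versus-rename cost mentioned in point 2 follows from the HDFS rule that renames are only allowed inside one encryption zone. A minimal Java sketch of that decision is below; the class, the Strategy enum and the string zone identifiers (null meaning "not in any encryption zone") are hypothetical names for this illustration and are not taken from Hive or HDFS.

```java
import java.util.Objects;

public class MoveStrategySketch {
    /** How staged result files reach the destination table. */
    public enum Strategy { RENAME, COPY }

    /**
     * Picks RENAME when staging and destination are in the same zone
     * (or both outside any zone), and COPY otherwise.
     */
    public static Strategy chooseStrategy(String stagingZone, String destinationZone) {
        return Objects.equals(stagingZone, destinationZone)
                ? Strategy.RENAME
                : Strategy.COPY;
    }

    public static void main(String[] args) {
        // Staging in the aes256 zone, destination in the aes128 zone: data must be copied.
        System.out.println(chooseStrategy("ez-aes256", "ez-aes128"));
        // Staging and destination share a zone: a cheap rename is enough.
        System.out.println(chooseStrategy("ez-aes256", "ez-aes256"));
    }
}
```

A copy rewrites every byte of the result files, which is where the performance cost for large outputs comes from.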





[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-06 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531401#comment-14531401
 ] 

Brock Noland commented on HIVE-8065:


bq. Write permissions are now required to read from these tables.

Sergio can comment how read-only tables are handled. We did think of this case.

bq. Sensitive data from one zone will be stored in another.

Note that file permissions are still enforced and zones are not meant to be an 
access control mechanism. For example, a user with appropriate permissions 
could copy data from one EZ to another, such as EZ1. Nothing in this change 
alters that fact.



[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-06 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531038#comment-14531038
 ] 

Thejas M Nair commented on HIVE-8065:
-

Yes, there seems to be a performance vs. data protection tradeoff here. 
Let's say EZ1 is supposed to hold only aggregated information, and users 
expect people given access to EZ1 to have access only to aggregated information.
But the temporary (pre-aggregation) data can contain more sensitive information 
from the inputs EZ2 and EZ3, which EZ1 users could potentially access.




[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-06 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531055#comment-14531055
 ] 

Thejas M Nair commented on HIVE-8065:
-

Thinking more about the insert case above, the performance tradeoff is not 
necessary. The files written to HDFS before the move can still be written to EZ1 
without any reduction in data protection, as they would contain data matching 
the final results. It is the intermediate data before that step that can contain 
sensitive data (in the case of MR mode).

In the case of select * from tableEZ2 inner join tableEZ3, my understanding is 
that it uses one of EZ2 or EZ3 for the scratch dir. This creates two issues: 
 # Write permissions are now required to read from these tables.
 # Sensitive data from one zone will be stored in another. 




[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-05 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529892#comment-14529892
 ] 

Eugene Koifman commented on HIVE-8065:
--

How come the move restriction is not an issue for something like Insert 
Overwrite tableEZ1 select * from tableEZ2 inner join tableEZ3?

 Support HDFS encryption functionality on Hive
 -

 Key: HIVE-8065
 URL: https://issues.apache.org/jira/browse/HIVE-8065
 Project: Hive
  Issue Type: Improvement
Affects Versions: 0.13.1
Reporter: Sergio Peña
Assignee: Sergio Peña
  Labels: Hive-Scrum


[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-05 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529461#comment-14529461
 ] 

Brock Noland commented on HIVE-8065:


bq. have you considered creating a single encrypted staging dir for all queries 
to use instead of creating new ones under the table namespace? (this could be 
owned by Hive and encrypted with Hive's key). If so, why did you choose the 
current design?

This approach does not work since you cannot move files across encryption zones.


[jira] [Commented] (HIVE-8065) Support HDFS encryption functionality on Hive

2015-05-05 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529378#comment-14529378
 ] 

Eugene Koifman commented on HIVE-8065:
--

[~spena], when implementing this, did you consider creating a single 
encrypted staging dir for all queries to use instead of creating new ones under 
the table namespace?  (This could be owned by Hive and encrypted with Hive's 
key.)  If so, why did you choose the current design?

Some possible issues with the current design:
 - It requires write permission on the table dir.
 - delete-on-exit (on the staging dir) is not completely reliable as far as I 
 know.  This may leave files around.
 - In a query like SELECT * FROM table-aes128 t1 JOIN table-aes256 t2 WHERE 
 t1.id == t2.id; when the staging dir is created under table-aes256, someone 
 who has a key for this EZ may read data (in theory at least) that came from 
 table-aes128 even if they don't have a key for the EZ which contains 
 table-aes128.

thanks
