Re: Review Request 65478: HIVE-18553 VectorizedParquetReader fails after adding a new column to table

2018-02-04 Thread Jerry Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65478/#review196793
---




ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/ParquetDataColumnReaderFactory.java
Lines 92 (patched)
<https://reviews.apache.org/r/65478/#comment276629>

For the types that don't support any type conversion, we can simply return the 
realReader instead of a wrapper that does nothing around it.
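
To illustrate the suggestion, here is a minimal sketch; the interface shape and 
class names below are assumptions for illustration, not the actual patch code:

    // Sketch only: names and signatures are assumptions, not the patch itself.
    interface ParquetDataColumnReader {
      long readLong();
      double readDouble();
    }

    // A converting wrapper is only needed when the file type and the
    // requested Hive type differ.
    final class Int64ToDoubleReader implements ParquetDataColumnReader {
      private final ParquetDataColumnReader realReader;
      Int64ToDoubleReader(ParquetDataColumnReader realReader) {
        this.realReader = realReader;
      }
      @Override public long readLong() { return realReader.readLong(); }
      @Override public double readDouble() { return (double) realReader.readLong(); }
    }

    final class ReaderFactory {
      static ParquetDataColumnReader create(boolean needsConversion,
                                            ParquetDataColumnReader realReader) {
        // Types that need no conversion get the realReader directly, skipping
        // the wrapper allocation and an extra virtual call per value.
        return needsConversion
            ? new Int64ToDoubleReader(realReader)
            : realReader;
      }
    }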


- Jerry Chen


On Feb. 2, 2018, 8:46 a.m., cheng xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65478/
> ---
> 
> (Updated Feb. 2, 2018, 8:46 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> VectorizedParquetReader throws an exception when trying to read from a 
> Parquet table to which new columns have been added.
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/BaseVectorizedColumnReader.java
>  907a9b8 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/DefaultParquetDataColumnReader.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/ParquetDataColumnReaderFactory.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedDummyColumnReader.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java
>  08ac57b 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedColumnReader.java
>  9e414dc 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedDictionaryEncodingColumnReader.java
>  3e5d831 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java
>  5d3ebd6 
>   ql/src/test/queries/clientpositive/schema_evol_par_vec_table.q PRE-CREATION 
>   ql/src/test/results/clientpositive/schema_evol_par_vec_table.q.out 
> PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/65478/diff/1/
> 
> 
> Testing
> ---
> 
> Newly added UT passed and qtest passed locally.
> 
> 
> Thanks,
> 
> cheng xu
> 
>



Re: Review Request 65478: HIVE-18553 VectorizedParquetReader fails after adding a new column to table

2018-02-05 Thread Jerry Chen


> On Feb. 5, 2018, 5:46 p.m., Vihang Karajgaonkar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/ParquetDataColumnReaderFactory.java
> > Lines 24 (patched)
> > 
> >
> > Do we need to override the methods for other readers as well? What are the 
> > criteria for identifying the methods that need to be overridden for this 
> > and TypesFromInt64PageReader?

The current patch from Ferdinand follows the same conversion principles as 
ETypeConverter in Hive. The basic rule implemented in ETypeConverter is that a 
low-precision data type can be converted to a higher-precision data type: int32 
can be converted to int64, float, or double; int64 can be converted to float or 
double; float can be converted to double.

Although other conversions, such as int64 to int32 or double to float, could 
logically be supported, they are not always safe.
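
To make the rule concrete, here is a minimal sketch with assumed shapes (not 
the actual TypesFromInt64PageReader from the patch):

    // Sketch: a reader over int64 data overrides only the widening reads.
    interface PageReader {
      long readLong();
      float readFloat();
      double readDouble();
    }

    final class Int64PageReader implements PageReader {
      private final long[] page; // stand-in for a decoded int64 Parquet page
      private int pos;
      Int64PageReader(long[] page) { this.page = page; }

      @Override public long readLong() { return page[pos++]; }
      // int64 widens to float and double, matching the ETypeConverter rule...
      @Override public float readFloat() { return (float) page[pos++]; }
      @Override public double readDouble() { return (double) page[pos++]; }
      // ...but no readInt() is offered: int64 -> int32 may overflow, so the
      // narrowing direction is deliberately unsupported.
    }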


- Jerry


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65478/#review196718
---


On Feb. 5, 2018, 8:46 a.m., cheng xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65478/
> ---
> 
> (Updated Feb. 5, 2018, 8:46 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> VectorizedParquetReader throws an exception when trying to read from a 
> Parquet table to which new columns have been added.
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/BaseVectorizedColumnReader.java
>  907a9b8 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/ParquetDataColumnReaderFactory.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedDummyColumnReader.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java
>  08ac57b 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedColumnReader.java
>  9e414dc 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java
>  5d3ebd6 
>   ql/src/test/queries/clientpositive/schema_evol_par_vec_table.q PRE-CREATION 
>   ql/src/test/results/clientpositive/schema_evol_par_vec_table.q.out 
> PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/65478/diff/2/
> 
> 
> Testing
> ---
> 
> Newly added UT passed and qtest passed locally.
> 
> 
> Thanks,
> 
> cheng xu
> 
>



[jira] [Created] (HIVE-5207) Support data encryption for Hive tables

2013-09-03 Thread Jerry Chen (JIRA)
Jerry Chen created HIVE-5207:


 Summary: Support data encryption for Hive tables
 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen


For sensitive and legally protected data such as personal information, it is 
common practice to store the data encrypted in the file system. Enabling Hive 
to store and query encrypted data is crucial for enterprise Hive data analysis.
 
When creating a table, the user can specify whether it is an encrypted table by 
specifying a property in TBLPROPERTIES. Once an encrypted table is created, 
querying it is transparent as long as the corresponding key management 
facilities are set in the running environment of the query. We can use the 
Hadoop crypto support provided by HADOOP-9331 for the underlying data 
encryption and decryption.
 
As to key management, we would support several common key management use cases. 
First, the table key (data key) can be stored in the Hive metastore, associated 
with the table in its properties. The table key can be explicitly specified or 
auto-generated, and will be encrypted with a master key. In cases where the 
data being processed is generated by other applications, we need to support 
externally managed or imported table keys. Also, the data generated by Hive may 
be consumed by other applications in the system, so we need a tool or command 
for exporting the table key to a Java keystore for external use.
 
To handle versions of Hadoop that do not have crypto support, we can avoid 
compilation problems by segregating crypto API usage into separate files 
(shims) to be included only if a flag is defined on the Ant command line 
(something like -Dcrypto=true).
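
As a rough illustration of the master-key scheme, here is a minimal sketch 
using plain JCE key wrapping; the class and method names are hypothetical, and 
the actual patch relies on the HADOOP-9331 crypto APIs instead:

{code:java}
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Sketch of the envelope idea: a per-table data key is generated (or
// supplied), wrapped with the externally managed master key, and only the
// wrapped bytes would be stored in the table properties in the metastore.
public final class TableKeyEnvelope {

  // Generate a fresh table (data) key; it could also be explicitly specified.
  static SecretKey newTableKey() throws Exception {
    KeyGenerator kg = KeyGenerator.getInstance("AES");
    kg.init(128);
    return kg.generateKey();
  }

  // Wrap the table key with the master key before storing it.
  static byte[] wrap(SecretKey masterKey, SecretKey tableKey) throws Exception {
    Cipher c = Cipher.getInstance("AESWrap");
    c.init(Cipher.WRAP_MODE, masterKey);
    return c.wrap(tableKey);
  }

  // Unwrap at query time, once the key management facilities provide the
  // master key.
  static SecretKey unwrap(SecretKey masterKey, byte[] wrapped) throws Exception {
    Cipher c = Cipher.getInstance("AESWrap");
    c.init(Cipher.UNWRAP_MODE, masterKey);
    return (SecretKey) c.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
  }
}
{code}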


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables

2013-09-18 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770520#comment-13770520
 ] 

Jerry Chen commented on HIVE-5207:
--

HIVE-4227 is specifically about adding column-level encryption to ORC files. As 
we all know, Hive tables support various other formats such as text file, 
sequence file, RC file, and Avro file. HIVE-5207 targets encryption support 
across these file formats as a common problem.

As to key management, from the user's perspective one rational approach is one 
table key per encrypted table. Under that model, it is natural to associate the 
key with TBLPROPERTIES.

 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
   Original Estimate: 504h
  Remaining Estimate: 504h


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-5207) Support data encryption for Hive tables

2013-09-25 Thread Jerry Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Chen updated HIVE-5207:
-

Attachment: HIVE-5207.patch

Attaching the patch for reference. It depends on the Hadoop crypto feature.

 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: HIVE-5207.patch

   Original Estimate: 504h
  Remaining Estimate: 504h


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-5207) Support data encryption for Hive tables

2013-09-26 Thread Jerry Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Chen updated HIVE-5207:
-

Attachment: HIVE-5207.patch

Corrected the typo pointed out by Larry. Thanks, Larry.

 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: HIVE-5207.patch, HIVE-5207.patch

   Original Estimate: 504h
  Remaining Estimate: 504h


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HIVE-3934) Put tag in value for join with map reduce

2013-01-23 Thread Jerry Chen (JIRA)
Jerry Chen created HIVE-3934:


 Summary: Put tag in value for join with map reduce
 Key: HIVE-3934
 URL: https://issues.apache.org/jira/browse/HIVE-3934
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor, Serializers/Deserializers
Affects Versions: 0.11.0
Reporter: Jerry Chen


While trying to facilitate hash-based map reduce, I found that for joins with 
map reduce in Hive, the tag is appended to the key writable. This is quite a 
hindrance to other runtime implementations of the map reduce computation model, 
such as hash-based map reduce. For example, when the tag is in the key, several 
special cases must be handled:

1. HiveKey must handle the hash code specially in order to properly partition 
the keys between the reducers.
2. The key in map reduce's view is actually key + tag, which makes the map 
reduce sort compulsory to satisfy Hive's need to group keys on the reduce side. 
This disables or hinders hash-based map reduce, because grouping by key + tag 
makes no sense to Hive.
3. ExecReducer must detect the real key boundary by stripping out the tag for 
the startGroup and endGroup calls to the operator. Without the tag, each reduce 
call is a natural key boundary.

Consider appending the tag as the last byte of the value writable instead; this 
avoids all of the above and fits naturally into the map reduce computation 
model (see the sketch below).

I see code in JoinOperator that generates join results early, relying on the 
fact that the tag is sorted. This is only useful when there are very many rows 
with the same key in both join tables, which is not the common case.

Let's discuss the possibility of the tag-in-value approach.
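
A minimal sketch of the proposed value layout (the helper below is 
hypothetical, not Hive code):

{code:java}
import org.apache.hadoop.io.BytesWritable;

// Sketch: the join input tag rides as the last byte of the value writable, so
// the shuffle key stays a pure join key. Partitioning then hashes the real
// key, and every reduce(...) call is a natural key-group boundary.
public final class TaggedValue {

  static BytesWritable tag(byte[] value, byte tag) {
    byte[] out = new byte[value.length + 1];
    System.arraycopy(value, 0, out, 0, value.length);
    out[value.length] = tag; // append the tag as the last byte
    return new BytesWritable(out);
  }

  static byte readTag(BytesWritable tagged) {
    return tagged.getBytes()[tagged.getLength() - 1];
  }

  static byte[] stripTag(BytesWritable tagged) {
    byte[] out = new byte[tagged.getLength() - 1];
    System.arraycopy(tagged.getBytes(), 0, out, 0, out.length);
    return out;
  }
}
{code}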

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables

2013-10-13 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13793897#comment-13793897
 ] 

Jerry Chen commented on HIVE-5207:
--

Hi Larry, thanks for pointing out the docs. Yes, we will add more javadocs and 
documentation as our next step.
 
{quote}1. TwoTieredKey - exactly the purpose, how it's used, what the tiers 
are, etc{quote}
TwoTieredKey is used for the case where the table key is stored in the Hive 
metastore. The table key is encrypted with the master key, which is provided 
externally. In this case, the user maintains and manages only the master key 
externally, rather than managing all the table keys externally. This is useful 
when there is no full-fledged key management system available.
 
{quote}2. External KeyManagement integration - where and what is the expected 
contract for this integration{quote}
To integrate with an external key management system, we use the KeyProvider 
interface from HADOOP-9331. An implementation of the KeyProvider interface for 
a specific key management system can be configured as the KeyProvider used for 
retrieving keys.
 
{quote}3. A specific usecase description for exporting keys into an external 
keystore and who has the authority to initiate the export and where the 
password comes from{quote}
Exporting the internal keys is done through the Hive command line. As the 
internal table keys are encrypted with the master key, the master key must be 
provided in the environment, which is controlled by the user, when performing 
the export. If the master key is not available, the encrypted table keys 
cannot be decrypted and thus cannot be exported. The KeyProvider implementation 
for retrieving the master key can provide its own authentication and 
authorization to decide whether the current user has access to a specific key. 
A sketch of the export step follows below.
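
For illustration, here is a minimal sketch of such an export using the standard 
Java KeyStore API; the flow and names are assumptions, not the actual Hive 
command:

{code:java}
import java.io.FileOutputStream;
import java.security.KeyStore;
import javax.crypto.SecretKey;

// Sketch: write an already-decrypted table key into a password-protected
// JCEKS keystore so external applications can consume Hive's output.
public final class TableKeyExporter {

  static void export(SecretKey tableKey, String alias,
                     char[] storePassword, String path) throws Exception {
    KeyStore ks = KeyStore.getInstance("JCEKS"); // JKS cannot hold secret keys
    ks.load(null, storePassword);                // start from an empty keystore
    ks.setEntry(alias,
        new KeyStore.SecretKeyEntry(tableKey),
        new KeyStore.PasswordProtection(storePassword));
    try (FileOutputStream out = new FileOutputStream(path)) {
      ks.store(out, storePassword);
    }
  }
}
{code}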
 
{quote}4. An explanation as to why we should ever store the key with the data 
which seems like a bad idea. I understand that it is encrypted with the master 
secret - which takes me to the next question.  {quote}
Strictly speaking, the key is not stored with the data; the table key is stored 
in the Hive metastore. I see your point on this question. As mentioned, this is 
useful for use cases where no full-fledged, ready-to-use key management system 
is available. We provide several alternatives for managing keys. When creating 
an encrypted table, the user can specify whether the key is managed externally 
or internally. For externally managed keys, only the key name (alias) is stored 
in the Hive metastore, and the key is retrieved through the KeyProvider set in 
the configuration.
 
{quote}5. Where is the master secret established and stored and how is it 
protected{quote}
Currently, we assume that the user manages the master key. For example, for 
simple use cases, the user can store the master key in a Java KeyStore that is 
protected by a password and stored in a folder readable only by specific users 
or groups. The user can also store the master key in another key management 
system, since the master key is retrieved through the KeyProvider.
 
Really appreciate your time reviewing this.
Thanks

 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: HIVE-5207.patch, HIVE-5207.patch

   Original Estimate: 504h
  Remaining Estimate: 504h


[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables

2013-10-14 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794816#comment-13794816
 ] 

Jerry Chen commented on HIVE-5207:
--

{quote}This patch won't compile, because Hive has to work when used with Hadoop 
1.x. The shims are used to support multiple versions of Hadoop (Hadoop 0.20, 
Hadoop 1.x, Hadoop 0.23, Hadoop 2.x) depending on what is installed on the host 
system.{quote}
This patch depends on the crypto feature added by HADOOP-9331 and others. For 
this patch to compile against Hadoop versions that do not include the crypto 
feature, the crypto code is kept behind the -Dcrypto=true flag so the feature 
can be left disabled. I do understand that this approach still doesn't align 
with the goal of supporting multiple Hadoop versions with a single compile of 
Hive.
{quote}Furthermore, this seems like the wrong direction. What is the advantage 
of this rather large patch over using the cfs work? If the user defines a table 
in cfs all of the table's data will be encrypted.{quote}
I agree that the CFS work has value in its API transparency, and it is good 
stuff. We are working on CFS, but it is not available yet. Meanwhile, the work 
here is already in use by our users, who rely on it to protect sensitive data 
on their clusters while being able to transparently decrypt the data when 
running jobs that process it.

On the other hand, we see that compression codecs are already widely used by 
the various file formats Hive supports. The issue may be that the current 
approach depends on changes to specific file formats for handling encryption 
key contexts. One possible direction is to make the encryption codec work 
exactly like a compression codec, so that Hive can use a codec for encryption 
or decryption without any changes to the file formats or to Hive itself. If we 
can do that, it adds the same value as a compression codec; a sketch of this 
idea follows below.
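
To make the direction concrete, here is a minimal sketch of a stream-wrapping 
encryption codec. It is simplified (fixed IV, no header handling) and does not 
implement the full Hadoop CompressionCodec interface; the class name is 
hypothetical:

{code:java}
import java.io.InputStream;
import java.io.OutputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Sketch: like a compression codec, the encryption codec only wraps the raw
// streams, so file formats and Hive would not need any changes.
public final class AesCodecSketch {
  private final SecretKey key;
  private final IvParameterSpec iv;

  public AesCodecSketch(SecretKey key, byte[] iv) {
    this.key = key;
    this.iv = new IvParameterSpec(iv);
  }

  public OutputStream createOutputStream(OutputStream raw) throws Exception {
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.ENCRYPT_MODE, key, iv);
    return new CipherOutputStream(raw, c); // format writes through unchanged
  }

  public InputStream createInputStream(InputStream raw) throws Exception {
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.DECRYPT_MODE, key, iv);
    return new CipherInputStream(raw, c);  // format reads through unchanged
  }
}
{code}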


 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: HIVE-5207.patch, HIVE-5207.patch

   Original Estimate: 504h
  Remaining Estimate: 504h




--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HIVE-17632) Build Hive with JDK9

2017-09-28 Thread Jerry Chen (JIRA)
Jerry Chen created HIVE-17632:
-

 Summary: Build Hive with JDK9
 Key: HIVE-17632
 URL: https://issues.apache.org/jira/browse/HIVE-17632
 Project: Hive
  Issue Type: Improvement
  Components: Build Infrastructure
Affects Versions: 3.0.0
Reporter: Jerry Chen


JDK 9 has recently been released with a number of improvements, such as support 
for AVX-512, which can bring performance benefits on Skylake servers.
We expect that users will soon try JDK 9 and build Hadoop on it. Currently it 
is not clear what issues users will face when building Hive on JDK 9. This JIRA 
can serve as the umbrella JIRA to track all these issues.

http://jdk.java.net/9/




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)