Re: Review Request 65478: HIVE-18553 VectorizedParquetReader fails after adding a new column to table

2018-02-04 Thread Jerry Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65478/#review196793
---




ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/ParquetDataColumnReaderFactory.java
Lines 92 (patched)
<https://reviews.apache.org/r/65478/#comment276629>

For the types that don't support any type conversion, we can simply return the 
realReader instead of a wrapper that does nothing around it.
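
To illustrate the suggestion, here is a minimal sketch; the interface shape and 
class names below are assumptions for illustration, not the actual patch code:

    // Sketch only: names and signatures are assumptions, not the patch itself.
    interface ParquetDataColumnReader {
      long readLong();
      double readDouble();
    }

    // A converting wrapper is only needed when the file type and the
    // requested Hive type differ.
    final class Int64ToDoubleReader implements ParquetDataColumnReader {
      private final ParquetDataColumnReader realReader;
      Int64ToDoubleReader(ParquetDataColumnReader realReader) {
        this.realReader = realReader;
      }
      @Override public long readLong() { return realReader.readLong(); }
      @Override public double readDouble() { return (double) realReader.readLong(); }
    }

    final class ReaderFactory {
      static ParquetDataColumnReader create(boolean needsConversion,
                                            ParquetDataColumnReader realReader) {
        // Types that need no conversion get the realReader directly, skipping
        // the wrapper allocation and an extra virtual call per value.
        return needsConversion
            ? new Int64ToDoubleReader(realReader)
            : realReader;
      }
    }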


- Jerry Chen


On Feb. 2, 2018, 8:46 a.m., cheng xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65478/
> ---
> 
> (Updated Feb. 2, 2018, 8:46 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> VectorizedParquetReader throws an exception when trying to read from a 
> Parquet table to which new columns have been added.
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/BaseVectorizedColumnReader.java
>  907a9b8 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/DefaultParquetDataColumnReader.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/ParquetDataColumnReaderFactory.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedDummyColumnReader.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java
>  08ac57b 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedColumnReader.java
>  9e414dc 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedDictionaryEncodingColumnReader.java
>  3e5d831 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java
>  5d3ebd6 
>   ql/src/test/queries/clientpositive/schema_evol_par_vec_table.q PRE-CREATION 
>   ql/src/test/results/clientpositive/schema_evol_par_vec_table.q.out 
> PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/65478/diff/1/
> 
> 
> Testing
> ---
> 
> Newly added UT passed and qtest passed locally.
> 
> 
> Thanks,
> 
> cheng xu
> 
>



Re: Review Request 65478: HIVE-18553 VectorizedParquetReader fails after adding a new column to table

2018-02-05 Thread Jerry Chen


> On Feb. 5, 2018, 5:46 p.m., Vihang Karajgaonkar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/ParquetDataColumnReaderFactory.java
> > Lines 24 (patched)
> > 
> >
> > Do we need to override the methods for other readers as well? What are the 
> > criteria for identifying the methods that need to be overridden for this 
> > and TypesFromInt64PageReader?

The current patch from Ferdinand follows the same conversion principles as 
ETypeConverter in Hive. The basic rule implemented in ETypeConverter is that a 
low-precision data type can be converted to a higher-precision data type: int32 
can be converted to int64, float, or double; int64 can be converted to float or 
double; float can be converted to double.

Although other conversions, such as int64 to int32 or double to float, could 
logically be supported, they are not always safe.
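
To make the rule concrete, here is a minimal sketch with assumed shapes (not 
the actual TypesFromInt64PageReader from the patch):

    // Sketch: a reader over int64 data overrides only the widening reads.
    interface PageReader {
      long readLong();
      float readFloat();
      double readDouble();
    }

    final class Int64PageReader implements PageReader {
      private final long[] page; // stand-in for a decoded int64 Parquet page
      private int pos;
      Int64PageReader(long[] page) { this.page = page; }

      @Override public long readLong() { return page[pos++]; }
      // int64 widens to float and double, matching the ETypeConverter rule...
      @Override public float readFloat() { return (float) page[pos++]; }
      @Override public double readDouble() { return (double) page[pos++]; }
      // ...but no readInt() is offered: int64 -> int32 may overflow, so the
      // narrowing direction is deliberately unsupported.
    }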


- Jerry


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65478/#review196718
---


On Feb. 5, 2018, 8:46 a.m., cheng xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65478/
> ---
> 
> (Updated Feb. 5, 2018, 8:46 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> VectorizedParquetReader throws an exception when trying to read from a 
> Parquet table to which new columns have been added.
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/BaseVectorizedColumnReader.java
>  907a9b8 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/ParquetDataColumnReaderFactory.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedDummyColumnReader.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java
>  08ac57b 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedColumnReader.java
>  9e414dc 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java
>  5d3ebd6 
>   ql/src/test/queries/clientpositive/schema_evol_par_vec_table.q PRE-CREATION 
>   ql/src/test/results/clientpositive/schema_evol_par_vec_table.q.out 
> PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/65478/diff/2/
> 
> 
> Testing
> ---
> 
> Newly added UT passed and qtest passed locally.
> 
> 
> Thanks,
> 
> cheng xu
> 
>



[jira] [Created] (HIVE-5207) Support data encryption for Hive tables

2013-09-03 Thread Jerry Chen (JIRA)
Jerry Chen created HIVE-5207:


 Summary: Support data encryption for Hive tables
 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen


For sensitive and legally protected data such as personal information, it is 
common practice to store the data encrypted in the file system. Enabling Hive 
to store and query encrypted data is crucial for enterprise Hive data analysis.
 
When creating a table, the user can specify whether it is an encrypted table by 
specifying a property in TBLPROPERTIES. Once an encrypted table is created, 
querying it is transparent as long as the corresponding key management 
facilities are set in the running environment of the query. We can use the 
Hadoop crypto support provided by HADOOP-9331 for the underlying data 
encryption and decryption.
 
As to key management, we would support several common key management use cases. 
First, the table key (data key) can be stored in the Hive metastore, associated 
with the table in its properties. The table key can be explicitly specified or 
auto-generated, and will be encrypted with a master key. In cases where the 
data being processed is generated by other applications, we need to support 
externally managed or imported table keys. Also, the data generated by Hive may 
be consumed by other applications in the system, so we need a tool or command 
for exporting the table key to a Java keystore for external use.
 
To handle versions of Hadoop that do not have crypto support, we can avoid 
compilation problems by segregating crypto API usage into separate files 
(shims) to be included only if a flag is defined on the Ant command line 
(something like -Dcrypto=true).
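
As a rough illustration of the master-key scheme, here is a minimal sketch 
using plain JCE key wrapping; the class and method names are hypothetical, and 
the actual patch relies on the HADOOP-9331 crypto APIs instead:

{code:java}
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Sketch of the envelope idea: a per-table data key is generated (or
// supplied), wrapped with the externally managed master key, and only the
// wrapped bytes would be stored in the table properties in the metastore.
public final class TableKeyEnvelope {

  // Generate a fresh table (data) key; it could also be explicitly specified.
  static SecretKey newTableKey() throws Exception {
    KeyGenerator kg = KeyGenerator.getInstance("AES");
    kg.init(128);
    return kg.generateKey();
  }

  // Wrap the table key with the master key before storing it.
  static byte[] wrap(SecretKey masterKey, SecretKey tableKey) throws Exception {
    Cipher c = Cipher.getInstance("AESWrap");
    c.init(Cipher.WRAP_MODE, masterKey);
    return c.wrap(tableKey);
  }

  // Unwrap at query time, once the key management facilities provide the
  // master key.
  static SecretKey unwrap(SecretKey masterKey, byte[] wrapped) throws Exception {
    Cipher c = Cipher.getInstance("AESWrap");
    c.init(Cipher.UNWRAP_MODE, masterKey);
    return (SecretKey) c.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
  }
}
{code}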


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables

2013-09-18 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770520#comment-13770520
 ] 

Jerry Chen commented on HIVE-5207:
--

HIVE-4227 is specifically about adding column-level encryption to ORC files. As 
we all know, Hive tables support various other formats such as text file, 
sequence file, RC file, and Avro file. HIVE-5207 targets encryption support 
across these file formats as a common problem.

As to key management, from the user's perspective one rational approach is one 
table key per encrypted table. Under that model, it is natural to associate the 
key with TBLPROPERTIES.

 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
   Original Estimate: 504h
  Remaining Estimate: 504h


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-5207) Support data encryption for Hive tables

2013-09-25 Thread Jerry Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Chen updated HIVE-5207:
-

Attachment: HIVE-5207.patch

Attaching the patch for reference. It depends on the Hadoop crypto feature.

 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: HIVE-5207.patch

   Original Estimate: 504h
  Remaining Estimate: 504h


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-5207) Support data encryption for Hive tables

2013-09-26 Thread Jerry Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Chen updated HIVE-5207:
-

Attachment: HIVE-5207.patch

Corrected the typo pointed out by Larry. Thanks, Larry.

 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: HIVE-5207.patch, HIVE-5207.patch

   Original Estimate: 504h
  Remaining Estimate: 504h


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HIVE-3934) Put tag in value for join with map reduce

2013-01-23 Thread Jerry Chen (JIRA)
Jerry Chen created HIVE-3934:


 Summary: Put tag in value for join with map reduce
 Key: HIVE-3934
 URL: https://issues.apache.org/jira/browse/HIVE-3934
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor, Serializers/Deserializers
Affects Versions: 0.11.0
Reporter: Jerry Chen


While trying to facilitate hash-based map reduce, I found that for joins with 
map reduce in Hive, the tag is appended to the key writable. This is quite a 
hindrance to other runtime implementations of the map reduce computation model, 
such as hash-based map reduce. For example, when the tag is in the key, several 
special cases must be handled:

1. HiveKey must handle the hash code specially in order to properly partition 
the keys between the reducers.
2. The key in map reduce's view is actually key + tag, which makes the map 
reduce sort compulsory to satisfy Hive's need to group keys on the reduce side. 
This disables or hinders hash-based map reduce, because grouping by key + tag 
makes no sense to Hive.
3. ExecReducer must detect the real key boundary by stripping out the tag for 
the startGroup and endGroup calls to the operator. Without the tag, each reduce 
call is a natural key boundary.

Consider appending the tag as the last byte of the value writable instead; this 
avoids all of the above and fits naturally into the map reduce computation 
model (see the sketch below).

I see code in JoinOperator that generates join results early, relying on the 
fact that the tag is sorted. This is only useful when there are very many rows 
with the same key in both join tables, which is not the common case.

Let's discuss the possibility of the tag-in-value approach.
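
A minimal sketch of the proposed value layout (the helper below is 
hypothetical, not Hive code):

{code:java}
import org.apache.hadoop.io.BytesWritable;

// Sketch: the join input tag rides as the last byte of the value writable, so
// the shuffle key stays a pure join key. Partitioning then hashes the real
// key, and every reduce(...) call is a natural key-group boundary.
public final class TaggedValue {

  static BytesWritable tag(byte[] value, byte tag) {
    byte[] out = new byte[value.length + 1];
    System.arraycopy(value, 0, out, 0, value.length);
    out[value.length] = tag; // append the tag as the last byte
    return new BytesWritable(out);
  }

  static byte readTag(BytesWritable tagged) {
    return tagged.getBytes()[tagged.getLength() - 1];
  }

  static byte[] stripTag(BytesWritable tagged) {
    byte[] out = new byte[tagged.getLength() - 1];
    System.arraycopy(tagged.getBytes(), 0, out, 0, out.length);
    return out;
  }
}
{code}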

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables

2013-10-13 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13793897#comment-13793897
 ] 

Jerry Chen commented on HIVE-5207:
--

Hi Larry, thanks for pointing out the docs. Yes, we will add more javadocs and 
documentation as our next step.
 
{quote}1. TwoTieredKey - exactly the purpose, how it's used, what the tiers 
are, etc{quote}
TwoTieredKey is used for the case where the table key is stored in the Hive 
metastore. The table key is encrypted with the master key, which is provided 
externally. In this case, the user maintains and manages only the master key 
externally, rather than managing all the table keys externally. This is useful 
when there is no full-fledged key management system available.
 
{quote}2. External KeyManagement integration - where and what is the expected 
contract for this integration{quote}
To integrate with an external key management system, we use the KeyProvider 
interface from HADOOP-9331. An implementation of the KeyProvider interface for 
a specific key management system can be configured as the KeyProvider used for 
retrieving keys.
 
{quote}3. A specific usecase description for exporting keys into an external 
keystore and who has the authority to initiate the export and where the 
password comes from{quote}
Exporting the internal keys is done through the Hive command line. As the 
internal table keys are encrypted with the master key, the master key must be 
provided in the environment, which is controlled by the user, when performing 
the export. If the master key is not available, the encrypted table keys 
cannot be decrypted and thus cannot be exported. The KeyProvider implementation 
for retrieving the master key can provide its own authentication and 
authorization to decide whether the current user has access to a specific key. 
A sketch of the export step follows below.
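
For illustration, here is a minimal sketch of such an export using the standard 
Java KeyStore API; the flow and names are assumptions, not the actual Hive 
command:

{code:java}
import java.io.FileOutputStream;
import java.security.KeyStore;
import javax.crypto.SecretKey;

// Sketch: write an already-decrypted table key into a password-protected
// JCEKS keystore so external applications can consume Hive's output.
public final class TableKeyExporter {

  static void export(SecretKey tableKey, String alias,
                     char[] storePassword, String path) throws Exception {
    KeyStore ks = KeyStore.getInstance("JCEKS"); // JKS cannot hold secret keys
    ks.load(null, storePassword);                // start from an empty keystore
    ks.setEntry(alias,
        new KeyStore.SecretKeyEntry(tableKey),
        new KeyStore.PasswordProtection(storePassword));
    try (FileOutputStream out = new FileOutputStream(path)) {
      ks.store(out, storePassword);
    }
  }
}
{code}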
 
{quote}4. An explanation as to why we should ever store the key with the data 
which seems like a bad idea. I understand that it is encrypted with the master 
secret - which takes me to the next question.  {quote}
Strictly speaking, the key is not stored with the data; the table key is stored 
in the Hive metastore. I see your point on this question. As mentioned, this is 
useful for use cases where no full-fledged, ready-to-use key management system 
is available. We provide several alternatives for managing keys. When creating 
an encrypted table, the user can specify whether the key is managed externally 
or internally. For externally managed keys, only the key name (alias) is stored 
in the Hive metastore, and the key is retrieved through the KeyProvider set in 
the configuration.
 
{quote}5. Where is the master secret established and stored and how is it 
protected{quote}
Currently, we assume that the user manages the master key. For example, for 
simple use cases, the user can store the master key in a Java KeyStore that is 
protected by a password and stored in a folder readable only by specific users 
or groups. The user can also store the master key in another key management 
system, since the master key is retrieved through the KeyProvider.
 
Really appreciate your time reviewing this.
Thanks

 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: HIVE-5207.patch, HIVE-5207.patch

   Original Estimate: 504h
  Remaining Estimate: 504h


[jira] [Commented] (HIVE-5207) Support data encryption for Hive tables

2013-10-14 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794816#comment-13794816
 ] 

Jerry Chen commented on HIVE-5207:
--

{quote}This patch won't compile, because Hive has to work when used with Hadoop 
1.x. The shims are used to support multiple versions of Hadoop (Hadoop 0.20, 
Hadoop 1.x, Hadoop 0.23, Hadoop 2.x) depending on what is installed on the host 
system.{quote}
This patch depends on the crypto feature added by HADOOP-9331 and others. For 
this patch to compile against Hadoop versions that do not include the crypto 
feature, the crypto code is kept behind the -Dcrypto=true flag so the feature 
can be left disabled. I do understand that this approach still doesn't align 
with the goal of supporting multiple Hadoop versions with a single compile of 
Hive.
{quote}Furthermore, this seems like the wrong direction. What is the advantage 
of this rather large patch over using the cfs work? If the user defines a table 
in cfs all of the table's data will be encrypted.{quote}
I agree that the CFS work has value in its API transparency, and it is good 
stuff. We are working on CFS, but it is not available yet. Meanwhile, the work 
here is already in use by our users, who rely on it to protect sensitive data 
on their clusters while being able to transparently decrypt the data when 
running jobs that process it.

On the other hand, we see that compression codecs are already widely used by 
the various file formats Hive supports. The issue may be that the current 
approach depends on changes to specific file formats for handling encryption 
key contexts. One possible direction is to make the encryption codec work 
exactly like a compression codec, so that Hive can use a codec for encryption 
or decryption without any changes to the file formats or to Hive itself. If we 
can do that, it adds the same value as a compression codec; a sketch of this 
idea follows below.
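
To make the direction concrete, here is a minimal sketch of a stream-wrapping 
encryption codec. It is simplified (fixed IV, no header handling) and does not 
implement the full Hadoop CompressionCodec interface; the class name is 
hypothetical:

{code:java}
import java.io.InputStream;
import java.io.OutputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Sketch: like a compression codec, the encryption codec only wraps the raw
// streams, so file formats and Hive would not need any changes.
public final class AesCodecSketch {
  private final SecretKey key;
  private final IvParameterSpec iv;

  public AesCodecSketch(SecretKey key, byte[] iv) {
    this.key = key;
    this.iv = new IvParameterSpec(iv);
  }

  public OutputStream createOutputStream(OutputStream raw) throws Exception {
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.ENCRYPT_MODE, key, iv);
    return new CipherOutputStream(raw, c); // format writes through unchanged
  }

  public InputStream createInputStream(InputStream raw) throws Exception {
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.DECRYPT_MODE, key, iv);
    return new CipherInputStream(raw, c);  // format reads through unchanged
  }
}
{code}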


 Support data encryption for Hive tables
 ---

 Key: HIVE-5207
 URL: https://issues.apache.org/jira/browse/HIVE-5207
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: HIVE-5207.patch, HIVE-5207.patch

   Original Estimate: 504h
  Remaining Estimate: 504h




--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HIVE-17632) Build Hive with JDK9

2017-09-28 Thread Jerry Chen (JIRA)
Jerry Chen created HIVE-17632:
-

 Summary: Build Hive with JDK9
 Key: HIVE-17632
 URL: https://issues.apache.org/jira/browse/HIVE-17632
 Project: Hive
  Issue Type: Improvement
  Components: Build Infrastructure
Affects Versions: 3.0.0
Reporter: Jerry Chen


JDK 9 has recently been released with a number of improvements, such as support 
for AVX-512, which can bring performance benefits on Skylake servers.
We expect that users will soon try JDK 9 and build Hadoop on it. Currently it 
is not clear what issues users will face when building Hive on JDK 9. This JIRA 
can serve as the umbrella JIRA to track all these issues.

http://jdk.java.net/9/




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)