[GitHub] [parquet-mr] shangxinli opened a new pull request #796: Parquet-1872: Add TransCompression command to parquet-tools

2020-06-11 Thread GitBox


shangxinli opened a new pull request #796:
URL: https://github.com/apache/parquet-mr/pull/796


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (PARQUET-1875) Add bloom filter support

2020-06-11 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1875:


 Summary: Add bloom filter support 
 Key: PARQUET-1875
 URL: https://issues.apache.org/jira/browse/PARQUET-1875
 Project: Parquet
  Issue Type: Sub-task
Reporter: Xinli Shang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1873) Add to Parquet-tools

2020-06-11 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1873:


 Summary: Add to Parquet-tools 
 Key: PARQUET-1873
 URL: https://issues.apache.org/jira/browse/PARQUET-1873
 Project: Parquet
  Issue Type: Sub-task
Reporter: Xinli Shang
Assignee: Xinli Shang








[jira] [Created] (PARQUET-1874) Add to parquet-cli

2020-06-11 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1874:


 Summary: Add to parquet-cli
 Key: PARQUET-1874
 URL: https://issues.apache.org/jira/browse/PARQUET-1874
 Project: Parquet
  Issue Type: Sub-task
Reporter: Xinli Shang








[jira] [Created] (PARQUET-1872) Add TransCompression command

2020-06-11 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1872:


 Summary: Add TransCompression command 
 Key: PARQUET-1872
 URL: https://issues.apache.org/jira/browse/PARQUET-1872
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Xinli Shang
Assignee: Xinli Shang


As ZSTD becomes more popular, there is a need to convert existing data to ZSTD 
compression, which can achieve a higher compression ratio. It would be useful to 
have a tool that converts a Parquet file directly by just decompressing and 
recompressing each page, without decoding/encoding or assembling records, 
because this is much faster. Initial results show it is ~5 times faster. 
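
The page-level idea described above can be sketched roughly as follows. This is a minimal illustration only, not the code from the pull request: the class and method names are hypothetical, and because the JDK ships no ZSTD codec, GZIP stands in for the existing codec and zlib/DEFLATE for the target codec. The point it shows is that each page is decompressed and recompressed as opaque bytes, with no value decoding or record assembly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch: translate compressed pages between two codecs.
class PageTranscodeSketch {

  // Read a stream to the end and return its bytes.
  static byte[] drain(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }

  // Stand-in for the existing codec: GZIP compression.
  static byte[] gzip(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (OutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  // Stand-in for the existing codec: GZIP decompression.
  static byte[] gunzip(byte[] compressed) {
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      return drain(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Stand-in for the target codec (ZSTD in the Jira): zlib/DEFLATE compression.
  static byte[] deflate(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (OutputStream df = new DeflaterOutputStream(out)) {
      df.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  // Stand-in for the target codec: zlib/DEFLATE decompression.
  static byte[] inflate(byte[] compressed) {
    try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
      return drain(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Translate one page: decompress, then recompress. The bytes in between
  // are never interpreted, so no decoding or record assembly takes place.
  static byte[] transcodePage(byte[] gzipPage) {
    return deflate(gunzip(gzipPage));
  }

  // Translate every page of a (hypothetical) column chunk in order.
  static List<byte[]> transcodeAll(List<byte[]> gzipPages) {
    List<byte[]> out = new ArrayList<>();
    for (byte[] page : gzipPages) {
      out.add(transcodePage(page));
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] page = "encoded page bytes, treated as opaque".getBytes(StandardCharsets.UTF_8);
    List<byte[]> translated = transcodeAll(Arrays.asList(gzip(page)));
    // The payload survives the codec change unchanged.
    System.out.println(Arrays.equals(page, inflate(translated.get(0))));
  }
}
```

A real tool would also have to rewrite page headers and file metadata (compressed sizes, offsets, codec name), which this sketch leaves out.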







[jira] [Resolved] (PARQUET-1821) Add 'column-size' command to parquet-cli and parquet-tools

2020-06-11 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-1821.
--
Resolution: Fixed

> Add 'column-size' command to parquet-cli and parquet-tools
> --
>
> Key: PARQUET-1821
> URL: https://issues.apache.org/jira/browse/PARQUET-1821
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> To decide which column to remove when running the 'prune' command, one 
> factor to check is the size of each column. This Jira adds a command that 
> reports the size of each column both in bytes and as a percentage. It 
> can be thought of as the width of each column. 
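
The arithmetic behind such a report can be sketched as below. This is an illustration under assumptions, not the 'column-size' implementation: the class name, method name, column names, and byte counts are hypothetical, and the percentage is assumed to be each column's share of the summed column sizes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: per-column size in bytes and as a share of the total.
class ColumnSizeSketch {

  // Returns each column's share of the summed sizes, as a percentage.
  static Map<String, Double> percentages(Map<String, Long> bytesPerColumn) {
    long total = 0;
    for (long size : bytesPerColumn.values()) {
      total += size;
    }
    Map<String, Double> result = new LinkedHashMap<>();
    for (Map.Entry<String, Long> e : bytesPerColumn.entrySet()) {
      // Guard against an empty file so we never divide by zero.
      double pct = total == 0 ? 0.0 : 100.0 * e.getValue() / total;
      result.put(e.getKey(), pct);
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up sizes for two made-up columns.
    Map<String, Long> sizes = new LinkedHashMap<>();
    sizes.put("id", 250L);
    sizes.put("payload", 750L);
    percentages(sizes).forEach((column, pct) ->
        System.out.printf("%s: %d bytes (%.1f%%)%n", column, sizes.get(column), pct));
  }
}
```

Columns with a large share are the natural candidates to examine before running 'prune'.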





[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #615: PARQUET-1373: Encryption key tools

2020-06-11 Thread GitBox


gszadovszky commented on a change in pull request #615:
URL: https://github.com/apache/parquet-mr/pull/615#discussion_r435236094



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/FileKeyUnwrapper.java
##
@@ -0,0 +1,216 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+
+package org.apache.parquet.crypto.keytools;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.nio.charset.StandardCharsets;
+import java.util.Base64;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.parquet.crypto.DecryptionKeyRetriever;
+import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
+import org.apache.parquet.crypto.keytools.KeyToolkit.KeyWithMasterID;
+import org.codehaus.jackson.map.ObjectMapper;
+import org.codehaus.jackson.type.TypeReference;
+
+import static org.apache.parquet.crypto.keytools.KeyToolkit.stringIsEmpty;
+
+public class FileKeyUnwrapper implements DecryptionKeyRetriever {
+  // For every token: a map of KEK_ID to KEK bytes
+  private static final ConcurrentMap>> KEKMapPerToken = new ConcurrentHashMap<>();

Review comment:
   As per the Java naming conventions for constants (`static final`):
   ```suggestion
 private static final ConcurrentMap>> KEK_MAP_PER_TOKEN = new ConcurrentHashMap<>();
   
   ```

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/FileKeyUnwrapper.java
##
@@ -0,0 +1,216 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+
+package org.apache.parquet.crypto.keytools;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.nio.charset.StandardCharsets;
+import java.util.Base64;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.parquet.crypto.DecryptionKeyRetriever;
+import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
+import org.apache.parquet.crypto.keytools.KeyToolkit.KeyWithMasterID;
+import org.codehaus.jackson.map.ObjectMapper;
+import org.codehaus.jackson.type.TypeReference;
+
+import static org.apache.parquet.crypto.keytools.KeyToolkit.stringIsEmpty;
+
+public class FileKeyUnwrapper implements DecryptionKeyRetriever {
+  // For every token: a map of KEK_ID to KEK bytes
+  private static final ConcurrentMap>> KEKMapPerToken = new ConcurrentHashMap<>();
+  private volatile static long lastKekCacheCleanupTimestamp = System.currentTimeMillis() + 60l * 1000; // grace period of 1 minute
+  //A map of KEK_ID to KEK - for the current token
+  private final ConcurrentMap KEKPerKekID;

Review comment:
   nit: should start with a lowercase letter. Although the naming conventions 
for acronyms are not clear, I would vote for the following.
   ```suggestion
 private final ConcurrentMap kekPerKekID;
   ```
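
The two naming suggestions above can be seen side by side in a small standalone sketch. It is not the code under review: the class name and methods are hypothetical, and the generic type parameters (`String` keys, `byte[]` values) are an assumption based on the comments in the diff, since the type arguments are missing from the quoted lines.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the naming conventions discussed in the review.
class NamingConventionSketch {

  // Constant (static final): UPPER_SNAKE_CASE.
  // Assumed shape: token -> (KEK ID -> KEK bytes).
  private static final ConcurrentMap<String, ConcurrentMap<String, byte[]>> KEK_MAP_PER_TOKEN =
      new ConcurrentHashMap<>();

  // Instance field: lowerCamelCase, even when the name starts with an acronym.
  private final ConcurrentMap<String, byte[]> kekPerKekID;

  NamingConventionSketch(String token) {
    // Instances created with the same token share one per-token map.
    this.kekPerKekID = KEK_MAP_PER_TOKEN.computeIfAbsent(token, t -> new ConcurrentHashMap<>());
  }

  void put(String kekID, byte[] kek) {
    kekPerKekID.put(kekID, kek);
  }

  byte[] get(String kekID) {
    return kekPerKekID.get(kekID);
  }

  public static void main(String[] args) {
    NamingConventionSketch first = new NamingConventionSketch("token-a");
    NamingConventionSketch second = new NamingConventionSketch("token-a");
    first.put("kek-1", "secret".getBytes(StandardCharsets.UTF_8));
    // The second instance sees the entry because the token is the same.
    System.out.println(second.get("kek-1") != null);
  }
}
```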

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/FileKeyWrapper.java
##
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+