Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-29 Thread Wes McKinney
On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou  wrote:
>
>
> On 25/06/2020 at 00:02, Wes McKinney wrote:
> > hi folks,
> >
> > (cross-posting to dev@arrow and dev@parquet since there are
> > stakeholders in both places)
> >
> > It seems there are still problems at least with the C++ implementation
> > of LZ4 compression in Parquet files
> >
> > https://issues.apache.org/jira/browse/PARQUET-1241
> > https://issues.apache.org/jira/browse/PARQUET-1878
>
> I don't have any particular opinion on how to solve the LZ4 issue, but
> I'd like to mention that LZ4 and ZStandard are the two most efficient
> compression algorithms available, and they span different parts of the
> speed/compression spectrum, so it would be a pity to disable one of them.

That's true. However, I think it's worse to write LZ4-compressed files
that cannot be read by other Parquet implementations (which is what's
happening, as I understand it). If we are indeed shipping something
broken, then we should either fix it or disable it until it can be
fixed.

> Regards
>
> Antoine.
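
For anyone affected in the meantime, a write-side workaround is to choose a
different codec until the LZ4 situation is resolved. A minimal sketch using
parquet-mr's Avro bindings (the schema and output path below are invented
placeholders, not from this thread):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteWithoutLz4 {
  public static void main(String[] args) throws Exception {
    // Placeholder schema for illustration only.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\","
        + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

    // Pick a codec with well-tested cross-implementation behaviour
    // (e.g. ZSTD or GZIP) instead of LZ4 until the framing issue is settled.
    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/example.parquet"))
            .withSchema(schema)
            .withCompressionCodec(CompressionCodecName.ZSTD)
            .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", "row-1");
      writer.write(record);
    }
  }
}
{code}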


[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs

2020-06-29 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148060#comment-17148060 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

samarthjain commented on pull request #671:
URL: https://github.com/apache/parquet-mr/pull/671#issuecomment-651282816


   @nandorKollar, @rdblue, @danielcweeks - if you have cycles, could you please 
take a look at this PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
> -------------------------------------------------------------------
>
> Key: PARQUET-1643
> URL: https://issues.apache.org/jira/browse/PARQUET-1643
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Samarth Jain
>Assignee: Samarth Jain
>Priority: Major
>  Labels: pull-request-available
>
> [~rdblue] pointed me to [https://github.com/airlift/aircompressor], which 
> provides non-native (pure Java) implementations of compression codecs. It 
> claims to be much faster than the native wrappers that Parquet uses. This 
> Jira tracks the work needed to explore using these codecs, gather benchmark 
> results, and make the related changes, including no longer needing to pool 
> compressors and decompressors. Note that this doesn't include SNAPPY, since 
> Parquet already has its own non-hadoopy implementation for it.
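
Aircompressor's codecs expose a simple block API rather than Hadoop's stream
interfaces. A minimal LZ4 round trip, as a sketch (the class and method names
reflect my understanding of aircompressor's Compressor/Decompressor
interfaces, so treat the exact signatures as assumptions):

{code:java}
import io.airlift.compress.lz4.Lz4Compressor;
import io.airlift.compress.lz4.Lz4Decompressor;

import java.nio.charset.StandardCharsets;

public class AirliftLz4RoundTrip {
  public static void main(String[] args) {
    byte[] input = "hello aircompressor".getBytes(StandardCharsets.UTF_8);

    // Pure-Java LZ4: no JNI, so no native library loading or buffer pinning.
    Lz4Compressor compressor = new Lz4Compressor();
    byte[] compressed = new byte[compressor.maxCompressedLength(input.length)];
    int compressedSize =
        compressor.compress(input, 0, input.length, compressed, 0, compressed.length);

    Lz4Decompressor decompressor = new Lz4Decompressor();
    byte[] restored = new byte[input.length];
    decompressor.decompress(compressed, 0, compressedSize, restored, 0, restored.length);

    System.out.printf("%d bytes -> %d bytes compressed%n", input.length, compressedSize);
  }
}
{code}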



--
This message was sent by Atlassian Jira
(v8.3.4#803005)



[jira] [Commented] (PARQUET-1373) Encryption key management tools

2020-06-29 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147824#comment-17147824 ]

ASF GitHub Bot commented on PARQUET-1373:
-----------------------------------------

gszadovszky commented on a change in pull request #615:
URL: https://github.com/apache/parquet-mr/pull/615#discussion_r446146320



##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/KeyMaterial.java
##########
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.crypto.keytools;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
+import org.codehaus.jackson.map.ObjectMapper;
+import org.codehaus.jackson.type.TypeReference;
+
+public class KeyMaterial {
+  static final String KEY_MATERIAL_TYPE_FIELD = "keyMaterialType";
+  static final String KEY_MATERIAL_TYPE = "PKMT1";
+  static final String KEY_MATERIAL_INTERNAL_STORAGE_FIELD = "internalStorage";
+
+  static final String FOOTER_KEY_ID_IN_FILE = "footerKey";
+  static final String COLUMN_KEY_ID_IN_FILE_PREFIX = "columnKey";
+  
+  private static final String IS_FOOTER_KEY_FIELD = "isFooterKey";
+  private static final String DOUBLE_WRAPPING_FIELD = "doubleWrapping";
+  private static final String KMS_INSTANCE_ID_FIELD = "kmsInstanceID";
+  private static final String KMS_INSTANCE_URL_FIELD = "kmsInstanceURL";
+  private static final String MASTER_KEY_ID_FIELD = "masterKeyID";
+  private static final String WRAPPED_DEK_FIELD = "wrappedDEK";
+  private static final String KEK_ID_FIELD = "keyEncryptionKeyID";
+  private static final String WRAPPED_KEK_FIELD = "wrappedKEK";
+
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+
+  private final boolean isFooterKey;
+  private final String kmsInstanceID;
+  private final String kmsInstanceURL;
+  private final String masterKeyID;
+  private final boolean isDoubleWrapped;
+  private final String kekID;
+  private final String encodedWrappedKEK;
+  private final String encodedWrappedDEK;
+
+  private KeyMaterial(boolean isFooterKey, String kmsInstanceID, String kmsInstanceURL, String masterKeyID,
+      boolean isDoubleWrapped, String kekID, String encodedWrappedKEK, String encodedWrappedDEK) {
+    this.isFooterKey = isFooterKey;
+    this.kmsInstanceID = kmsInstanceID;
+    this.kmsInstanceURL = kmsInstanceURL;
+    this.masterKeyID = masterKeyID;
+    this.isDoubleWrapped = isDoubleWrapped;
+    this.kekID = kekID;
+    this.encodedWrappedKEK = encodedWrappedKEK;
+    this.encodedWrappedDEK = encodedWrappedDEK;
+  }
+
+  static KeyMaterial parse(Map<String, String> keyMaterialJson) {
+    boolean isFooterKey = Boolean.valueOf(keyMaterialJson.get(IS_FOOTER_KEY_FIELD));
+    String kmsInstanceID = null;
+    String kmsInstanceURL = null;
+    if (isFooterKey) {
+      kmsInstanceID = keyMaterialJson.get(KMS_INSTANCE_ID_FIELD);
+      kmsInstanceURL = keyMaterialJson.get(KMS_INSTANCE_URL_FIELD);
+    }
+    boolean isDoubleWrapped = Boolean.valueOf(keyMaterialJson.get(DOUBLE_WRAPPING_FIELD));
+    String masterKeyID = keyMaterialJson.get(MASTER_KEY_ID_FIELD);
+    String encodedWrappedDEK = keyMaterialJson.get(WRAPPED_DEK_FIELD);
+    String kekID = null;
+    String encodedWrappedKEK = null;
+    if (isDoubleWrapped) {
+      kekID = keyMaterialJson.get(KEK_ID_FIELD);
+      encodedWrappedKEK = keyMaterialJson.get(WRAPPED_KEK_FIELD);
+    }
+
+    return new KeyMaterial(isFooterKey, kmsInstanceID, kmsInstanceURL, masterKeyID,
+        isDoubleWrapped, kekID, encodedWrappedKEK, encodedWrappedDEK);
+  }
+
+  static KeyMaterial parse(String keyMaterialString) {
+    Map<String, String> keyMaterialJson = null;
+    try {
+      keyMaterialJson = OBJECT_MAPPER.readValue(new StringReader(keyMaterialString),
+          new TypeReference<Map<String, String>>() {});
+    } catch (IOException e) {
+      throw new ParquetCryptoRuntimeException("Failed to parse key metadata " + keyMaterialString, e);
+    }
+    String keyMaterialType = keyMaterialJson.get(KEY_MATERIAL_TYPE_FIELD);
+    if (!KEY_MATERIAL_TYPE.equals(keyMaterialType)) {
+      throw new ParquetCryptoRuntimeException("Wrong key material type: " + keyMaterialType +
+          " vs " + KEY_MATERIAL_TYPE);
+    }
+    return parse(keyMaterialJson);
+  }
+
+  static String createSerialized(boolean isFooterKey, String 
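
To make the serialized shape concrete, here is a hypothetical key-material
document built from the JSON field names in the constants above. Every value
is invented, and the sketch assumes the package-private parse(String) entry
point shown in the diff:

{code:java}
package org.apache.parquet.crypto.keytools;

// Hypothetical usage sketch; not part of the PR under review.
public class KeyMaterialParseExample {
  public static void main(String[] args) {
    // Field names come from KeyMaterial's constants; values are invented.
    String keyMaterialString = "{"
        + "\"keyMaterialType\":\"PKMT1\","
        + "\"isFooterKey\":\"true\","
        + "\"kmsInstanceID\":\"demo-kms\","
        + "\"kmsInstanceURL\":\"https://kms.example.com\","
        + "\"doubleWrapping\":\"true\","
        + "\"masterKeyID\":\"mk-1\","
        + "\"keyEncryptionKeyID\":\"kek-1\","
        + "\"wrappedKEK\":\"AAAA\","
        + "\"wrappedDEK\":\"BBBB\""
        + "}";

    // parse(String) is package-private, hence the package declaration above.
    KeyMaterial material = KeyMaterial.parse(keyMaterialString);
  }
}
{code}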

Announcing ApacheCon @Home 2020

2020-06-29 Thread Rich Bowen

Hi, Apache enthusiast!

(You’re receiving this because you’re subscribed to one or more dev or 
user mailing lists for an Apache Software Foundation project.)


The ApacheCon Planners and the Apache Software Foundation are pleased to 
announce that ApacheCon @Home will be held online, September 29th 
through October 1st, 2020. We’ll be featuring content from dozens of our 
projects, as well as content about community, how Apache works, business 
models around Apache software, the legal aspects of open source, and 
many other topics.


Full details about the event and registration are available at 
https://apachecon.com/acah2020


Due to the confusion around how and where this event was going to be 
held, and in order to open up to presenters from around the world who 
may previously have been unable or unwilling to travel, we’ve reopened 
the Call For Presentations until July 13th. Submit your talks today at 
https://acna2020.jamhosted.net/


We hope to see you at the event!
Rich Bowen, VP Conferences, The Apache Software Foundation


[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field

2020-06-29 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147611#comment-17147611 ]

ASF GitHub Bot commented on PARQUET-1879:
-----------------------------------------

maccamlc commented on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-651009716


   > @maccamlc,
   >
   > The main problem, I think, is that the spec does not say anything about
   > how the thrift objects shall be used. The specification is about the
   > semantics of the schema, and it is described using the parquet schema
   > _language_. But in the file there is no such _language_; we only have
   > [thrift objects](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
   > When the specification says something about the _logical types_ (e.g.
   > `MAP`), it does not say anything about which thrift structure should be
   > used (the converted type
   > [`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L53)
   > or the logical type
   > [`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L324)).
   > We added the new logical type structures in the thrift to support
   > enhanced ways to specify _logical types_ (e.g.
   > [`TimeStampType`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L272)).
   > The idea for backward compatibility was to write the old converted types
   > wherever it makes sense (the semantics of the actual _logical type_ is
   > the same as before) along with the new logical type structures. So,
   > regarding `MAP_KEY_VALUE`, I think we shall write it in the correct place
   > if it was written before (prior to `1.11.0`), since it helps other
   > readers, but we should not expect it to be there.
   >
   > Cheers,
   > Gabor

   Sounds good @gszadovszky. Thanks for the clarification.

   Therefore, depending on any other comments from other reviewers, it seems
   this PR is still ready to merge as-is :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with 
> a Map field
> -------------------------------------------------------------------------------
>
> Key: PARQUET-1879
> URL: https://issues.apache.org/jira/browse/PARQUET-1879
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-format
>Affects Versions: 1.11.0
>Reporter: Matthew McMahon
>Priority: Critical
>
> From my 
> [StackOverflow|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with]
>  in relation to an issue I'm having with getting Snowflake (Cloud DB) to load 
> Parquet files written with version 1.11.0
> 
> The problem only appears when using a map schema field in the Avro schema. 
> For example:
> {code:java}
> {
>   "name": "FeatureAmounts",
>   "type": {
> "type": "map",
> "values": "records.MoneyDecimal"
>   }
> }
> {code}
> When using Parquet-Avro to write the file, a bad Parquet schema ends up with, 
> for example
> {code:java}
> message record.ResponseRecord {
>   required binary GroupId (STRING);
>   required int64 EntryTime (TIMESTAMP(MILLIS,true));
>   required int64 HandlingDuration;
>   required binary Id (STRING);
>   optional binary ResponseId (STRING);
>   required binary RequestId (STRING);
>   optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
>   required group FeatureAmounts (MAP) {
> repeated group map (MAP_KEY_VALUE) {
>   required binary key (STRING);
>   required fixed_len_byte_array(12) value (DECIMAL(28,15));
> }
>   }
> }
> {code}
> From the great answer to my StackOverflow, it seems the issue is that 
> Parquet-Avro 1.11.0 is still using the legacy MAP_KEY_VALUE converted type, 
> which has no logical type equivalent. From the comment on 
> [LogicalTypeAnnotation|https://github.com/apache/parquet-mr/blob/84c954d8a4feef2d9bdad7a236a7268ef71a1c25/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L904]
> {code:java}
> // This logical type annotation is implemented to support backward 
> compatibility with ConvertedType.
>   // The new logical type representation in parquet-format doesn't have any 
> key-value type,
>   // thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
> {code}
> However, it seems this is being written with the latest 1.11.0, which then 
> causes Apache Arrow to fail with
> {code:java}
> Logical type Null can not be applied to group node
> {code}
> As it appears 


[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field

2020-06-29 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147599#comment-17147599 ]

ASF GitHub Bot commented on PARQUET-1879:
-----------------------------------------

gszadovszky commented on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-650992678


   @maccamlc,

   The main problem, I think, is that the spec does not say anything about
   how the thrift objects shall be used. The specification is about the
   semantics of the schema, and it is described using the parquet schema
   _language_. But in the file there is no such _language_; we only have
   [thrift objects](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
   When the specification says something about the _logical types_ (e.g.
   `MAP`), it does not say anything about which thrift structure should be
   used (the converted type
   [`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L53)
   or the logical type
   [`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L324)).
   We added the new logical type structures in the thrift to support
   enhanced ways to specify _logical types_ (e.g.
   [`TimeStampType`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L272)).
   The idea for backward compatibility was to write the old converted types
   wherever it makes sense (the semantics of the actual _logical type_ is
   the same as before) along with the new logical type structures. So,
   regarding `MAP_KEY_VALUE`, I think we shall write it in the correct place
   if it was written before (prior to `1.11.0`), since it helps other
   readers, but we should not expect it to be there.

   Cheers,
   Gabor



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
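
To make the compatibility point concrete, a rough sketch of how the
problematic schema looks to parquet-mr 1.11. This assumes MessageTypeParser
accepts the converted-type spellings used here; it is not code from this PR:

{code:java}
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class MapKeyValueCheck {
  public static void main(String[] args) {
    // Schema text equivalent to the FeatureAmounts map from PARQUET-1879,
    // spelled in the parser's converted-type vocabulary.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message ResponseRecord {\n"
        + "  required group FeatureAmounts (MAP) {\n"
        + "    repeated group map (MAP_KEY_VALUE) {\n"
        + "      required binary key (UTF8);\n"
        + "      required fixed_len_byte_array(12) value (DECIMAL(28,15));\n"
        + "    }\n"
        + "  }\n"
        + "}");

    GroupType featureAmounts = schema.getType("FeatureAmounts").asGroupType();
    // The outer group carries a proper MAP annotation...
    System.out.println(featureAmounts.getLogicalTypeAnnotation());
    // ...while the inner MAP_KEY_VALUE converted type has no logical-type
    // equivalent, which is what newer readers such as Arrow stumble over.
    System.out.println(
        featureAmounts.getType("map").asGroupType().getLogicalTypeAnnotation());
  }
}
{code}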


> Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with 
> a Map field
> -------------------------------------------------------------------------------
>
> Key: PARQUET-1879
> URL: https://issues.apache.org/jira/browse/PARQUET-1879
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-format
>Affects Versions: 1.11.0
>Reporter: Matthew McMahon
>Priority: Critical
>
> From my 
> [StackOverflow|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with]
>  in relation to an issue I'm having with getting Snowflake (Cloud DB) to load 
> Parquet files written with version 1.11.0
> 
> The problem only appears when using a map schema field in the Avro schema. 
> For example:
> {code:java}
> {
>   "name": "FeatureAmounts",
>   "type": {
> "type": "map",
> "values": "records.MoneyDecimal"
>   }
> }
> {code}
> When using Parquet-Avro to write the file, a bad Parquet schema ends up with, 
> for example
> {code:java}
> message record.ResponseRecord {
>   required binary GroupId (STRING);
>   required int64 EntryTime (TIMESTAMP(MILLIS,true));
>   required int64 HandlingDuration;
>   required binary Id (STRING);
>   optional binary ResponseId (STRING);
>   required binary RequestId (STRING);
>   optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
>   required group FeatureAmounts (MAP) {
> repeated group map (MAP_KEY_VALUE) {
>   required binary key (STRING);
>   required fixed_len_byte_array(12) value (DECIMAL(28,15));
> }
>   }
> }
> {code}
> From the great answer to my StackOverflow, it seems the issue is that 
> Parquet-Avro 1.11.0 is still using the legacy MAP_KEY_VALUE converted type, 
> which has no logical type equivalent. From the comment on 
> [LogicalTypeAnnotation|https://github.com/apache/parquet-mr/blob/84c954d8a4feef2d9bdad7a236a7268ef71a1c25/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L904]
> {code:java}
> // This logical type annotation is implemented to support backward 
> compatibility with ConvertedType.
>   // The new logical type representation in parquet-format doesn't have any 
> key-value type,
>   // thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
> {code}
> However, it seems this is being written with the latest 1.11.0, which then 
> causes Apache Arrow to fail with
> {code:java}
> Logical type Null can not be applied to group node
> {code}
> As it appears that 
> [Arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L629-L632]
>  only looks for the new logical types of Map or List, this causes an 
> error.
> I have seen in 


[jira] [Commented] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs

2020-06-29 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147587#comment-17147587 ]

ASF GitHub Bot commented on PARQUET-1643:
-----------------------------------------

samarthjain commented on pull request #671:
URL: https://github.com/apache/parquet-mr/pull/671#issuecomment-650971182


   @dbtsai
   > Since airlift is a pure Java implementation, what are the performance
   > implications for zstd? I saw there is a benchmark for GZIP, but I don't
   > see benchmarks for the other codecs.

   It looks like the zstd Airlift implementation doesn't implement the Hadoop
   APIs. It can be integrated within Parquet, but that will take some work,
   definitely worthy of another PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
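
For context on the gap described above: aircompressor's zstd classes offer
the same block-level API as its other codecs, so the missing piece is a
Hadoop-style stream codec around it, which Parquet's codec factory expects.
A rough sketch of the block API (class and method names are assumptions
drawn from aircompressor's Compressor/Decompressor interfaces):

{code:java}
import io.airlift.compress.zstd.ZstdCompressor;
import io.airlift.compress.zstd.ZstdDecompressor;

import java.nio.charset.StandardCharsets;

public class AirliftZstdBlockSketch {
  public static void main(String[] args) {
    byte[] input = "zstd via pure Java".getBytes(StandardCharsets.UTF_8);

    // Block API only: compress a whole buffer in one call. Parquet would
    // still need a Hadoop-style stream codec wrapped around this.
    ZstdCompressor compressor = new ZstdCompressor();
    byte[] compressed = new byte[compressor.maxCompressedLength(input.length)];
    int compressedSize =
        compressor.compress(input, 0, input.length, compressed, 0, compressed.length);

    ZstdDecompressor decompressor = new ZstdDecompressor();
    byte[] restored = new byte[input.length];
    decompressor.decompress(compressed, 0, compressedSize, restored, 0, restored.length);
  }
}
{code}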


> Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
> -------------------------------------------------------------------
>
> Key: PARQUET-1643
> URL: https://issues.apache.org/jira/browse/PARQUET-1643
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Samarth Jain
>Assignee: Samarth Jain
>Priority: Major
>  Labels: pull-request-available
>
> [~rdblue] pointed me to [https://github.com/airlift/aircompressor], which 
> provides non-native (pure Java) implementations of compression codecs. It 
> claims to be much faster than the native wrappers that Parquet uses. This 
> Jira tracks the work needed to explore using these codecs, gather benchmark 
> results, and make the related changes, including no longer needing to pool 
> compressors and decompressors. Note that this doesn't include SNAPPY, since 
> Parquet already has its own non-hadoopy implementation for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

