[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540356#comment-16540356
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva closed pull request #1370: DRILL-5797: Use Parquet new reader 
in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
index dc09ce1b695..47f2e18a9a7 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
@@ -25,22 +25,21 @@
 import org.apache.drill.exec.ops.ExecutorFragmentContext;
 import org.apache.drill.exec.ops.OperatorContext;
 import org.apache.drill.exec.physical.impl.ScanBatch;
+import org.apache.drill.exec.proto.helper.QueryIdHelper;
 import org.apache.drill.exec.server.options.OptionManager;
 import org.apache.drill.exec.store.ColumnExplorer;
 import org.apache.drill.exec.store.RecordReader;
 import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
 import org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader;
 import org.apache.drill.exec.store.parquet2.DrillParquetReader;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.parquet.ParquetReadOptions;
-import org.apache.parquet.column.ColumnDescriptor;
 import org.apache.parquet.hadoop.CodecFactory;
 import org.apache.parquet.hadoop.ParquetFileReader;
 import org.apache.parquet.hadoop.metadata.ParquetMetadata;
 import org.apache.parquet.hadoop.util.HadoopInputFile;
-import org.apache.parquet.schema.MessageType;
-import org.apache.parquet.schema.Type;
 
 import java.io.IOException;
 import java.util.ArrayList;
@@ -107,7 +106,10 @@ protected ScanBatch getBatch(ExecutorFragmentContext 
context, AbstractParquetRow
   ParquetReaderUtility.detectCorruptDates(footer, 
rowGroupScan.getColumns(), autoCorrectCorruptDates);
 logger.debug("Contains corrupt dates: {}", containsCorruptDates);
 
-    if (!context.getOptions().getBoolean(ExecConstants.PARQUET_NEW_RECORD_READER) && !isComplex(footer)) {
+    if (!context.getOptions().getBoolean(ExecConstants.PARQUET_NEW_RECORD_READER)
+        && !ParquetReaderUtility.containsComplexColumn(footer, rowGroupScan.getColumns())) {
+      logger.debug("Query {} qualifies for new Parquet reader",
+          QueryIdHelper.getQueryId(oContext.getFragmentContext().getHandle().getQueryId()));
   readers.add(new ParquetRecordReader(context,
   rowGroup.getPath(),
   rowGroup.getRowGroupIndex(),
@@ -118,6 +120,8 @@ protected ScanBatch getBatch(ExecutorFragmentContext 
context, AbstractParquetRow
   rowGroupScan.getColumns(),
   containsCorruptDates));
 } else {
+      logger.debug("Query {} doesn't qualify for new reader, using old one",
+          QueryIdHelper.getQueryId(oContext.getFragmentContext().getHandle().getQueryId()));
   readers.add(new DrillParquetReader(context,
   footer,
   rowGroup,
@@ -161,22 +165,6 @@ private ParquetMetadata readFooter(Configuration conf, 
String path) throws IOExc
 }
   }
 
-  private boolean isComplex(ParquetMetadata footer) {
-MessageType schema = footer.getFileMetaData().getSchema();
-
-for (Type type : schema.getFields()) {
-  if (!type.isPrimitive()) {
-return true;
-  }
-}
-for (ColumnDescriptor col : schema.getColumns()) {
-  if (col.getMaxRepetitionLevel() > 0) {
-return true;
-  }
-}
-return false;
-  }
-
   /**
* Helper class responsible for creating and managing DrillFileSystem.
*/
@@ -190,5 +178,4 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
-
 }
diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
index a7f78fb40c6..6960b35a031 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
+++ 

[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540171#comment-16540171
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201714214
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestComplexColumnInSchema.java
 ##
 @@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+
+/**
+ * This test checks correctness of complex column detection in the Parquet 
file schema.
+ */
+public class TestComplexColumnInSchema {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex_special_cases.parquet";
+  private static ParquetMetadata footer;
+  List<SchemaPath> columns;
 
 Review comment:
   This and all other outstanding comments have been addressed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based 
> on what types of columns are in the file. But the columns that are actually read 
> by the query are not taken into account. We can increase a little the number of 
> cases where the optimized reader is used by checking whether the projected 
> columns are simple or not.
> This is an optimization while waiting for the fast parquet reader to handle 
> complex structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
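As a rough illustration of the projected-column check proposed in this issue, the decision can be modeled with a simplified schema. All class and method names below are hypothetical stand-ins, not Drill's or parquet-mr's actual API:

```java
import java.util.List;
import java.util.Map;

// Minimal sketch of the idea in this issue: decide per query whether the
// optimized (fast) reader applies by inspecting only the *projected*
// columns, not every column in the file. The schema model is a stand-in
// for Parquet metadata, not the real parquet-mr API.
public class ReaderSelectionSketch {

  /** Simplified column metadata: nested or repeated columns are "complex". */
  public static class ColumnInfo {
    final boolean primitive;
    final int maxRepetitionLevel;

    public ColumnInfo(boolean primitive, int maxRepetitionLevel) {
      this.primitive = primitive;
      this.maxRepetitionLevel = maxRepetitionLevel;
    }

    boolean isComplex() {
      // Group (non-primitive) types and repeated fields disqualify the
      // fast reader, which only handles flat columns.
      return !primitive || maxRepetitionLevel > 0;
    }
  }

  /** True when no projected column is complex, so the fast reader applies. */
  public static boolean qualifiesForFastReader(Map<String, ColumnInfo> schema,
                                               List<String> projected) {
    for (String col : projected) {
      ColumnInfo info = schema.get(col);
      if (info != null && info.isComplex()) {
        return false;
      }
    }
    return true; // every projected column is simple (or absent from the file)
  }
}
```

With this rule, a file containing a complex column no longer forces the slower reader as long as the query projects only simple columns.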


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540051#comment-16540051
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201675603
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertNull;
+import static org.junit.Assert.assertTrue;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setup() throws IOException {
+Configuration conf = new Configuration();
+
+footer = ParquetFileReader.readFooter(conf, new Path(path));
+assertNotNull("footer must be not null", footer);
 
 Review comment:
   You are checking Footer for null, here but not in prev test.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540052#comment-16540052
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201675695
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertNull;
+import static org.junit.Assert.assertTrue;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setup() throws IOException {
+Configuration conf = new Configuration();
+
+footer = ParquetFileReader.readFooter(conf, new Path(path));
+assertNotNull("footer must be not null", footer);
+  }
+
+  @Test
+  public void testSchemaElementsMap() {
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+assertTrue(schemaElements.size() == 14);
+
+SchemaElement schemaElement = 
schemaElements.get("`marketing_info`.`camp_id`");
+assertNotNull("Schema element must be not null", schemaElement);
+assertTrue("camp_id".equals(schemaElement.getName()));
+
+schemaElement = schemaElements.get("`marketing_info`");
+assertNotNull("Schema element must be not null", schemaElement);
+assertEquals("Schema element name match lookup key", 
schemaElement.getName(), "marketing_info");
+  }
+
+  @Test
+  public void testColumnDescriptorMap() {
+    Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+assertTrue(colDescMap.size() == 11);
 
 Review comment:
   assertEquals...


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540055#comment-16540055
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201676203
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestComplexColumnInSchema.java
 ##
 @@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+
+/**
+ * This test checks correctness of complex column detection in the Parquet 
file schema.
+ */
+public class TestComplexColumnInSchema {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex_special_cases.parquet";
 
 Review comment:
   You can add description of the schema and data in the comment.
   Example: 
https://github.com/apache/drill/blob/9a6cb59b9b7a5b127e5f60309ce2f506ede9652a/exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetFilterPushDownForComplexTypes.java


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540057#comment-16540057
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201677000
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -525,4 +599,57 @@ public static long getDateTimeValueFromBinary(Binary 
binaryTimeStampValue, boole
 }
   }
 
+  /**
+   * Check whether any of columns in the given list is either nested or 
repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+MessageType schema = footer.getFileMetaData().getSchema();
+
+if (Utilities.isStarQuery(columns)) {
+  for (Type type : schema.getFields()) {
+if (!type.isPrimitive()) {
+  return true;
+}
+  }
+  for (ColumnDescriptor col : schema.getColumns()) {
+if (col.getMaxRepetitionLevel() > 0) {
+  return true;
+}
+  }
+  return false;
+} else {
+      Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+      Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+  for (SchemaPath schemaPath : columns) {
+// Schema path which is non-leaf is complex column
+if (!schemaPath.isLeaf()) {
+  logger.debug("rowGroupScan contains complex column: {}", 
schemaPath.getUnIndexed().toString());
+  return true;
+}
+
+// following column descriptor lookup failure may mean two cases, 
depending on subsequent SchemaElement lookup:
+// 1. success: queried column is complex, i.e. GroupType
+// 2. failure: queried column is not in schema and thus is non-complex
+ColumnDescriptor column = 
colDescMap.get(schemaPath.getUnIndexed().toString().toLowerCase());
+
+if (column == null) {
+  SchemaElement schemaElement = 
schemaElements.get(schemaPath.getUnIndexed().toString().toLowerCase());
+  if (schemaElement != null) {
+return true;
+  }
+} else {
+  if (column.getMaxRepetitionLevel() > 0) {
+logger.debug("rowGroupScan contains repetitive column: {}", 
schemaPath.getUnIndexed().toString());
 
 Review comment:
   trace?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
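The column-descriptor/schema-element double lookup in the hunk above distinguishes a projected group column from a column that simply isn't in the file. A hedged sketch of that rule, with plain maps standing in for the real ColumnDescriptor and SchemaElement mappings:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the two-step lookup in containsComplexColumn: a projected path
// missing from the leaf-column map is complex only if the schema still
// knows it (it is a group type); otherwise it is simply absent from the
// file and therefore harmless. Types here are illustrative stand-ins.
public class ComplexLookupSketch {

  public static boolean isComplexProjection(String path,
      Map<String, Integer> leafMaxRepetition,  // leaf column path -> max repetition level
      Set<String> schemaPaths) {               // every path present in the schema
    Integer rep = leafMaxRepetition.get(path);
    if (rep == null) {
      // Case 1: path names a known group type -> complex.
      // Case 2: path is not in the schema at all -> not complex.
      return schemaPaths.contains(path);
    }
    return rep > 0; // a repeated leaf also counts as complex
  }
}
```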




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540058#comment-16540058
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201677035
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -525,4 +599,57 @@ public static long getDateTimeValueFromBinary(Binary 
binaryTimeStampValue, boole
 }
   }
 
+  /**
+   * Check whether any of columns in the given list is either nested or 
repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+MessageType schema = footer.getFileMetaData().getSchema();
+
+if (Utilities.isStarQuery(columns)) {
+  for (Type type : schema.getFields()) {
+if (!type.isPrimitive()) {
+  return true;
+}
+  }
+  for (ColumnDescriptor col : schema.getColumns()) {
+if (col.getMaxRepetitionLevel() > 0) {
+  return true;
+}
+  }
+  return false;
+} else {
+      Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+      Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+  for (SchemaPath schemaPath : columns) {
+// Schema path which is non-leaf is complex column
+if (!schemaPath.isLeaf()) {
+  logger.debug("rowGroupScan contains complex column: {}", 
schemaPath.getUnIndexed().toString());
 
 Review comment:
   trace?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540053#comment-16540053
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201676485
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +142,88 @@ public static int getIntFromLEBytes(byte[] input, int 
start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement 
objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
  public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-    HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+    Map<String, SchemaElement> schemaElements = new HashMap<>();
 FileMetaData fileMetaData = new 
ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, 
footer);
-for (SchemaElement se : fileMetaData.getSchema()) {
-  schemaElements.put(se.getName(), se);
+
+    Iterator<SchemaElement> iter = fileMetaData.getSchema().iterator();
+
+// First element in collection is default `root` element. We skip it to 
maintain key in `a` format instead of `root`.`a`,
+// and thus to avoid the need to cut it out again when comparing with 
SchemaPath string representation
+if (iter.hasNext()) {
+  iter.next();
+}
+while (iter.hasNext()) {
+  addSchemaElementMapping(iter, new StringBuilder(), schemaElements);
 }
 return schemaElements;
   }
 
+  /**
+   * Populate full path to SchemaElement map by recursively traversing schema 
elements referenced by the given iterator
+   *
+   * @param iter file schema values iterator
+   * @param path parent schema element path
+   * @param schemaElements schema elements map to insert next iterator element 
into
+   */
+  private static void addSchemaElementMapping(Iterator<SchemaElement> iter, StringBuilder path,
+      Map<String, SchemaElement> schemaElements) {
+SchemaElement schemaElement = iter.next();
+path.append('`').append(schemaElement.getName().toLowerCase()).append('`');
+schemaElements.put(path.toString(), schemaElement);
+
+int remainingChildren = schemaElement.getNum_children();
 
 Review comment:
   Please add comment why remaining children are important in while loop.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
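The getColNameToSchemaElementMapping/addSchemaElementMapping pair quoted above walks Parquet's flattened, pre-order schema list, where each element records only its own name and child count, and the child count is what tells the traversal how many of the following elements belong under the current one. A self-contained sketch of that traversal, using a stand-in Element type rather than the real parquet-format SchemaElement:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of mapping a flattened, pre-order schema list to full paths in
// the `a`.`b`.`c` form. Element is a hypothetical stand-in that keeps
// just the two fields the traversal needs: name and child count.
public class SchemaPathSketch {

  public static class Element {
    final String name;
    final int numChildren;

    public Element(String name, int numChildren) {
      this.name = name;
      this.numChildren = numChildren;
    }
  }

  public static Map<String, Element> buildPathMap(List<Element> flat) {
    Map<String, Element> out = new LinkedHashMap<>();
    Iterator<Element> it = flat.iterator();
    if (it.hasNext()) {
      it.next(); // skip the synthetic root so keys start at `a`, not `root`.`a`
    }
    while (it.hasNext()) {
      add(it, new StringBuilder(), out);
    }
    return out;
  }

  // Consume one element and, recursively, all of its children. The child
  // count tells us how many subsequent elements are nested under this one,
  // so each recursive call extends the parent's path with a '.' separator.
  private static void add(Iterator<Element> it, StringBuilder path, Map<String, Element> out) {
    Element e = it.next();
    path.append('`').append(e.name.toLowerCase()).append('`');
    out.put(path.toString(), e);
    for (int i = 0; i < e.numChildren; i++) {
      add(it, new StringBuilder(path).append('.'), out);
    }
  }
}
```

For a flat list root(2), a(0), b(1), c(0) this yields keys `a`, `b`, and `b`.`c`, matching the lookup-key format used by the column checks above.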




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540056#comment-16540056
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201676752
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -107,7 +105,9 @@ protected ScanBatch getBatch(ExecutorFragmentContext 
context, AbstractParquetRow
   ParquetReaderUtility.detectCorruptDates(footer, 
rowGroupScan.getColumns(), autoCorrectCorruptDates);
 logger.debug("Contains corrupt dates: {}", containsCorruptDates);
 
-if 
(!context.getOptions().getBoolean(ExecConstants.PARQUET_NEW_RECORD_READER) && 
!isComplex(footer)) {
+if 
(!context.getOptions().getBoolean(ExecConstants.PARQUET_NEW_RECORD_READER)
+&& !ParquetReaderUtility.containsComplexColumn(footer, 
rowGroupScan.getColumns())) {
+  logger.debug("Query qualifies for new Parquet reader");
 
 Review comment:
   Which query? We might add query id or something to help identification in 
the log.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540054#comment-16540054
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201675762
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertNull;
+import static org.junit.Assert.assertTrue;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setup() throws IOException {
+Configuration conf = new Configuration();
+
+footer = ParquetFileReader.readFooter(conf, new Path(path));
+assertNotNull("footer must be not null", footer);
+  }
+
+  @Test
+  public void testSchemaElementsMap() {
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+    assertTrue(schemaElements.size() == 14);
 
 Review comment:
   assertEquals...
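
The reviewer's suggestion is to use `assertEquals`, which on failure reports both expected and actual values instead of a bare "expected true". A minimal stand-in (not JUnit itself) showing why that message is more useful:

```java
public class AssertSketch {

  // Simplified version of JUnit's assertEquals: the thrown error carries
  // both operands, unlike assertTrue(a == b) which loses them.
  static void assertEquals(Object expected, Object actual) {
    if (!expected.equals(actual)) {
      throw new AssertionError("expected:<" + expected + "> but was:<" + actual + ">");
    }
  }
}
```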




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540050#comment-16540050
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201675381
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestComplexColumnInSchema.java
 ##
 @@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+
+/**
+ * This test checks correctness of complex column detection in the Parquet 
file schema.
+ */
+public class TestComplexColumnInSchema {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex_special_cases.parquet";
+  private static ParquetMetadata footer;
+  List<SchemaPath> columns;
 
 Review comment:
   No need to have it as a class variable. Let each method have its own columns variable.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540025#comment-16540025
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201672191
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity in new reader use cases. This test
+ * checks the correctness of the function used to decide whether the new reader is used.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+conf = new Configuration();
+
+try {
+  footer = ParquetFileReader.readFooter(conf, new Path(path));
+} catch (IOException ioe) {
+  fail("Could not read Parquet file '" + path + "', error message: " + 
ioe.getMessage()
+  + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+  throw(ioe);
+}
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
 
 Review comment:
   Done, schema is:
   ```
   file schema:   root
   

   id:OPTIONAL INT64 R:0 D:1
   a: OPTIONAL BINARY O:UTF8 R:0 D:1
   repeated:  REPEATED INT64 R:1 D:1
   VariableCase:  OPTIONAL BINARY O:UTF8 R:0 D:1
   nested:OPTIONAL F:3
   .id:   OPTIONAL INT64 R:0 D:2
   .repeated: REPEATED INT64 R:1 D:2
   .VaRiAbLeCaSe: OPTIONAL BINARY O:UTF8 R:0 D:2
   ```
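
As a hedged illustration of how such a schema drives the complex-column check: a projected path is complex when any element on it is repeated or is a group with children. The `Element` record and the hard-coded map below are stand-ins mimicking the printed footer, not Drill's or parquet-mr's API:

```java
import java.util.Map;

public class ComplexDetectSketch {

  // path -> (isRepeated, childCount), mirroring the schema dump above
  record Element(boolean repeated, int children) {}

  static final Map<String, Element> SCHEMA = Map.of(
      "id", new Element(false, 0),
      "a", new Element(false, 0),
      "repeated", new Element(true, 0),
      "variablecase", new Element(false, 0),
      "nested", new Element(false, 3),          // group with 3 children
      "nested.id", new Element(false, 0),
      "nested.repeated", new Element(true, 0),
      "nested.variablecase", new Element(false, 0));

  static boolean isComplex(String column) {
    // Walk every prefix of the dotted path, case-insensitively: a repeated
    // element or a group anywhere on the path makes the column complex.
    String[] parts = column.toLowerCase().split("\\.");
    StringBuilder prefix = new StringBuilder();
    for (String p : parts) {
      if (prefix.length() > 0) {
        prefix.append('.');
      }
      prefix.append(p);
      Element e = SCHEMA.get(prefix.toString());
      if (e == null) {
        return false;                           // unknown column: treated as simple here
      }
      if (e.repeated() || e.children() > 0) {
        return true;
      }
    }
    return false;
  }
}
```

Under this rule, `id` and `VariableCase` stay simple, while `repeated` and anything under `nested` qualify as complex.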




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539986#comment-16539986
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201663784
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of the queried columns is nested or repeated.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
 
 Review comment:
   Method moved to ParquetReaderUtility as per discussion below.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539985#comment-16539985
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201663590
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int 
start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement 
objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-    HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+    Map<String, SchemaElement> schemaElements = new HashMap<>();
     FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-for (SchemaElement se : fileMetaData.getSchema()) {
-  schemaElements.put(se.getName(), se);
+
+Iterator iter = fileMetaData.getSchema().iterator();
+
+// skip first default 'root' element
+if (iter.hasNext()) {
+  iter.next();
+}
+while (iter.hasNext()) {
+  addSchemaElementMapping(iter, new StringBuilder(), schemaElements);
 }
 return schemaElements;
   }
 
+  /**
+   * Populate full path to SchemaElement map by recursively traversing schema 
elements referenced by the given iterator
+   *
+   * @param iter file schema values iterator
+   * @param path parent schema element path
+   * @param schemaElements schema elements map to insert next iterator element 
into
+   */
+  private static void addSchemaElementMapping(Iterator iter, StringBuilder 
path,
+  Map schemaElements) {
+SchemaElement se = (SchemaElement)iter.next();
 
 Review comment:
   Fixed both
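
A minimal sketch of the traversal this hunk implements, assuming a stand-in `SimpleElement` for parquet-format's `SchemaElement`: Parquet stores the schema as a flattened pre-order list where each group element carries its child count, so the full backquoted `` `a`.`b`.`c` `` keys can be rebuilt recursively:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaPathSketch {

  // Stand-in for parquet-format's SchemaElement: name plus child count.
  record SimpleElement(String name, int numChildren) {}

  static Map<String, SimpleElement> buildMapping(List<SimpleElement> flattened) {
    Map<String, SimpleElement> out = new LinkedHashMap<>();
    Iterator<SimpleElement> iter = flattened.iterator();
    if (iter.hasNext()) {
      iter.next(); // skip the synthetic 'root' element
    }
    while (iter.hasNext()) {
      add(iter, "", out);
    }
    return out;
  }

  private static void add(Iterator<SimpleElement> iter, String parentPath, Map<String, SimpleElement> out) {
    SimpleElement se = iter.next();            // assumes a well-formed flattened schema
    String path = parentPath + "`" + se.name() + "`";
    out.put(path, se);
    for (int i = 0; i < se.numChildren(); i++) {
      add(iter, path + ".", out);              // children follow immediately in pre-order
    }
  }
}
```

For a schema `root(id, nested(x))` this yields the keys `` `id` ``, `` `nested` ``, and `` `nested`.`x` ``.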




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539983#comment-16539983
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201663491
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+conf = new Configuration();
+
+try {
+  footer = ParquetFileReader.readFooter(conf, new Path(path));
+} catch (IOException ioe) {
+  fail("Could not read Parquet file '" + path + "', error message: " + 
ioe.getMessage()
 
 Review comment:
   Exception handling removed as per above comment.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539984#comment-16539984
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201663521
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int 
start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement 
objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-    HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+    Map<String, SchemaElement> schemaElements = new HashMap<>();
     FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-for (SchemaElement se : fileMetaData.getSchema()) {
-  schemaElements.put(se.getName(), se);
+
+Iterator iter = fileMetaData.getSchema().iterator();
+
+// skip first default 'root' element
+if (iter.hasNext()) {
 
 Review comment:
   Done




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539979#comment-16539979
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201663246
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+conf = new Configuration();
+
+try {
+  footer = ParquetFileReader.readFooter(conf, new Path(path));
+} catch (IOException ioe) {
+  fail("Could not read Parquet file '" + path + "', error message: " + 
ioe.getMessage()
+  + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+  throw(ioe);
+}
+  }
+
+  @Test
+  public void testSchemaElementsMap() {
+assertTrue(footer != null);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+    assertTrue(schemaElements.size() == 14);
+
+SchemaElement se = schemaElements.get("`marketing_info`.`camp_id`");
+assertTrue(se != null);
+assertTrue("camp_id".equals(se.getName()));
 
 Review comment:
   Done




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539981#comment-16539981
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201663328
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
 
 Review comment:
   Fixed




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539978#comment-16539978
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201663214
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testSchemaElementsMap() {
+    assertTrue(footer != null);
 
 Review comment:
   Removed the check and let the failure be handled via the exception thrown from the setup method.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. But the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases where the
> optimized reader is used by checking whether the projected columns are simple or
> not.
> This is an optimization while waiting for the fast parquet reader to handle complex
> structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539976#comment-16539976
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201662880
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test is aimed at
+ * checking the correctness of the function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
+    List<SchemaPath> caseOldReader1 = new ArrayList<>();
+    List<SchemaPath> caseOldReader2 = new ArrayList<>();
+    List<SchemaPath> caseOldReader3 = new ArrayList<>();
+    List<SchemaPath> caseNewReader1 = new ArrayList<>();
+    List<SchemaPath> caseNewReader2 = new ArrayList<>();
+    List<SchemaPath> caseNewReader3 = new ArrayList<>();
+
+    SchemaPath topNestedPath = SchemaPath.getCompoundPath("marketing_info");
+    SchemaPath nestedColumnPath = SchemaPath.getCompoundPath("marketing_info", "camp_id");
+    SchemaPath topPath1 = SchemaPath.getCompoundPath("trans_id");
+    SchemaPath topPath2 = SchemaPath.getCompoundPath("amount");
+    SchemaPath nonExistentPath = SchemaPath.getCompoundPath("nonexistent");
+
+    caseOldReader1.add(topNestedPath);
+    caseOldReader2.add(nestedColumnPath);
+    caseOldReader3.add(topPath1);
+    caseOldReader3.add(nestedColumnPath);
+
+    caseNewReader1.add(topPath1);
+    caseNewReader2.add(topPath1);
+    caseNewReader2.add(topPath2);
+
+    assertTrue("Complex column not detected",
+        AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader1));
+    assertTrue("Complex column not detected",
+        AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader2));
+    assertTrue("Complex column not detected",
+        AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader3));
+
+    assertFalse("No complex column, but complex column is detected",
+        AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseNewReader1));
 
 Review comment:
   `containsComplexColumn` method moved to ParquetReaderUtility class.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: 

[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539973#comment-16539973
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201662542
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test is aimed at
+ * checking the correctness of the function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
 
 Review comment:
   Removed exception handling as the failure reason would indeed be obvious.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. But the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases where the
> optimized reader is used by checking whether the projected columns are simple or
> not.
> This is an optimization while waiting for the fast parquet reader to handle complex
> structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539969#comment-16539969
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201662352
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test is aimed at
+ * checking the correctness of the function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
 
 Review comment:
   Class renamed to `TestComplexColumnInSchema`


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. But the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases where the
> optimized reader is used by checking whether the projected columns are simple or
> not.
> This is an optimization while waiting for the fast parquet reader to handle complex
> structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539964#comment-16539964
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201661923
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of queried columns is nested or repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+    Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+    for (SchemaPath schemaPath : columns) {
+      // non-nested column check: full path must be equal to root segment path
+      if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
+          .equalsIgnoreCase(schemaPath.getRootSegment().getPath())) {
+        logger.debug("Forcing 'old' reader due to nested column: {}", schemaPath.getUnIndexed().toString());
+        return true;
+      }
+
+      // following column descriptor lookup failure may mean two cases, depending on subsequent SchemaElement lookup:
+      // 1. success: queried column is complex => use old reader
+      // 2. failure: queried column is not in schema => use new reader
+      ColumnDescriptor column = colDescMap.get(schemaPath.getUnIndexed().toString().toLowerCase());
+
+      if (column == null) {
+        SchemaElement se = schemaElements.get(schemaPath.getUnIndexed().toString().toLowerCase());
 
 Review comment:
   Fixed


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. But the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases where the
> optimized reader is used by checking whether the projected columns are simple or
> not.
> This is an optimization while waiting for the fast parquet reader to handle complex
> structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
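The two-step lookup commented in the diff above (column descriptor first, then schema element) can be shown in isolation. This is a sketch under assumptions: in Drill the two maps come from `ParquetReaderUtility`; here they are plain string sets so the classification logic stands alone.

```java
import java.util.Set;

// Standalone sketch of the two-map lookup from containsComplexColumn():
// - descriptor found            -> plain (simple) column
// - no descriptor, element found -> complex column  => old reader
// - neither found                -> column not in schema => new reader still OK
public class ColumnLookupSketch {

  public enum Kind { SIMPLE, COMPLEX, NOT_IN_SCHEMA }

  // descriptors: columns that have a leaf ColumnDescriptor (simple columns)
  // schemaElements: every element named in the file schema, complex ones included
  public static Kind classify(Set<String> descriptors, Set<String> schemaElements, String column) {
    String key = column.toLowerCase();   // lookups are case-insensitive, as in the diff
    if (descriptors.contains(key)) {
      return Kind.SIMPLE;
    }
    if (schemaElements.contains(key)) {
      return Kind.COMPLEX;
    }
    return Kind.NOT_IN_SCHEMA;
  }
}
```

The column and map names below are taken from the test data in this thread (`trans_id`, `amount`, `marketing_info`).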


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539963#comment-16539963
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201661782
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of queried columns is nested or repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+    Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+    for (SchemaPath schemaPath : columns) {
+      // non-nested column check: full path must be equal to root segment path
+      if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
+          .equalsIgnoreCase(schemaPath.getRootSegment().getPath())) {
+        logger.debug("Forcing 'old' reader due to nested column: {}", schemaPath.getUnIndexed().toString());
+        return true;
+      }
+
+      // following column descriptor lookup failure may mean two cases, depending on subsequent SchemaElement lookup:
+      // 1. success: queried column is complex => use old reader
+      // 2. failure: queried column is not in schema => use new reader
+      ColumnDescriptor column = colDescMap.get(schemaPath.getUnIndexed().toString().toLowerCase());
+
+      if (column == null) {
+        SchemaElement se = schemaElements.get(schemaPath.getUnIndexed().toString().toLowerCase());
+        if (se != null) {
+          return true;
+        }
+      } else {
+        if (column.getMaxRepetitionLevel() > 0) {
+          logger.debug("Forcing 'old' reader due to repetitive column: {}", schemaPath.getUnIndexed().toString());
 
 Review comment:
   Fixed


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. But the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases where the
> optimized reader is used by checking whether the projected columns are simple or
> not.
> This is an optimization while waiting for the fast parquet reader to handle complex
> structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539960#comment-16539960
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201661554
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of queried columns is nested or repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+    Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+    for (SchemaPath schemaPath : columns) {
+      // non-nested column check: full path must be equal to root segment path
+      if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
+          .equalsIgnoreCase(schemaPath.getRootSegment().getPath())) {
+        logger.debug("Forcing 'old' reader due to nested column: {}", schemaPath.getUnIndexed().toString());
 
 Review comment:
   Fixed


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. But the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases where the
> optimized reader is used by checking whether the projected columns are simple or
> not.
> This is an optimization while waiting for the fast parquet reader to handle complex
> structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539962#comment-16539962
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201661724
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -161,20 +167,24 @@ private ParquetMetadata readFooter(Configuration conf, 
String path) throws IOExc
 }
   }
 
-  private boolean isComplex(ParquetMetadata footer) {
+  private boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
     MessageType schema = footer.getFileMetaData().getSchema();
 
-    for (Type type : schema.getFields()) {
-      if (!type.isPrimitive()) {
-        return true;
+    if (Utilities.isStarQuery(columns)) {
+      for (Type type : schema.getFields()) {
+        if (!type.isPrimitive()) {
+          return true;
+        }
       }
-    }
-    for (ColumnDescriptor col : schema.getColumns()) {
-      if (col.getMaxRepetitionLevel() > 0) {
-        return true;
+      for (ColumnDescriptor col : schema.getColumns()) {
+        if (col.getMaxRepetitionLevel() > 0) {
+          return true;
+        }
       }
+      return false;
+    } else {
+      return containsComplexColumn(footer, columns);
 
 Review comment:
   Methods combined in one and moved to utility class.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. But the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases where the
> optimized reader is used by checking whether the projected columns are simple or
> not.
> This is an optimization while waiting for the fast parquet reader to handle complex
> structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
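The star-query split in the `isComplex()` diff above can be modeled with a toy predicate. This is a sketch under assumptions: `complexFields` stands in for the fields that Drill's real check finds non-primitive or repeated, and `"*"` marks a star query in place of `Utilities.isStarQuery`.

```java
import java.util.List;
import java.util.Set;

// Toy model of the star-query split: for SELECT * the whole file schema
// decides, while for an explicit projection only the projected columns decide.
public class StarQuerySketch {

  public static boolean isComplex(Set<String> complexFields, List<String> projected) {
    if (projected.contains("*")) {
      // star query: any complex field in the file forces the old reader
      return !complexFields.isEmpty();
    }
    // explicit projection: only the projected columns matter
    for (String col : projected) {
      if (complexFields.contains(col)) {
        return true;
      }
    }
    return false;
  }
}
```

This is exactly the gain DRILL-5797 targets: a file with a complex field no longer disqualifies the new reader unless that field is projected.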


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539961#comment-16539961
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201661674
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test is aimed at
+ * checking the correctness of the function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
 
 Review comment:
   Yes, just please generate a small file to cover all needed cases.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. But the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases where the
> optimized reader is used by checking whether the projected columns are simple or
> not.
> This is an optimization while waiting for the fast parquet reader to handle complex
> structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539957#comment-16539957
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201661188
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of queried columns is nested or repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+    Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+    for (SchemaPath schemaPath : columns) {
+      // non-nested column check: full path must be equal to root segment path
+      if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
 
 Review comment:
   Replaced with `SchemaPath.isLeaf()`


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based 
> on what types of columns are present in the file, regardless of which columns 
> the query actually reads. We can slightly increase the number of cases where 
> the optimized reader is used by checking whether the projected columns are 
> simple or not.
> This is an interim optimization until the fast parquet reader can handle 
> complex structures.





[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539310#comment-16539310
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201517913
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
 
 Review comment:
  I will fix this and all the other test issues you highlighted. When fixing them 
I realised that the existing `complex.parquet` test file possibly doesn't provide 
the desired coverage for the changes in this PR. The bulk of the changes is aimed at 
making the utility functions work with schemas like 
   `a`
   `b`.`a`
   while the schema in `complex.parquet` doesn't contain such corner cases.
   Does it make sense to add a new test resource to cover such cases and base the tests on 
that file?









[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538647#comment-16538647
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201328868
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
 
 Review comment:
   You do not need to store conf at class level.









[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538645#comment-16538645
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201327153
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
+    List<SchemaPath> caseOldReader1 = new ArrayList<>();
+    List<SchemaPath> caseOldReader2 = new ArrayList<>();
+    List<SchemaPath> caseOldReader3 = new ArrayList<>();
+    List<SchemaPath> caseNewReader1 = new ArrayList<>();
+    List<SchemaPath> caseNewReader2 = new ArrayList<>();
+    List<SchemaPath> caseNewReader3 = new ArrayList<>();
+
+    SchemaPath topNestedPath = SchemaPath.getCompoundPath("marketing_info");
+    SchemaPath nestedColumnPath = SchemaPath.getCompoundPath("marketing_info", "camp_id");
+    SchemaPath topPath1 = SchemaPath.getCompoundPath("trans_id");
+    SchemaPath topPath2 = SchemaPath.getCompoundPath("amount");
+    SchemaPath nonExistentPath = SchemaPath.getCompoundPath("nonexistent");
+
+    caseOldReader1.add(topNestedPath);
+    caseOldReader2.add(nestedColumnPath);
+    caseOldReader3.add(topPath1);
+    caseOldReader3.add(nestedColumnPath);
+
+    caseNewReader1.add(topPath1);
+    caseNewReader2.add(topPath1);
+    caseNewReader2.add(topPath2);
+
+    assertTrue("Complex column not detected",
+        AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader1));
+    assertTrue("Complex column not detected",
+        AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader2));
+    assertTrue("Complex column not detected",
+        AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader3));
+
+    assertFalse("No complex column, but complex column is detected",
+        AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseNewReader1));
 
 Review comment:
  Though you have made the method public static for testing, that's not good 
practice; move the method to utilities instead. Semantically 
`AbstractParquetScanBatchCreator` definitely does not need to have this method 
as public static. As I have mentioned before, `isComplex` and 
`containsComplexColumn` should be one method, possibly a utility one to ease 
testing.



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538644#comment-16538644
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201326149
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
 
 Review comment:
  Please use `String.format`. Also, you are only passing the error message, not the 
full stack trace, which might be useful. I would not catch the exception at all 
and would just add it to the method signature. When the test fails, it would be obvious 
what happened.









[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538642#comment-16538642
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201325527
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of queried columns is nested or repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+    Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+    for (SchemaPath schemaPath : columns) {
+      // non-nested column check: full path must be equal to root segment path
+      if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
+          .equalsIgnoreCase(schemaPath.getRootSegment().getPath())) {
+        logger.debug("Forcing 'old' reader due to nested column: {}", schemaPath.getUnIndexed().toString());
+        return true;
+      }
+
+      // following column descriptor lookup failure may mean two cases, depending on subsequent SchemaElement lookup:
+      // 1. success: queried column is complex => use old reader
+      // 2. failure: queried column is not in schema => use new reader
+      ColumnDescriptor column = colDescMap.get(schemaPath.getUnIndexed().toString().toLowerCase());
+
+      if (column == null) {
+        SchemaElement se = schemaElements.get(schemaPath.getUnIndexed().toString().toLowerCase());
 
 Review comment:
   Not in favor of naming variables as `se`, does not convey any meaning.
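The two lookup outcomes described in the diff above can be sketched with stand-in sets rather than Drill's actual metadata classes (`ColumnLookupDemo`, `classify`, and both set parameters are hypothetical): a name with a column descriptor is a simple leaf; a name that is a known schema element but has no descriptor is a complex (group) column; a name absent from both is not in the file at all.

```java
import java.util.Set;

public class ColumnLookupDemo {
  enum Kind { SIMPLE, COMPLEX, MISSING }

  // leafColumns stands in for the ColumnDescriptor map's key set,
  // allSchemaElements for the SchemaElement map's key set.
  static Kind classify(String name, Set<String> leafColumns, Set<String> allSchemaElements) {
    if (leafColumns.contains(name)) {
      return Kind.SIMPLE;   // has a column descriptor: primitive leaf column
    }
    if (allSchemaElements.contains(name)) {
      return Kind.COMPLEX;  // known schema element but no descriptor: group column
    }
    return Kind.MISSING;    // not in the file schema at all
  }

  public static void main(String[] args) {
    Set<String> leaves = Set.of("`trans_id`", "`marketing_info`.`camp_id`");
    Set<String> elements = Set.of("`trans_id`", "`marketing_info`", "`marketing_info`.`camp_id`");
    System.out.println(classify("`trans_id`", leaves, elements));        // SIMPLE
    System.out.println(classify("`marketing_info`", leaves, elements));  // COMPLEX
    System.out.println(classify("`nonexistent`", leaves, elements));     // MISSING
  }
}
```

Only the COMPLEX case forces the old reader; MISSING columns are safe for the new one.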









[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538640#comment-16538640
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201354650
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -161,20 +167,24 @@ private ParquetMetadata readFooter(Configuration conf, 
String path) throws IOExc
 }
   }
 
-  private boolean isComplex(ParquetMetadata footer) {
+  private boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
     MessageType schema = footer.getFileMetaData().getSchema();
 
-    for (Type type : schema.getFields()) {
-      if (!type.isPrimitive()) {
-        return true;
+    if (Utilities.isStarQuery(columns)) {
+      for (Type type : schema.getFields()) {
+        if (!type.isPrimitive()) {
+          return true;
+        }
       }
-    }
-    for (ColumnDescriptor col : schema.getColumns()) {
-      if (col.getMaxRepetitionLevel() > 0) {
-        return true;
+      for (ColumnDescriptor col : schema.getColumns()) {
+        if (col.getMaxRepetitionLevel() > 0) {
+          return true;
+        }
       }
+      return false;
+    } else {
+      return containsComplexColumn(footer, columns);
 
 Review comment:
   I based this one on original PR, but indeed it's a valid point. Will regroup 
accordingly.
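The star-query versus explicit-projection split discussed in this diff can be sketched as follows. This is a hedged illustration, not Drill's API: `useFastReader`, the boolean-valued schema map, and the column lists are all hypothetical stand-ins for the real `isComplex`/`containsComplexColumn` pair.

```java
import java.util.List;
import java.util.Map;

public class ReaderChoiceDemo {
  // schemaColumnIsComplex maps each column name in the file to whether it is
  // nested or repeated (complex); true forces the old reader.
  static boolean useFastReader(boolean isStarQuery,
                               Map<String, Boolean> schemaColumnIsComplex,
                               List<String> projectedColumns) {
    if (isStarQuery) {
      // star query: any complex column anywhere in the file forces the old reader
      return schemaColumnIsComplex.values().stream().noneMatch(c -> c);
    }
    // explicit projection: only the projected columns matter; names absent
    // from the schema are treated as simple (the new reader handles them)
    return projectedColumns.stream()
        .noneMatch(c -> schemaColumnIsComplex.getOrDefault(c, false));
  }

  public static void main(String[] args) {
    Map<String, Boolean> schema = Map.of("trans_id", false, "marketing_info", true);
    System.out.println(useFastReader(true, schema, List.of()));                  // false
    System.out.println(useFastReader(false, schema, List.of("trans_id")));       // true
    System.out.println(useFastReader(false, schema, List.of("marketing_info"))); // false
  }
}
```

The point of the PR is the second branch: a file containing complex columns can still use the fast reader when none of them are projected.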









[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538627#comment-16538627
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201352437
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
 
 Review comment:
  Actually I took it from another test, as I saw it in many other locations, so I 
considered it to be the established naming convention. I'll correct it.









[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538615#comment-16538615
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201350337
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int 
start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement 
objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-    HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+    Map<String, SchemaElement> schemaElements = new HashMap<>();
     FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-    for (SchemaElement se : fileMetaData.getSchema()) {
-      schemaElements.put(se.getName(), se);
+
+    Iterator<SchemaElement> iter = fileMetaData.getSchema().iterator();
+
+    // skip first default 'root' element
+    if (iter.hasNext()) {
+      iter.next();
+    }
+    while (iter.hasNext()) {
+      addSchemaElementMapping(iter, new StringBuilder(), schemaElements);
     }
     return schemaElements;
   }
 
+  /**
+   * Populate full path to SchemaElement map by recursively traversing schema elements referenced by the given iterator.
+   *
+   * @param iter file schema values iterator
+   * @param path parent schema element path
+   * @param schemaElements schema elements map to insert next iterator element into
+   */
+  private static void addSchemaElementMapping(Iterator<SchemaElement> iter, StringBuilder path,
+      Map<String, SchemaElement> schemaElements) {
+    SchemaElement se = (SchemaElement) iter.next();
+    path.append('`').append(se.getName().toLowerCase()).append('`');
+    schemaElements.put(path.toString(), se);
+
+    int remainingChildren = se.getNum_children();
+
+    while (remainingChildren > 0 && iter.hasNext()) {
 
 Review comment:
  `iterator.hasNext()` relates to whether there are more elements in the flat tree 
representation and is in itself not enough to maintain the correct recursion level. 
If we don't maintain/check the remaining children count for the parent element, the schema
   `a`.`b`
   `a`.`c`
   will result in the following keys in the map:
   `a`
   `a`.`b`
   `a`.`b`.`c`
   since the function will erroneously enter recursion where it should not.
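The sibling-versus-child distinction described here can be demonstrated with a small stand-in for parquet's flat schema list. `Element` is a hypothetical substitute for parquet's `SchemaElement`: the schema is stored as a flat pre-order list where each entry carries the number of its direct children, and the walk below mirrors the `addSchemaElementMapping` recursion, consuming exactly `numChildren` subtrees per parent so siblings are keyed correctly.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlatSchemaWalk {
  // Hypothetical stand-in for parquet's SchemaElement (name + child count).
  static final class Element {
    final String name;
    final int numChildren;
    Element(String name, int numChildren) { this.name = name; this.numChildren = numChildren; }
  }

  static Map<String, Element> buildPaths(List<Element> flat) {
    Map<String, Element> out = new LinkedHashMap<>();
    Iterator<Element> it = flat.iterator();
    if (it.hasNext()) { it.next(); } // skip the synthetic 'root' element
    while (it.hasNext()) { add(it, new StringBuilder(), out); }
    return out;
  }

  private static void add(Iterator<Element> it, StringBuilder path, Map<String, Element> out) {
    Element e = it.next();
    path.append('`').append(e.name.toLowerCase()).append('`');
    out.put(path.toString(), e);
    // recurse only for this element's own children; without the counter the
    // walk would wrongly treat the next sibling as a child
    for (int remaining = e.numChildren; remaining > 0 && it.hasNext(); remaining--) {
      add(it, new StringBuilder(path).append('.'), out);
    }
  }

  public static void main(String[] args) {
    // pre-order flat form of: root { a { b, c } }
    List<Element> flat = List.of(
        new Element("root", 1), new Element("a", 2),
        new Element("b", 0), new Element("c", 0));
    System.out.println(buildPaths(flat).keySet()); // [`a`, `a`.`b`, `a`.`c`]
  }
}
```

Dropping the `remaining` counter would produce the erroneous `` `a`.`b`.`c` `` key described in the comment above.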









[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538606#comment-16538606
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201345928
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int 
start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement 
objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-    HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+    Map<String, SchemaElement> schemaElements = new HashMap<>();
     FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-    for (SchemaElement se : fileMetaData.getSchema()) {
-      schemaElements.put(se.getName(), se);
+
+    Iterator<SchemaElement> iter = fileMetaData.getSchema().iterator();
+
+    // skip first default 'root' element
+    if (iter.hasNext()) {
 
 Review comment:
   I think then you may skip the first element, but please add a comment 
describing the reason.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538597#comment-16538597
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201344111
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int start) {
     return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement objects.
+   *
+   * @param footer Parquet file metadata
+   * @return schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-    HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+    Map<String, SchemaElement> schemaElements = new HashMap<>();
     FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-    for (SchemaElement se : fileMetaData.getSchema()) {
-      schemaElements.put(se.getName(), se);
+
+    Iterator iter = fileMetaData.getSchema().iterator();
+
+    // skip first default 'root' element
+    if (iter.hasNext()) {
 
 Review comment:
   First element in `FileMetaData.getSchema()` is a default element named `root`. 
If we don't skip it, map keys would have the format:
   `root`.`a`
   `root`.`b`
   etc.
   Correspondingly, comparison with the `SchemaPath.toString()` representation 
would need to be implemented with that in mind (the `root` part would have to be 
cut anyway). If that is considered the better approach, I can implement it 
that way instead.
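The effect of skipping the synthetic `root` element can be shown with a stand-in for Parquet's flattened schema list (plain `String` names replace `SchemaElement` here, and the backtick-quoted key format is assumed from the discussion above):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RootSkipSketch {

  // Builds a full-path-key map from a flattened schema whose first element
  // is the synthetic 'root'; skipping it keeps keys in the `a` form rather
  // than `root`.`a`.
  public static Map<String, String> buildKeyMap(List<String> flatSchema) {
    Map<String, String> keys = new LinkedHashMap<>();
    Iterator<String> iter = flatSchema.iterator();
    if (iter.hasNext()) {
      iter.next(); // skip first default 'root' element
    }
    while (iter.hasNext()) {
      String name = iter.next();
      keys.put("`" + name + "`", name); // key without the 'root' prefix
    }
    return keys;
  }

  public static void main(String[] args) {
    System.out.println(buildKeyMap(List.of("root", "a", "b")).keySet()); // [`a`, `b`]
  }
}
```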


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538544#comment-16538544
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201328263
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
 
 Review comment:
   `String.format`
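One possible way to apply the `String.format` suggestion to the `fail(...)` message above (the argument values here are illustrative, not taken from a real run):

```java
public class FormatSketch {

  // Builds the failure message with String.format instead of concatenation.
  public static String failMessage(String path, String error, String cwd) {
    return String.format("Could not read Parquet file '%s', error message: %s cwd: %s",
        path, error, cwd);
  }

  public static void main(String[] args) {
    System.out.println(failMessage("complex.parquet", "File not found", "/tmp"));
  }
}
```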


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538535#comment-16538535
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201324236
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext operatorContext) {
     protected abstract DrillFileSystem get(Configuration config, String path) throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of queried columns is nested or repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+    Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+    for (SchemaPath schemaPath : columns) {
+      // non-nested column check: full path must be equal to root segment path
+      if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
 
 Review comment:
   That's definitely not a good choice, comparing by replacing "`". Please 
find a different way to compare, or write a helper method that compares 
without replacing, perhaps by traversing the schema path. SchemaPath has plenty 
of handy methods; if the one you need is missing, implement it. Basically, you 
cannot rely on the string representation, since it can change; you should rely 
on the object's info instead.
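A sketch of the segment-traversal idea the reviewer suggests: compare a path segment by segment against schema names rather than manipulating the backtick-quoted string form. `Seg` is a hypothetical stand-in for the `NameSegment` chain inside Drill's `SchemaPath`, which is not available in this self-contained snippet.

```java
import java.util.List;

public class SegmentCompareSketch {

  // Minimal linked-segment model standing in for SchemaPath's segments.
  static final class Seg {
    final String name;
    final Seg child;
    Seg(String name, Seg child) { this.name = name; this.child = child; }
  }

  // True when the segment chain spells out exactly the given full path,
  // with no string replacement involved.
  public static boolean matches(Seg root, List<String> fullPath) {
    Seg cur = root;
    for (String name : fullPath) {
      if (cur == null || !cur.name.equalsIgnoreCase(name)) {
        return false;
      }
      cur = cur.child;
    }
    return cur == null; // no trailing segments left over
  }

  public static void main(String[] args) {
    Seg path = new Seg("marketing_info", new Seg("camp_id", null));
    System.out.println(matches(path, List.of("marketing_info", "camp_id"))); // true
    System.out.println(matches(path, List.of("marketing_info")));            // false
  }
}
```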


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538541#comment-16538541
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201328868
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
 
 Review comment:
   You do not need to store `conf` at class level.






[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538547#comment-16538547
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201327371
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testSchemaElementsMap() {
+    assertTrue(footer != null);
 
 Review comment:
   Why do we check footer nullability in each test? It can be done in the 
`@BeforeClass` method.






[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538540#comment-16538540
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201326149
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
 
 Review comment:
   Please use `String.format`. Also, you are only printing the error message, 
not the full stack trace, which might be useful. I would not catch the 
exception at all and would just add it to the method signature. When the test 
fails, it would be obvious what happened.
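The "declare it, don't catch it" alternative can be illustrated with plain file I/O (a stand-in for `ParquetFileReader.readFooter`, since the Drill and Parquet classes are not available in this snippet):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PropagateSketch {

  // If this were a JUnit @BeforeClass method, declaring IOException on the
  // signature would let JUnit report the full stack trace on failure,
  // with no catch block or hand-built message needed.
  public static byte[] readOrPropagate(String path) throws IOException {
    return Files.readAllBytes(Paths.get(path));
  }

  public static void main(String[] args) {
    try {
      readOrPropagate("no-such-file.parquet");
    } catch (IOException e) {
      // The caller (here, the test harness) sees the real exception type.
      System.out.println("propagated: " + e.getClass().getSimpleName());
    }
  }
}
```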






[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538551#comment-16538551
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201330917
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int start) {
     return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement objects.
+   *
+   * @param footer Parquet file metadata
+   * @return schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-    HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+    Map<String, SchemaElement> schemaElements = new HashMap<>();
     FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-    for (SchemaElement se : fileMetaData.getSchema()) {
-      schemaElements.put(se.getName(), se);
+
+    Iterator iter = fileMetaData.getSchema().iterator();
+
+    // skip first default 'root' element
+    if (iter.hasNext()) {
+      iter.next();
+    }
+    while (iter.hasNext()) {
+      addSchemaElementMapping(iter, new StringBuilder(), schemaElements);
     }
     return schemaElements;
   }
 
+  /**
+   * Populate full path to SchemaElement map by recursively traversing schema elements referenced by the given iterator
+   *
+   * @param iter file schema values iterator
+   * @param path parent schema element path
+   * @param schemaElements schema elements map to insert next iterator element into
+   */
+  private static void addSchemaElementMapping(Iterator iter, StringBuilder path,
+      Map<String, SchemaElement> schemaElements) {
+    SchemaElement se = (SchemaElement) iter.next();
 
 Review comment:
   No need for casting if you pass a parametrized iterator...
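The reviewer's point, shown with `String` standing in for Parquet's `SchemaElement`: a raw `Iterator` forces a cast at the call to `next()`, while a parametrized `Iterator<String>` does not.

```java
import java.util.Iterator;
import java.util.List;

public class IteratorCastSketch {

  @SuppressWarnings({"rawtypes", "unchecked"})
  public static String nextWithCast(Iterator iter) {
    return (String) iter.next(); // cast required: raw iterator yields Object
  }

  public static String nextWithoutCast(Iterator<String> iter) {
    return iter.next(); // no cast: the element type is known
  }

  public static void main(String[] args) {
    List<String> schema = List.of("a", "b");
    System.out.println(nextWithCast(schema.iterator()));    // a
    System.out.println(nextWithoutCast(schema.iterator())); // a
  }
}
```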






[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538538#comment-16538538
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201326437
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
 
 Review comment:
   Please split this into several unit tests; unit test names should show 
which cases they are testing.






[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538539#comment-16538539
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201327153
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+conf = new Configuration();
+
+try {
+  footer = ParquetFileReader.readFooter(conf, new Path(path));
+} catch (IOException ioe) {
+  fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+  + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+  throw(ioe);
+}
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
+List<SchemaPath> caseOldReader1 = new ArrayList<>();
+List<SchemaPath> caseOldReader2 = new ArrayList<>();
+List<SchemaPath> caseOldReader3 = new ArrayList<>();
+List<SchemaPath> caseNewReader1 = new ArrayList<>();
+List<SchemaPath> caseNewReader2 = new ArrayList<>();
+List<SchemaPath> caseNewReader3 = new ArrayList<>();
+
+SchemaPath topNestedPath = SchemaPath.getCompoundPath("marketing_info");
+SchemaPath nestedColumnPath = SchemaPath.getCompoundPath("marketing_info", "camp_id");
+SchemaPath topPath1 = SchemaPath.getCompoundPath("trans_id");
+SchemaPath topPath2 = SchemaPath.getCompoundPath("amount");
+SchemaPath nonExistentPath = SchemaPath.getCompoundPath("nonexistent");
+
+caseOldReader1.add(topNestedPath);
+caseOldReader2.add(nestedColumnPath);
+caseOldReader3.add(topPath1);
+caseOldReader3.add(nestedColumnPath);
+
+caseNewReader1.add(topPath1);
+caseNewReader2.add(topPath1);
+caseNewReader2.add(topPath2);
+
+assertTrue("Complex column not detected",
+AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader1));
+assertTrue("Complex column not detected",
+AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader2));
+assertTrue("Complex column not detected",
+AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseOldReader3));
+
+assertFalse("No complex column, but complex column is detected",
+AbstractParquetScanBatchCreator.containsComplexColumn(footer, caseNewReader1));
 
 Review comment:
   Though you have made the method public static for testing, good practice would be to move it to a utility class. Semantically, `AbstractParquetScanBatchCreator` definitely does not need to have this method as public static. As I have mentioned before, `isComplex` and `containsComplexColumn` should be one method, possibly a utility one, to ease testing.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538548#comment-16538548
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201328013
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
 
 Review comment:
   Do you use `conf` anywhere other than in `setUp`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on the types of columns present in the file, even though only the columns
> actually read by the query matter. We can increase somewhat the number of
> cases where the optimized reader is used by checking whether the projected
> columns are simple or not.
> This is an optimization while waiting for the fast parquet reader to handle
> complex structures.
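The decision described in this issue can be sketched as follows. This is a hypothetical, simplified model for illustration only: plain dotted strings and a repetition-level map stand in for Drill's `SchemaPath` and Parquet's `ColumnDescriptor`, and `useFastReader` is not an actual Drill method.

```java
import java.util.List;
import java.util.Map;

// Illustrative model of the reader choice (not Drill's actual API):
// a projected column forces the "old" reader if it is nested or repeated;
// otherwise the fast reader can be used.
public class ReaderChoiceDemo {

  // schema: full column path -> max repetition level (> 0 means repeated)
  public static boolean useFastReader(Map<String, Integer> schema, List<String> projected) {
    for (String col : projected) {
      if (col.contains(".")) {
        return false;               // nested column => old reader
      }
      Integer repLevel = schema.get(col);
      if (repLevel != null && repLevel > 0) {
        return false;               // repeated column => old reader
      }
      // column absent from the schema: the fast reader can still be used,
      // it will simply produce nulls for it
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, Integer> schema = Map.of("trans_id", 0, "amount", 0, "marketing_info.camp_id", 0);
    System.out.println(useFastReader(schema, List.of("trans_id", "amount")));      // simple columns
    System.out.println(useFastReader(schema, List.of("marketing_info.camp_id"))); // nested column
  }
}
```

The point of the change is exactly this per-query check: the file may contain complex columns, but only the projected ones decide which reader runs.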



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538542#comment-16538542
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201324792
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -161,20 +167,24 @@ private ParquetMetadata readFooter(Configuration conf, String path) throws IOExc
 }
   }
 
-  private boolean isComplex(ParquetMetadata footer) {
+  private boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
 MessageType schema = footer.getFileMetaData().getSchema();
 
-for (Type type : schema.getFields()) {
-  if (!type.isPrimitive()) {
-return true;
+if (Utilities.isStarQuery(columns)) {
+  for (Type type : schema.getFields()) {
+if (!type.isPrimitive()) {
+  return true;
+}
   }
-}
-for (ColumnDescriptor col : schema.getColumns()) {
-  if (col.getMaxRepetitionLevel() > 0) {
-return true;
+  for (ColumnDescriptor col : schema.getColumns()) {
+if (col.getMaxRepetitionLevel() > 0) {
+  return true;
+}
   }
+  return false;
+} else {
+  return containsComplexColumn(footer, columns);
 
 Review comment:
   Basically `isComplex` and `containsComplexColumn` do the same thing; why split this into two semantically identical methods?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538543#comment-16538543
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201325125
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of the queried columns is nested or repeated.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+for (SchemaPath schemaPath : columns) {
+  // non-nested column check: full path must be equal to root segment path
+  if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
+  .equalsIgnoreCase(schemaPath.getRootSegment().getPath())) {
+logger.debug("Forcing 'old' reader due to nested column: {}", schemaPath.getUnIndexed().toString());
+return true;
+  }
+
+  // following column descriptor lookup failure may mean two cases, depending on subsequent SchemaElement lookup:
+  // 1. success: queried column is complex => use old reader
+  // 2. failure: queried column is not in schema => use new reader
+  ColumnDescriptor column = colDescMap.get(schemaPath.getUnIndexed().toString().toLowerCase());
+
+  if (column == null) {
+SchemaElement se = schemaElements.get(schemaPath.getUnIndexed().toString().toLowerCase());
+if (se != null) {
+  return true;
+}
+  } else {
+if (column.getMaxRepetitionLevel() > 0) {
+  logger.debug("Forcing 'old' reader due to repetitive column: {}", schemaPath.getUnIndexed().toString());
 
 Review comment:
   The same as above.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538537#comment-16538537
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201325934
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test is aimed at
+ * checking correctness of the function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
 
 Review comment:
   Again, the method you are testing has nothing to do with the reader decision; it just tells whether there are complex columns...


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538536#comment-16538536
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201324454
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of the queried columns is nested or repeated.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+for (SchemaPath schemaPath : columns) {
+  // non-nested column check: full path must be equal to root segment path
+  if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
+  .equalsIgnoreCase(schemaPath.getRootSegment().getPath())) {
+logger.debug("Forcing 'old' reader due to nested column: {}", schemaPath.getUnIndexed().toString());
 
 Review comment:
   This method has nothing to do with forcing the reader; it just tells whether the given schema paths contain complex columns.
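For context on the diff under review: its two lookup maps distinguish a complex column from a missing one, because every schema element, groups included, appears in the `SchemaElement` map, while only leaf columns get a `ColumnDescriptor`. A simplified sketch of that two-case check, using plain sets of dotted names instead of Drill's backticked paths or the actual parquet classes:

```java
import java.util.Set;

// Simplified sketch: 'leaves' models the ColumnDescriptor map (leaf columns
// only), 'allElements' models the SchemaElement map (groups and leaves alike).
public class ColumnLookupDemo {

  public static String classify(Set<String> leaves, Set<String> allElements, String column) {
    if (leaves.contains(column)) {
      return "simple";        // has a column descriptor => a readable leaf
    }
    if (allElements.contains(column)) {
      return "complex";       // in the schema but not a leaf => a group
    }
    return "missing";         // not in the schema at all
  }

  public static void main(String[] args) {
    Set<String> leaves = Set.of("trans_id", "marketing_info.camp_id");
    Set<String> all = Set.of("trans_id", "marketing_info", "marketing_info.camp_id");
    System.out.println(classify(leaves, all, "marketing_info"));  // a group column
    System.out.println(classify(leaves, all, "nonexistent"));     // absent column
  }
}
```

Only the "complex" case forces the old reader; a missing column is safe for the new one.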


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538552#comment-16538552
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201331494
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+Map<String, SchemaElement> schemaElements = new HashMap<>();
 FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-for (SchemaElement se : fileMetaData.getSchema()) {
-  schemaElements.put(se.getName(), se);
+
+Iterator iter = fileMetaData.getSchema().iterator();
+
+// skip first default 'root' element
+if (iter.hasNext()) {
+  iter.next();
+}
+while (iter.hasNext()) {
+  addSchemaElementMapping(iter, new StringBuilder(), schemaElements);
 }
 return schemaElements;
   }
 
+  /**
+   * Populate full path to SchemaElement map by recursively traversing schema elements referenced by the given iterator.
+   *
+   * @param iter file schema values iterator
+   * @param path parent schema element path
+   * @param schemaElements schema elements map to insert next iterator element into
+   */
+  private static void addSchemaElementMapping(Iterator iter, StringBuilder path,
+  Map<String, SchemaElement> schemaElements) {
+SchemaElement se = (SchemaElement)iter.next();
+path.append('`').append(se.getName().toLowerCase()).append('`');
+schemaElements.put(path.toString(), se);
+
+int remainingChildren = se.getNum_children();
+
+while (remainingChildren > 0 && iter.hasNext()) {
 
 Review comment:
   Why do we need to count remaining children? Why is `iterator.hasNext()` not sufficient?
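Some context on this question: in the parquet-format Thrift footer, `FileMetaData.schema` is a flattened, depth-first list in which each `SchemaElement` records its `num_children` but there is no end-of-group marker. The child count is therefore the only way to know where a nested group stops; `iterator.hasNext()` can only detect the end of the whole list. A minimal sketch of that traversal, using a simplified stand-in `Element` type rather than the real parquet classes:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for parquet-format's SchemaElement: a name plus the
// number of direct children (0 for a leaf column), as stored in the
// flattened pre-order list of the Thrift footer.
public class FlattenedSchemaDemo {

  public static final class Element {
    public final String name;
    public final int numChildren;
    public Element(String name, int numChildren) {
      this.name = name;
      this.numChildren = numChildren;
    }
  }

  // Rebuild full `a`.`b`-style paths from the flattened pre-order list.
  // The child count tells us where a group ends: the list has no closing
  // markers, so hasNext() alone cannot detect the end of a nested group.
  public static Map<String, Element> toPathMap(List<Element> flat) {
    Map<String, Element> result = new LinkedHashMap<>();
    Iterator<Element> it = flat.iterator();
    it.next();  // skip the synthetic root element, which has no column path
    while (it.hasNext()) {
      addMapping(it, new StringBuilder(), result);
    }
    return result;
  }

  private static void addMapping(Iterator<Element> it, StringBuilder path, Map<String, Element> result) {
    Element e = it.next();
    path.append('`').append(e.name).append('`');
    result.put(path.toString(), e);
    int remaining = e.numChildren;
    // consume exactly numChildren subtrees for this group
    while (remaining > 0 && it.hasNext()) {
      addMapping(it, new StringBuilder(path).append('.'), result);
      remaining--;
    }
  }

  public static void main(String[] args) {
    // root -> { trans_id, marketing_info -> { camp_id }, amount }
    List<Element> flat = List.of(
        new Element("root", 3),
        new Element("trans_id", 0),
        new Element("marketing_info", 1),
        new Element("camp_id", 0),
        new Element("amount", 0));
    System.out.println(toPathMap(flat).keySet());
  }
}
```

Skipping the root element is part of the same layout: the first `SchemaElement` is the message root and does not correspond to any column path.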


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538546#comment-16538546
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201329822
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+Map<String, SchemaElement> schemaElements = new HashMap<>();
 FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-for (SchemaElement se : fileMetaData.getSchema()) {
-  schemaElements.put(se.getName(), se);
+
+Iterator iter = fileMetaData.getSchema().iterator();
+
+// skip first default 'root' element
+if (iter.hasNext()) {
 
 Review comment:
   Why do we need to skip the root element?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538549#comment-16538549
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201325527
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of the queried columns is nested or repeated.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, List<SchemaPath> columns) {
+
+Map<String, ColumnDescriptor> colDescMap = ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+for (SchemaPath schemaPath : columns) {
+  // non-nested column check: full path must be equal to root segment path
+  if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
+  .equalsIgnoreCase(schemaPath.getRootSegment().getPath())) {
+logger.debug("Forcing 'old' reader due to nested column: {}", schemaPath.getUnIndexed().toString());
+return true;
+  }
+
+  // following column descriptor lookup failure may mean two cases, depending on subsequent SchemaElement lookup:
+  // 1. success: queried column is complex => use old reader
+  // 2. failure: queried column is not in schema => use new reader
+  ColumnDescriptor column = colDescMap.get(schemaPath.getUnIndexed().toString().toLowerCase());
+
+  if (column == null) {
+SchemaElement se = schemaElements.get(schemaPath.getUnIndexed().toString().toLowerCase());
 
 Review comment:
   Not in favor of naming variables `se`; the name does not carry any meaning.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538554#comment-16538554
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201330725
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> getColNameToSchemaElementMapping(ParquetMetadata footer) {
-HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+Map<String, SchemaElement> schemaElements = new HashMap<>();
 FileMetaData fileMetaData = new ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, footer);
-for (SchemaElement se : fileMetaData.getSchema()) {
-  schemaElements.put(se.getName(), se);
+
+Iterator iter = fileMetaData.getSchema().iterator();
 
 Review comment:
   `Iterator iterator`


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538534#comment-16538534
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201323486
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of queried columns is nested or repetitive.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, 
List columns) {
 
 Review comment:
   Why is it public and static?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538550#comment-16538550
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201327704
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testSchemaElementsMap() {
+    assertTrue(footer != null);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+    assertTrue(schemaElements.size() == 14);
+
+    SchemaElement se = schemaElements.get("`marketing_info`.`camp_id`");
+    assertTrue(se != null);
 
 Review comment:
   `assertNotNull`




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538545#comment-16538545
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201327587
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+    conf = new Configuration();
+
+    try {
+      footer = ParquetFileReader.readFooter(conf, new Path(path));
+    } catch (IOException ioe) {
+      fail("Could not read Parquet file '" + path + "', error message: " + ioe.getMessage()
+          + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+      throw(ioe);
+    }
+  }
+
+  @Test
+  public void testSchemaElementsMap() {
+    assertTrue(footer != null);
+    Map<String, SchemaElement> schemaElements = ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+    assertTrue(schemaElements.size() == 14);
+
+    SchemaElement se = schemaElements.get("`marketing_info`.`camp_id`");
+    assertTrue(se != null);
+    assertTrue("camp_id".equals(se.getName()));
 
 Review comment:
   Use `assertEquals` here and below.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538553#comment-16538553
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201328152
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
 
 Review comment:
  I think the annotation makes it obvious that this method runs before the class, no need to add this to the method name.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538261#comment-16538261
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin opened a new pull request #1370: DRILL-5797: Use Parquet new reader 
more often
URL: https://github.com/apache/drill/pull/1370
 
 
   # DRILL-5797: use Parquet new reader more often
   ## Background
   This PR is a follow-up on previous work done by @dprofeta and documented in the JIRA. Previously, the new reader was only used if the file schema did not contain a single complex column. With this change, the new reader will also be used on a complex schema as long as the queried column list does not contain any complex column, which should make new reader usage more frequent.
   
   ## Change description
   In order to make usage of the new reader possible on a complex schema, the following modifications had to be made:
   
   - `ParquetReaderUtility` class - modified and added several functions to enable it to work with nested schemas. E.g. one limitation was explicitly referencing the top-level schema element path with `column.getPath()[0]` in several locations. The top-level schema element path was also used in building the path-to-`SchemaElement` map, which corrupted the map when the schema contained columns `a` and `b.a` (key `a` was used for both schema elements, overwriting the map entry).
   - `ParquetSchema` - `fieldSelected()` function replaced with `columnSelected()` in order to enable it to work with full paths. Previously, it would fail when the schema contained columns `a` and `b.a`, as both schema paths would be classified as selected.
   - `ParquetColumnMetadata` - replaced the top-level path reference with the full path; also, replaced the parameter passed to `ParquetToDrillTypeConverter.toMajorType()` from `se.getType_length()` to `column.getTypeLength()`. The reason is that `se.getType_length()` returns 0 on a FIXED_LEN_BYTE_ARRAY column, causing a subsequent failure in minor type conversion that was failing complex Parquet tests. `column.getTypeLength()` provides the correct result. In fact, I am not sure whether this is a Parquet bug - possibly a TBD item.
   - `AbstractParquetScanBatchCreator` - added a function which uses the `ParquetReaderUtility` functions to identify whether the query column list contains complex columns and thus whether the query qualifies for the new reader.
   
   Added tests rely on the existing `complex.parquet` file used in other tests.
   
   ## Level of testing
   Build tests and complex*q query tests from the Drill test framework. Tests were added for the newly introduced methods except for `ParquetReaderUtility.buildFullColumnPath()`.
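   The map-corruption scenario from the change description can be reproduced with a toy example. This is a hypothetical illustration (plain Java maps and invented helper names, not Drill's `SchemaElement` map): keying schema elements by their short name alone makes the entries for columns `a` and `b.a` collide, while keying by the full dotted path keeps both.

   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   /** Toy model of the key-collision bug: schema element paths are lists of
    *  segments, e.g. ["a"] and ["b", "a"] for columns a and b.a. */
   public class SchemaKeySketch {

     static Map<String, String> keyByName(List<List<String>> elementPaths) {
       Map<String, String> m = new HashMap<>();
       for (List<String> path : elementPaths) {
         // Short name only: "a" and "b.a" both land under key "a",
         // so the second put() silently overwrites the first entry.
         m.put(path.get(path.size() - 1), String.join(".", path));
       }
       return m;
     }

     static Map<String, String> keyByFullPath(List<List<String>> elementPaths) {
       Map<String, String> m = new HashMap<>();
       for (List<String> path : elementPaths) {
         // Full dotted path: keys "a" and "b.a" stay distinct.
         m.put(String.join(".", path), String.join(".", path));
       }
       return m;
     }

     public static void main(String[] args) {
       List<List<String>> paths = Arrays.asList(
           Arrays.asList("a"),
           Arrays.asList("b", "a"));
       System.out.println(keyByName(paths).size());     // 1: one entry was overwritten
       System.out.println(keyByFullPath(paths).size()); // 2: both columns preserved
     }
   }
   ```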




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537638#comment-16537638
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on issue #1368: DRILL-5797: use Parquet new reader more often
URL: https://github.com/apache/drill/pull/1368#issuecomment-403628124
 
 
   Rebasing to squash commits triggered an error; will check what went wrong.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537519#comment-16537519
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin removed a comment on issue #1368: DRILL-5797: use Parquet new reader 
more often
URL: https://github.com/apache/drill/pull/1368#issuecomment-403550025
 
 
   One conflict originates from not having fetched the upstream master. Will correct that.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537270#comment-16537270
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin commented on issue #1368: DRILL-5797: use Parquet new reader more often
URL: https://github.com/apache/drill/pull/1368#issuecomment-403550025
 
 
   One conflict originates from not having fetched the upstream master. Will correct that.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537213#comment-16537213
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva closed pull request #976: DRILL-5797: Choose parquet reader 
from read columns
URL: https://github.com/apache/drill/pull/976
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java b/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
index 60179482fc3..82764285e99 100644
--- a/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
+++ b/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
@@ -26,6 +26,8 @@
 import com.google.common.base.Stopwatch;
 import com.google.common.collect.Maps;
 import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
 import org.apache.drill.exec.ExecConstants;
 import org.apache.drill.exec.ops.FragmentContext;
 import org.apache.drill.exec.ops.OperatorContext;
@@ -37,12 +39,15 @@
 import org.apache.drill.exec.store.dfs.DrillFileSystem;
 import org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader;
 import org.apache.drill.exec.store.parquet2.DrillParquetReader;
+import org.apache.drill.exec.util.Utilities;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.parquet.column.ColumnDescriptor;
 import org.apache.parquet.hadoop.CodecFactory;
 import org.apache.parquet.hadoop.ParquetFileReader;
 import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.io.InvalidRecordException;
+import org.apache.parquet.schema.GroupType;
 import org.apache.parquet.schema.MessageType;
 import org.apache.parquet.schema.Type;
 
@@ -119,7 +124,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
 if (logger.isDebugEnabled()) {
   logger.debug(containsCorruptDates.toString());
 }
-    if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
+    if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()), rowGroupScan.getColumns())) {
       readers.add(
           new ParquetRecordReader(
               context, e.getPath(), e.getRowGroupIndex(), e.getNumRecordsToRead(), fs,
@@ -156,20 +161,49 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+    /**
+     * ParquetRecordReader is not able to read any nested columns and is not able to handle repeated columns.
+     * It only handles flat columns and optional columns.
+     * If it is a wildcard query, we check every column in the metadata.
+     * If not, we only check the projected columns.
+     * We only check the first-level columns because:
+     *   - if we need a.b, it means a is a complex type; no need to check b as we don't handle complex types.
+     *   - if we need a[10], a is repeated, i.e. its repetition level is greater than 0.
+     *   - if we need a, it is at the first level of the schema.
+     */
     MessageType schema = footer.getFileMetaData().getSchema();
-
-    for (Type type : schema.getFields()) {
-      if (!type.isPrimitive()) {
-        return true;
+    if (Utilities.isStarQuery(columns)) {
+      for (Type type : schema.getFields()) {
+        if (!type.isPrimitive()) {
+          return true;
+        }
       }
-    }
-    for (ColumnDescriptor col : schema.getColumns()) {
-      if (col.getMaxRepetitionLevel() > 0) {
-        return true;
+      for (ColumnDescriptor col : schema.getColumns()) {
+        if (col.getMaxRepetitionLevel() > 0) {
+          return true;
+        }
+      }
+      return false;
+    } else {
+      for (SchemaPath column : columns) {
+        if (isColumnComplex(schema, column)) {
+          return true;
+        }
       }
+      return false;
+    }
+  }
+
+  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
+    try {
+      Type type = grouptype.getType(column.getRootSegment().getPath().toLowerCase());
+      return type.isRepetition(Type.Repetition.REPEATED) || 
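The first-level check described in the diff's comment can be sketched standalone with the Parquet types mocked out. Everything here is invented for illustration (the `FieldInfo` class stands in for Parquet's `Type`/`ColumnDescriptor`); it is not the Drill or Parquet API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A projected column is complex when its first-level field is non-primitive
 *  (nested) or repeated; for a star query every field in the schema is checked. */
public class IsComplexSketch {

  /** Mock field descriptor: is the first-level field primitive? repeated? */
  static final class FieldInfo {
    final boolean primitive;
    final boolean repeated;
    FieldInfo(boolean primitive, boolean repeated) {
      this.primitive = primitive;
      this.repeated = repeated;
    }
  }

  static boolean isComplex(Map<String, FieldInfo> schema, List<String> projectedRoots) {
    // Star query: no explicit projection, so check all first-level fields.
    List<String> roots = projectedRoots.isEmpty()
        ? new ArrayList<>(schema.keySet())
        : projectedRoots;
    for (String root : roots) {
      FieldInfo f = schema.get(root);
      if (f != null && (!f.primitive || f.repeated)) {
        return true;  // nested or repeated -> complex
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Map<String, FieldInfo> schema = new HashMap<>();
    schema.put("id", new FieldInfo(true, false));      // flat optional column
    schema.put("tags", new FieldInfo(true, true));     // repeated primitive
    schema.put("owner", new FieldInfo(false, false));  // nested group

    System.out.println(isComplex(schema, Arrays.asList("id")));    // false
    System.out.println(isComplex(schema, Arrays.asList("tags")));  // true
    System.out.println(isComplex(schema, Arrays.asList()));        // true (star query)
  }
}
```

Only the root segment of each projected path needs inspecting, for the reasons given in the diff's comment: a nested reference like `a.b` already implies that `a` is complex.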

[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537212#comment-16537212
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

arina-ielchiieva commented on issue #976: DRILL-5797: Choose parquet reader 
from read columns
URL: https://github.com/apache/drill/pull/976#issuecomment-403541727
 
 
   Closing this PR, since new one was opened - 
https://github.com/apache/drill/pull/1368.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537205#comment-16537205
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

okalinin opened a new pull request #1368: DRILL-5797: use Parquet new reader 
more often
URL: https://github.com/apache/drill/pull/1368
 
 
   # DRILL-5797: use Parquet new reader more often
   
   ## Background
   This PR is a follow-up on previous work done by @dprofeta and documented in the JIRA. Previously, the new reader was only used if the file schema did not contain a single complex column. With this change, the new reader will also be used on a complex schema as long as the queried column list does not contain any complex column, which should make new reader usage more frequent.
   
   ## Change description
   In order to make usage of the new reader possible on a complex schema, the following modifications had to be made:
   - `ParquetReaderUtility` class - modified and added several functions to enable it to work with nested schemas. E.g. one limitation was explicitly referencing the top-level schema element path with `column.getPath()[0]` in several locations. The top-level schema element path was also used in building the path-to-SchemaElement map, which corrupted the map when the schema contained columns `a` and `b`.`a` (key `a` was used for both schema elements, overwriting the map entry).
   - `ParquetSchema` - `fieldSelected` function replaced with `columnSelected` in order to enable it to work with full paths. Previously, it would fail when the schema contained columns `a` and `b`.`a`, as both schema paths would be marked as selected.
   - `ParquetColumnMetadata` - replaced the top-level path reference with the full path; also, replaced the parameter passed to `ParquetToDrillTypeConverter.toMajorType()` from `se.getType_length()` to `column.getTypeLength()`. The reason is that `se.getType_length()` returns 0 on a FIXED_LEN_BYTE_ARRAY column, causing a subsequent failure in minor type conversion that was failing complex Parquet tests. `column.getTypeLength()` provides the correct result. In fact, I am not sure whether this is a Parquet bug - possibly a TBD item.
   - `AbstractParquetScanBatchCreator` - added a function which uses the `ParquetReaderUtility` functions to identify whether the query column list contains a complex column.
   
   Added tests rely on the existing `complex.parquet` file used in other tests.
   
   ## Level of testing
   Build tests and complex*q query tests from the Drill test framework. Tests were added for the newly introduced methods except for `ParquetReaderUtility.buildFullColumnPath()`.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-06-07 Thread Oleksandr Kalinin (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504566#comment-16504566
 ] 

Oleksandr Kalinin commented on DRILL-5797:
--

Current state:

- several ParquetReaderUtility classes have been updated for nested schema files (this also required a minor addition to SchemaPath - a method to get the full path as a String)
- fieldSelected fixed
- case sensitivity issue fixed
- unit tests and complex.q queries are OK

All in all, it has become a more complex change than the original PR. I am currently cleaning up the code for GitHub upload and initial review.

> Use more often the new parquet reader
> -
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Damien Profeta
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.14.0
>
>
> The choice between the regular Parquet reader and the optimized one is based
> on what types of columns are in the file, but the columns actually read by the
> query are not taken into account. We can slightly increase the number of cases
> where the optimized reader is used by checking whether the projected columns
> are simple or not.
> This is an optimization while waiting for the fast Parquet reader to handle
> complex structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-06-06 Thread Pritesh Maker (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503708#comment-16503708
 ] 

Pritesh Maker commented on DRILL-5797:
--

[~okalinin] do you have further updates on this? We want to try and wrap up the 
Drill 1.14 release by the end of the month.



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-05-15 Thread Arina Ielchiieva (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476023#comment-16476023
 ] 

Arina Ielchiieva commented on DRILL-5797:
-

It looks like ParquetSchema is tailored to work with a flat schema. In our case 
we are passing a complex schema even though we only intend to read simple 
columns. We need to review the ParquetSchema class and determine where fixes are 
needed so that it handles complex field names correctly; so far that is the 
fieldSelected and getColNameToSchemaElementMapping methods, but there could be more.



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-05-15 Thread Oleksandr Kalinin (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475815#comment-16475815
 ] 

Oleksandr Kalinin commented on DRILL-5797:
--

[~arina] Thanks for your feedback. The problem with ParquetSchema seems to be 
that buildSchema() can't be called on a complex schema file, as classes like 
ParquetReaderUtility used by ParquetSchema rely on a flat schema. E.g. 
getColNameToSchemaElementMapping() returns a corrupted/unusable structure if 
called on a nested schema.

In other words, ParquetSchema can only be built and used after ensuring that 
the schema is flat, unless more refactoring-like work is done to support nested 
data (I spotted other locations that explicitly rely on a flat schema).

So to keep things simple I am considering adding a static method to ParquetSchema 
or even ParquetReaderUtility, something like isSuitableForFastReader(), which 
would do the necessary checks based on input parameters (footer, selected 
columns, etc.) and serve as a gate for using the new reader.



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-05-15 Thread Arina Ielchiieva (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475765#comment-16475765
 ] 

Arina Ielchiieva commented on DRILL-5797:
-

[~okalinin] looked at complex12.q as well and reached the same conclusions as 
you did. My understanding is that we need to rewrite the 
{{ParquetSchema.fieldSelected}} method so that it selects the one correct 
column rather than two. Did you try that?



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-04-27 Thread Oleksandr Kalinin (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456529#comment-16456529
 ] 

Oleksandr Kalinin commented on DRILL-5797:
--

Just for the record, further debugging shows how a complex column sneaks into 
ReadState:

(1) `ParquetRecordReader.setup()` triggers ParquetSchema 
buildSchema/loadParquetSchema for column mapping
(2) `ParquetSchema.loadParquetSchema()` uses `ParquetSchema.fieldSelected()` 
for column matching
(3) fieldSelected() takes a MaterializedField as an argument and uses its 
getName() method for column name comparison. For column B.A it returns A.
(4) As a result, column B.A of the file gets positively matched to column 
A and is added to selectedColumnMetadata in the ParquetSchema, which is then 
passed to ReadState
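The mismatch in steps (3)-(4) can be reproduced with a stripped-down sketch. Plain strings stand in for MaterializedField and the file's column paths, and leafName is a hypothetical helper mimicking a getName() that returns only the last path segment:

```java
public class LeafNameMatch {

  // Returns only the last segment of a dotted path, mimicking a name-based
  // comparison that drops the parent path ("b.a" -> "a").
  static String leafName(String path) {
    int i = path.lastIndexOf('.');
    return i < 0 ? path : path.substring(i + 1);
  }

  // Buggy selection: the nested file column "b.a" matches projected column "a"
  // because only the leaf name is compared.
  static boolean fieldSelected(String fileColumn, String projected) {
    return leafName(fileColumn).equalsIgnoreCase(projected);
  }

  // Fixed selection: compare the full path of the file column instead.
  static boolean fieldSelectedFullPath(String fileColumn, String projected) {
    return fileColumn.equalsIgnoreCase(projected);
  }

  public static void main(String[] args) {
    System.out.println(fieldSelected("b.a", "a"));          // true: false positive
    System.out.println(fieldSelectedFullPath("b.a", "a"));  // false
    System.out.println(fieldSelectedFullPath("a", "a"));    // true
  }
}
```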



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-04-26 Thread Pritesh Maker (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455729#comment-16455729
 ] 

Pritesh Maker commented on DRILL-5797:
--

Thanks for looking into the issue, [~okalinin]. I have assigned the issue here.

[~sachouche], [~arina] do you have any suggestions for resolving the second 
issue? 



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-04-26 Thread Oleksandr Kalinin (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454503#comment-16454503
 ] 

Oleksandr Kalinin commented on DRILL-5797:
--

When debugging the complex12.q failure from the list of failing queries above, 
it appears that there is another issue, not related to case sensitivity.

If the file schema has a primitive column A and a repeated column B with a 
nested column A (B.A), then executing the query 'select A from ' leads to the 
following scenario:

(1) the rowGroupScan passed to ParquetScanBatchCreator contains only column A. 
That is correctly handled by the code in the PR, allowing the fast reader
(2) However, the ParquetSchema passed to ReadState will contain both A and B.A, 
which leads to the failure explained above in this JIRA, as B.A is complex

This looks like an additional issue, though not related to the PR code. I could 
also reproduce the case sensitivity issue; investigating both issues currently.



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2018-03-01 Thread Pritesh Maker (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382747#comment-16382747
 ] 

Pritesh Maker commented on DRILL-5797:
--

Changed the fixVersion to 1.14 since there hasn't been any update recently.



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252662#comment-16252662
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user priteshm commented on the issue:

https://github.com/apache/drill/pull/976
  
@dprofeta will you be able to address the issues before the release?




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234749#comment-16234749
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/976
  
Drill follows SQL rules and is case insensitive. If case sensitivity has 
snuck in somewhere (perhaps due to the use of `equals()` rather than 
`equalsIgnoreCase()`, or the use of a case-sensitive map), then we should fix 
that.

Note also that column aliases should not be visible to the Parquet reader.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234473#comment-16234473
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/976
  
Looking at the stack trace:
- The code is definitely initializing a column of type REPEATED
- The fast reader didn't expect this scenario, so it used a default 
container (NullableVarBinary) for the variable-length binary data type

Why is this happening?
- The code in ReadState::buildReader() processes all selected columns
- This information is obtained from the ParquetSchema
- Looking at the code, this seems to be a case-sensitivity issue
- The ParquetSchema is case-insensitive whereas the Parquet GroupType is not
- Damien added a catch handler (column not found) to handle use cases where 
we are projecting non-existing columns
- This leads to an unforeseen use case:
- Assume column XYZ is complex
- The user uses an alias (xyz)
- The new code will let this column pass and treat it as simple
- The ParquetSchema, being case insensitive, will process this column
- and thus the exception in the test suite

Suggested fix
- Create a map (keys lower-cased) and register all current row-group 
columns
- Use this map to locate a selected column's type
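The suggested fix can be sketched roughly as follows, with simplified types: the real code would register org.apache.parquet.schema types from the row group, whereas here type descriptions are plain strings, and the class and method names are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

public class RowGroupColumnTypes {

  private final Map<String, String> typeByLowerName = new HashMap<>();

  // Register every column of the current row group under its lower-cased name,
  // making subsequent lookups case-insensitive like the rest of Drill.
  public RowGroupColumnTypes(Map<String, String> rowGroupColumns) {
    for (Map.Entry<String, String> e : rowGroupColumns.entrySet()) {
      typeByLowerName.put(e.getKey().toLowerCase(), e.getValue());
    }
  }

  // Locate a selected column's type regardless of the case used in the query,
  // so an aliased projection like "xyz" still finds the complex column "XYZ".
  public String typeOf(String selectedColumn) {
    return typeByLowerName.get(selectedColumn.toLowerCase());
  }

  public static void main(String[] args) {
    Map<String, String> cols = new HashMap<>();
    cols.put("XYZ", "repeated group"); // complex column in the file
    RowGroupColumnTypes lookup = new RowGroupColumnTypes(cols);
    System.out.println(lookup.typeOf("xyz")); // repeated group
  }
}
```

With such a lookup, the lower-cased projection "xyz" resolves to the complex file column "XYZ" and can be rejected by the fast-reader gate instead of being silently treated as simple.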





[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225786#comment-16225786
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/976
  
@dprofeta, tried to commit this PR, but ran into multiple functional test 
failures:

```
Execution Failures:

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex12.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex8.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex56.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex274.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex7.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex57.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex102.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex5.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex10.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex9.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex203.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex101.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex275.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex6.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex205.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex11.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex58.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex153.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex202.q

/root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex151.q
```

The common failure stack trace seems to be:

```

org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleException():272

org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.setup():256
org.apache.drill.exec.physical.impl.ScanBatch.getNextReaderIfHas():241
org.apache.drill.exec.physical.impl.ScanBatch.next():167
...
``` 




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224320#comment-16224320
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/976
  
+1
looks good!




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224319#comment-16224319
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user priteshm commented on the issue:

https://github.com/apache/drill/pull/976
  
@sachouche can you please take a final look? If it looks good, maybe one of 
the committers can include this for the 1.12 release. @arina-ielchiieva ?




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207126#comment-16207126
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user dprofeta commented on the issue:

https://github.com/apache/drill/pull/976
  
I updated the javadoc with Paul's remarks.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206278#comment-16206278
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r144074677
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -156,20 +161,46 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
 return new ScanBatch(rowGroupScan, context, oContext, readers, 
implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-MessageType schema = footer.getFileMetaData().getSchema();
-
-for (Type type : schema.getFields()) {
-  if (!type.isPrimitive()) {
-return true;
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+/*
+ParquetRecordReader is not able to read any nested columns and is not 
able to handle repeated columns.
+It only handles flat column and optional column.
+If it is a wildcard query, we check every columns in the metadata.
+If not, we only check the projected columns.
+*/
--- End diff --

Very small request: this is a great Javadoc comment, so please use this 
form:

```
/**
 * Your comment here.
 */
```

It may also be worth pointing out that the algorithm here works regardless 
of the form of the column:

* `a`: Must consider the type of column `a` in Parquet.
* `a.b`: The top level column `a` must be a structure in Parquet. (If not, 
then presumably an error is thrown later on.) So, no need to check `b`.
* `a[10]`: The column `a` must be an array (repeated), so no need to check 
the column `SchemaPath` itself. Again, presumably, Drill will throw an error 
internally if it turns out that `a` is not an array.
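In other words, only the root segment of the projected path needs a type check. A small sketch of extracting that root from a textual path (a hypothetical helper; in Drill this role is played by `SchemaPath.getRootSegment()`):

```java
public class RootSegment {

  // Extracts the root segment of a projected column path: the part before the
  // first '.' (member access) or '[' (array index). Checking the type of this
  // root in the Parquet schema decides simple vs. complex for the whole path.
  static String rootOf(String path) {
    int end = path.length();
    int dot = path.indexOf('.');
    if (dot >= 0) {
      end = Math.min(end, dot);
    }
    int bracket = path.indexOf('[');
    if (bracket >= 0) {
      end = Math.min(end, bracket);
    }
    return path.substring(0, end);
  }

  public static void main(String[] args) {
    System.out.println(rootOf("a"));     // a
    System.out.println(rootOf("a.b"));   // a
    System.out.println(rootOf("a[10]")); // a
  }
}
```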




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206021#comment-16206021
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user dprofeta commented on the issue:

https://github.com/apache/drill/pull/976
  
Here is the updated PR.
Yes, I also wanted to add groups without repetition. It is only a matter of 
naming, so it should not be hard, but when I tested it, the fast reader was not 
able to handle it.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199284#comment-16199284
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r143837353
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
 return new ScanBatch(rowGroupScan, context, oContext, readers, 
implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-MessageType schema = footer.getFileMetaData().getSchema();
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+if (Utilities.isStarQuery(columns)) {
+  MessageType schema = footer.getFileMetaData().getSchema();
 
-for (Type type : schema.getFields()) {
-  if (!type.isPrimitive()) {
-return true;
+  for (Type type : schema.getFields()) {
+if (!type.isPrimitive()) {
+  return true;
+}
   }
-}
-for (ColumnDescriptor col : schema.getColumns()) {
-  if (col.getMaxRepetitionLevel() > 0) {
-return true;
+  for (ColumnDescriptor col : schema.getColumns()) {
+if (col.getMaxRepetitionLevel() > 0) {
+  return true;
+}
+  }
+  return false;
+} else {
+  for (SchemaPath column : columns) {
+if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) 
{
--- End diff --

Can you please use the already defined schema variable instead of invoking 
"footer.getFileMetaData().getSchema()" multiple times.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196623#comment-16196623
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user dprofeta commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r143403657
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
 return new ScanBatch(rowGroupScan, context, oContext, readers, 
implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-MessageType schema = footer.getFileMetaData().getSchema();
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+if (Utilities.isStarQuery(columns)) {
--- End diff --

Added in a new commit.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196621#comment-16196621
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user dprofeta commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r143403559
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-    MessageType schema = footer.getFileMetaData().getSchema();
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+    if (Utilities.isStarQuery(columns)) {
+      MessageType schema = footer.getFileMetaData().getSchema();
 
-    for (Type type : schema.getFields()) {
-      if (!type.isPrimitive()) {
-        return true;
+      for (Type type : schema.getFields()) {
+        if (!type.isPrimitive()) {
+          return true;
+        }
       }
-    }
-    for (ColumnDescriptor col : schema.getColumns()) {
-      if (col.getMaxRepetitionLevel() > 0) {
-        return true;
+      for (ColumnDescriptor col : schema.getColumns()) {
+        if (col.getMaxRepetitionLevel() > 0) {
+          return true;
+        }
+      }
+      return false;
+    } else {
+      for (SchemaPath column : columns) {
+        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
+          return true;
+        }
       }
+      return false;
+    }
+  }
+
+  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
+    PathSegment.NameSegment root = column.getRootSegment();
+    if (!grouptype.containsField(root.getPath().toLowerCase())) {
+      return false;
+    }
+    Type type = grouptype.getType(root.getPath().toLowerCase());
+    if (type.isRepetition(Type.Repetition.REPEATED) || !type.isPrimitive()) {
--- End diff --

Yes, sure. I wanted to check it in a loop first, but ParquetRecordReader 
doesn't handle any nested type, so the loop is not needed now. But I didn't 
refactor enough.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196619#comment-16196619
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user dprofeta commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r143403232
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-    MessageType schema = footer.getFileMetaData().getSchema();
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+    if (Utilities.isStarQuery(columns)) {
+      MessageType schema = footer.getFileMetaData().getSchema();
 
-    for (Type type : schema.getFields()) {
-      if (!type.isPrimitive()) {
-        return true;
+      for (Type type : schema.getFields()) {
+        if (!type.isPrimitive()) {
+          return true;
+        }
       }
-    }
-    for (ColumnDescriptor col : schema.getColumns()) {
-      if (col.getMaxRepetitionLevel() > 0) {
-        return true;
+      for (ColumnDescriptor col : schema.getColumns()) {
+        if (col.getMaxRepetitionLevel() > 0) {
+          return true;
+        }
+      }
+      return false;
+    } else {
+      for (SchemaPath column : columns) {
+        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
+          return true;
+        }
       }
+      return false;
+    }
+  }
+
+  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
+    PathSegment.NameSegment root = column.getRootSegment();
+    if (!grouptype.containsField(root.getPath().toLowerCase())) {
+      return false;
+    }
+    Type type = grouptype.getType(root.getPath().toLowerCase());
--- End diff --

OK for `getType()`: it throws an exception, so I will catch it.
I don't see any `getName()` in the `SchemaPath` / `PathSegment` classes. Can 
you tell me which `getName()` you mean?




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194998#comment-16194998
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r143263740
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-    MessageType schema = footer.getFileMetaData().getSchema();
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+    if (Utilities.isStarQuery(columns)) {
+      MessageType schema = footer.getFileMetaData().getSchema();
 
-    for (Type type : schema.getFields()) {
-      if (!type.isPrimitive()) {
-        return true;
+      for (Type type : schema.getFields()) {
+        if (!type.isPrimitive()) {
+          return true;
+        }
       }
-    }
-    for (ColumnDescriptor col : schema.getColumns()) {
-      if (col.getMaxRepetitionLevel() > 0) {
-        return true;
+      for (ColumnDescriptor col : schema.getColumns()) {
+        if (col.getMaxRepetitionLevel() > 0) {
+          return true;
+        }
+      }
+      return false;
+    } else {
+      for (SchemaPath column : columns) {
+        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
+          return true;
+        }
       }
+      return false;
+    }
+  }
+
+  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
+    PathSegment.NameSegment root = column.getRootSegment();
+    if (!grouptype.containsField(root.getPath().toLowerCase())) {
+      return false;
+    }
+    Type type = grouptype.getType(root.getPath().toLowerCase());
+    if (type.isRepetition(Type.Repetition.REPEATED) || !type.isPrimitive()) {
--- End diff --

`return type.isRepetition(Type.Repetition.REPEATED) || !type.isPrimitive();`
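The suggested refactoring can be shown in a runnable form. The `StubType` class below is a hypothetical stand-in for Parquet's `org.apache.parquet.schema.Type`, included only so the snippet is self-contained; it is not the real Parquet API.

```java
// Hypothetical stand-in for org.apache.parquet.schema.Type, only so the
// snippet compiles on its own; not the real Parquet API.
class StubType {
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }
    private final Repetition repetition;
    private final boolean primitive;
    StubType(Repetition repetition, boolean primitive) {
        this.repetition = repetition;
        this.primitive = primitive;
    }
    boolean isRepetition(Repetition r) { return repetition == r; }
    boolean isPrimitive() { return primitive; }
}

public class ReturnSimplification {
    // Before: if (cond) { return true; } ... return false;
    // After, per the review suggestion: return the condition directly.
    static boolean isColumnComplex(StubType type) {
        return type.isRepetition(StubType.Repetition.REPEATED) || !type.isPrimitive();
    }

    public static void main(String[] args) {
        StubType repeatedInt = new StubType(StubType.Repetition.REPEATED, true);
        StubType optionalGroup = new StubType(StubType.Repetition.OPTIONAL, false);
        StubType requiredInt = new StubType(StubType.Repetition.REQUIRED, true);
        System.out.println(isColumnComplex(repeatedInt));   // true: repeated column
        System.out.println(isColumnComplex(optionalGroup)); // true: non-primitive
        System.out.println(isColumnComplex(requiredInt));   // false: simple column
    }
}
```

Collapsing the `if`/`return true` shape into a single `return` of the condition removes the redundant branch without changing behavior.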




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194997#comment-16194997
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r143263332
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-    MessageType schema = footer.getFileMetaData().getSchema();
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+    if (Utilities.isStarQuery(columns)) {
--- End diff --

Perhaps a comment with some explanation? If wildcard query, we query all 
columns, so check if any of them are complex. If project list, then check only 
the projected columns.




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194996#comment-16194996
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r143264522
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-    MessageType schema = footer.getFileMetaData().getSchema();
+  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+    if (Utilities.isStarQuery(columns)) {
+      MessageType schema = footer.getFileMetaData().getSchema();
 
-    for (Type type : schema.getFields()) {
-      if (!type.isPrimitive()) {
-        return true;
+      for (Type type : schema.getFields()) {
+        if (!type.isPrimitive()) {
+          return true;
+        }
       }
-    }
-    for (ColumnDescriptor col : schema.getColumns()) {
-      if (col.getMaxRepetitionLevel() > 0) {
-        return true;
+      for (ColumnDescriptor col : schema.getColumns()) {
+        if (col.getMaxRepetitionLevel() > 0) {
+          return true;
+        }
+      }
+      return false;
+    } else {
+      for (SchemaPath column : columns) {
+        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
+          return true;
+        }
       }
+      return false;
+    }
+  }
+
+  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
+    PathSegment.NameSegment root = column.getRootSegment();
+    if (!grouptype.containsField(root.getPath().toLowerCase())) {
+      return false;
+    }
+    Type type = grouptype.getType(root.getPath().toLowerCase());
--- End diff --

What are the semantics of `getType()`? Does it return null if there is no 
such type? If so, then we can retrieve the type and check if it is null. 
Otherwise, if it throws an exception, perhaps we can catch that. Either way, we 
avoid two searches for the same column and two conversions of the path to lower 
case.

Note also that a recent change deprecated `getPath()`. The preferred form 
is `getName()` so it is clear that we are getting a single name part. If the 
Parquet column is nested (a.b.c, say), then we have to explicitly handle the 
multiple name parts as Drill does support names with dots and one cannot simply 
concatenate or split a path to get a string. That is, `["a.b", "c"]` is two 
fields, `["a", "b", "c"]` is three, but both are represented as a full path as 
"a.b.c", creating an ambiguity.
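The ambiguity can be demonstrated in a few lines of Java. The `flatten` helper below is a hypothetical illustration, not Drill's `SchemaPath` API:

```java
import java.util.Arrays;
import java.util.List;

// Flattening name parts with '.' maps two different column paths onto the
// same string, which is why the parts must be kept separate.
public class PathAmbiguity {
    static String flatten(List<String> parts) {
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        // A field literally named "a.b" with a child "c" ...
        List<String> twoParts = Arrays.asList("a.b", "c");
        // ... versus three nested fields a -> b -> c.
        List<String> threeParts = Arrays.asList("a", "b", "c");
        System.out.println(flatten(twoParts));   // a.b.c
        System.out.println(flatten(threeParts)); // a.b.c
    }
}
```

Both paths print the same string, so a flat path cannot be split back into its name parts unambiguously.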




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194772#comment-16194772
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/976
  
Sure!


Regards,

Salim


From: dprofeta 
Sent: Friday, October 6, 2017 8:52:51 AM
To: apache/drill
Cc: Salim Achouche; Mention
Subject: Re: [apache/drill] DRILL-5797: Choose parquet reader from read 
columns (#976)


@sachouche Can you review it?






[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194760#comment-16194760
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

Github user dprofeta commented on the issue:

https://github.com/apache/drill/pull/976
  
@sachouche Can you review it?




[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-06 Thread Damien Profeta (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194310#comment-16194310
 ] 

Damien Profeta commented on DRILL-5797:
---

One improvement over this user story would be to enable the ParquetRecordReader 
to read nested data that is not repeated. However, today the ParquetRecordReader 
is not able to read these columns because it cannot resolve the name (it only 
looks at the last part of it). If it is worth it, I will open another user story 
to improve that.



[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

2017-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194306#comment-16194306
 ] 

ASF GitHub Bot commented on DRILL-5797:
---

GitHub user dprofeta opened a pull request:

https://github.com/apache/drill/pull/976

DRILL-5797: Choose parquet reader from read columns

ParquetRecordReader is not able to read complex columns. However, it is
able to read simple columns in a file containing complex
columns. Instead of looking at the file to choose the reader, we
now choose which reader to use based on which columns are asked for
and whether they are simple or not.
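The selection rule the PR describes can be sketched as follows. The column representation, helper names, and string return values are simplified stand-ins inspired by the PR, not Drill's actual classes:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the reader-selection rule: fall back to the slower complex-type
// reader only when a column that will actually be read is complex.
public class ReaderChoice {
    static final class Column {
        final String name;
        final boolean complex;
        Column(String name, boolean complex) {
            this.name = name;
            this.complex = complex;
        }
    }

    static boolean isStarQuery(List<String> projected) {
        return projected.contains("*");
    }

    static String chooseReader(List<Column> fileSchema, List<String> projected) {
        for (Column c : fileSchema) {
            // A star query reads every column; otherwise only projected ones.
            boolean read = isStarQuery(projected) || projected.contains(c.name);
            if (read && c.complex) {
                return "DrillParquetReader";   // handles complex types
            }
        }
        return "ParquetRecordReader";          // fast reader, simple columns only
    }

    public static void main(String[] args) {
        List<Column> schema = Arrays.asList(
            new Column("id", false), new Column("payload", true));
        // Only a simple column is projected: the fast reader applies even
        // though the file contains a complex column.
        System.out.println(chooseReader(schema, Arrays.asList("id")));  // ParquetRecordReader
        // A star query reads every column, so the complex one forces the fallback.
        System.out.println(chooseReader(schema, Arrays.asList("*")));   // DrillParquetReader
    }
}
```

The point of the change is visible in the first call: the file still contains a complex column, but because the query does not project it, the fast reader can be used.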

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dprofeta/drill DRILL-5797

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/976.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #976


commit 9669dd2c0c61e56c76bc9939c4f1c01aab908baf
Author: Damien Profeta 
Date:   2017-10-06T08:40:22Z

DRILL-5797: Choose parquet reader from read columns

ParquetRecordReader is not able to read complex columns. However, it is
able to read simple columns in a file containing complex
columns. Instead of looking at the file to choose the reader, we
now choose which reader to use based on which columns are asked for
and whether they are simple or not.



