[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046700#comment-17046700
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> HDF5 Metadata Queries Fail with Large Files
> ---
>
> Key: DRILL-7578
> URL: https://issues.apache.org/jira/browse/DRILL-7578
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
>
> With large files, Drill runs out of memory when attempting to project large
> datasets in metadata queries.
> This PR adds a configuration option that removes the dataset projection from
> metadata queries, which fixes the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046624#comment-17046624
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-591972210
 
 
   @arina-ielchiieva @paul-rogers 
   Thanks for the review!
   
   Rebased and squashed.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046601#comment-17046601
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on issue #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-591961582
 
 
   @cgivre please squash the commits and rebase on the latest master. Thanks!
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046206#comment-17046206
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384933724
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1032,11 +1081,11 @@ private void getAndMapCompoundData(String path, List fieldNames, IHDF5Re
 
       SchemaBuilder innerSchema = new SchemaBuilder();
       MapBuilder mapBuilder = innerSchema.addMap(COMPOUND_DATA_FIELD_NAME);
 -
       for (HDF5CompoundMemberInformation info : infos) {
         fieldNames.add(info.getName());
 +        String compoundColumnDataType = info.getType().tryGetJavaType().getSimpleName();
 
 -        switch (info.getType().tryGetJavaType().getSimpleName()) {
 +        switch (compoundColumnDataType) {
 
 Review comment:
   Ah, more complex than a quick glance at the code suggested. OK, let's leave 
it.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046130#comment-17046130
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384902654
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1032,11 +1081,11 @@ private void getAndMapCompoundData(String path, List fieldNames, IHDF5Re
 
       SchemaBuilder innerSchema = new SchemaBuilder();
       MapBuilder mapBuilder = innerSchema.addMap(COMPOUND_DATA_FIELD_NAME);
 -
       for (HDF5CompoundMemberInformation info : infos) {
         fieldNames.add(info.getName());
 +        String compoundColumnDataType = info.getType().tryGetJavaType().getSimpleName();
 
 -        switch (info.getType().tryGetJavaType().getSimpleName()) {
 +        switch (compoundColumnDataType) {
 
 Review comment:
   If you think that's better I can change it. 
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046127#comment-17046127
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384901864
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1032,11 +1081,11 @@ private void getAndMapCompoundData(String path, List fieldNames, IHDF5Re
 
       SchemaBuilder innerSchema = new SchemaBuilder();
       MapBuilder mapBuilder = innerSchema.addMap(COMPOUND_DATA_FIELD_NAME);
 -
       for (HDF5CompoundMemberInformation info : infos) {
         fieldNames.add(info.getName());
 +        String compoundColumnDataType = info.getType().tryGetJavaType().getSimpleName();
 
 -        switch (info.getType().tryGetJavaType().getSimpleName()) {
 +        switch (compoundColumnDataType) {
 
 Review comment:
   The `getType()` is actually a poorly named method in that it doesn't return 
the type, but rather an `HDF5DataTypeInformation` object[1]. Once you have this 
object, you still have to call one of the various methods like 
`tryGetJavaType()` to actually get a data type that is useful. 
   
   Bottom line, you'd still have to call all that somewhere and either use a 
switch statement or if statements to process the data type.
   
   [1]: 
http://svnsis.ethz.ch/doc/openbis/S175.0/ch/systemsx/cisd/hdf5/HDF5DataTypeInformation.html
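As a rough illustration of the two-step lookup described above: once `tryGetJavaType()` has yielded a `java.lang.Class`, the reader still dispatches on its simple name. The sketch below is hypothetical (the type mapping is illustrative, not Drill's actual table); only `getSimpleName()` on a `Class` is standard Java.

```java
public class TypeDispatchSketch {

  // Stand-in for the dispatch performed after tryGetJavaType() returns a Class:
  // switch on the class's simple name to pick a writer.
  static String describe(Class<?> javaType) {
    String simpleName = javaType.getSimpleName();
    switch (simpleName) {
      case "Integer":
      case "Short":
      case "Byte":
        return "write as INT";
      case "Long":
        return "write as BIGINT";
      case "Float":
      case "Double":
        return "write as FLOAT8";
      case "String":
        return "write as VARCHAR";
      default:
        return "unsupported: " + simpleName;
    }
  }

  public static void main(String[] args) {
    System.out.println(describe(Integer.class)); // prints "write as INT"
    System.out.println(describe(java.util.BitSet.class)); // prints "unsupported: BitSet"
  }
}
```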
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046125#comment-17046125
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384901864
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1032,11 +1081,11 @@ private void getAndMapCompoundData(String path, List fieldNames, IHDF5Re
 
       SchemaBuilder innerSchema = new SchemaBuilder();
       MapBuilder mapBuilder = innerSchema.addMap(COMPOUND_DATA_FIELD_NAME);
 -
       for (HDF5CompoundMemberInformation info : infos) {
         fieldNames.add(info.getName());
 +        String compoundColumnDataType = info.getType().tryGetJavaType().getSimpleName();
 
 -        switch (info.getType().tryGetJavaType().getSimpleName()) {
 +        switch (compoundColumnDataType) {
 
 Review comment:
   The `getType()` is actually a poorly named method in that it doesn't return 
the type, but rather an `HDF5DataTypeInformation` object. [1]. Once you have 
this object, you still have to call one of the various methods like 
`tryGetJavaType()` to actually get a data type that is useful. 
   
   [1]: 
http://svnsis.ethz.ch/doc/openbis/S175.0/ch/systemsx/cisd/hdf5/HDF5DataTypeInformation.html
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046122#comment-17046122
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384900801
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1118,30 @@ private void getAndMapCompoundData(String path, List fieldNames, IHDF5Re
       for (int col = 0; col < values[row].length; col++) {
         assert fieldNames != null;
         currentFieldName = fieldNames.get(col);
 -        ArrayWriter innerWriter = listWriter.array(currentFieldName);
 -        if (values[row][col] instanceof Integer) {
 -          innerWriter.scalar().setInt((Integer) values[row][col]);
 -        } else if (values[row][col] instanceof Short) {
 -          innerWriter.scalar().setInt((Short) values[row][col]);
 -        } else if (values[row][col] instanceof Byte) {
 -          innerWriter.scalar().setInt((Byte) values[row][col]);
 -        } else if (values[row][col] instanceof Long) {
 -          innerWriter.scalar().setLong((Long) values[row][col]);
 -        } else if (values[row][col] instanceof Float) {
 -          innerWriter.scalar().setDouble((Float) values[row][col]);
 -        } else if (values[row][col] instanceof Double) {
 -          innerWriter.scalar().setDouble((Double) values[row][col]);
 -        } else if (values[row][col] instanceof BitSet || values[row][col] instanceof Boolean) {
 -          innerWriter.scalar().setBoolean((Boolean) values[row][col]);
 -        } else if (values[row][col] instanceof String) {
 -          innerWriter.scalar().setString((String) values[row][col]);
 -        }
 -        if (col == values[row].length) {
 -          innerWriter.save();
 +        try {
 +          ArrayWriter innerWriter = listWriter.array(currentFieldName);
 +          if (values[row][col] instanceof Integer) {
 +            innerWriter.scalar().setInt((Integer) values[row][col]);
 +          } else if (values[row][col] instanceof Short) {
 +            innerWriter.scalar().setInt((Short) values[row][col]);
 +          } else if (values[row][col] instanceof Byte) {
 +            innerWriter.scalar().setInt((Byte) values[row][col]);
 +          } else if (values[row][col] instanceof Long) {
 +            innerWriter.scalar().setLong((Long) values[row][col]);
 +          } else if (values[row][col] instanceof Float) {
 +            innerWriter.scalar().setDouble((Float) values[row][col]);
 +          } else if (values[row][col] instanceof Double) {
 +            innerWriter.scalar().setDouble((Double) values[row][col]);
 +          } else if (values[row][col] instanceof BitSet || values[row][col] instanceof Boolean) {
 +            innerWriter.scalar().setBoolean((Boolean) values[row][col]);
 +          } else if (values[row][col] instanceof String) {
 +            innerWriter.scalar().setString((String) values[row][col]);
 +          }
 
 Review comment:
   I don't want to throw an exception here, but I'll send a warning to the 
logger instead. 
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046120#comment-17046120
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384899948
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1118,30 @@ private void getAndMapCompoundData(String path, List fieldNames, IHDF5Re
       for (int col = 0; col < values[row].length; col++) {
         assert fieldNames != null;
         currentFieldName = fieldNames.get(col);
 -        ArrayWriter innerWriter = listWriter.array(currentFieldName);
 -        if (values[row][col] instanceof Integer) {
 -          innerWriter.scalar().setInt((Integer) values[row][col]);
 -        } else if (values[row][col] instanceof Short) {
 -          innerWriter.scalar().setInt((Short) values[row][col]);
 -        } else if (values[row][col] instanceof Byte) {
 -          innerWriter.scalar().setInt((Byte) values[row][col]);
 -        } else if (values[row][col] instanceof Long) {
 -          innerWriter.scalar().setLong((Long) values[row][col]);
 -        } else if (values[row][col] instanceof Float) {
 -          innerWriter.scalar().setDouble((Float) values[row][col]);
 -        } else if (values[row][col] instanceof Double) {
 -          innerWriter.scalar().setDouble((Double) values[row][col]);
 -        } else if (values[row][col] instanceof BitSet || values[row][col] instanceof Boolean) {
 -          innerWriter.scalar().setBoolean((Boolean) values[row][col]);
 -        } else if (values[row][col] instanceof String) {
 -          innerWriter.scalar().setString((String) values[row][col]);
 -        }
 -        if (col == values[row].length) {
 -          innerWriter.save();
 +        try {
 +          ArrayWriter innerWriter = listWriter.array(currentFieldName);
 +          if (values[row][col] instanceof Integer) {
 +            innerWriter.scalar().setInt((Integer) values[row][col]);
 +          } else if (values[row][col] instanceof Short) {
 +            innerWriter.scalar().setInt((Short) values[row][col]);
 +          } else if (values[row][col] instanceof Byte) {
 +            innerWriter.scalar().setInt((Byte) values[row][col]);
 +          } else if (values[row][col] instanceof Long) {
 +            innerWriter.scalar().setLong((Long) values[row][col]);
 +          } else if (values[row][col] instanceof Float) {
 +            innerWriter.scalar().setDouble((Float) values[row][col]);
 +          } else if (values[row][col] instanceof Double) {
 +            innerWriter.scalar().setDouble((Double) values[row][col]);
 +          } else if (values[row][col] instanceof BitSet || values[row][col] instanceof Boolean) {
 +            innerWriter.scalar().setBoolean((Boolean) values[row][col]);
 +          } else if (values[row][col] instanceof String) {
 +            innerWriter.scalar().setString((String) values[row][col]);
 +          }
 +          if (col == values[row].length) {
 +            innerWriter.save();
 +          }
 +        } catch (Exception e) {
 +          logger.warn("Drill does not support maps and lists in HDF5 Compound fields. Skipping: {}/{}", resolvedPath, currentFieldName);
 
 Review comment:
   Now catching a more specific error, specifically 
`TupleWriter.UndefinedColumnException`.
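The narrowing described above follows the usual log-and-skip pattern. A minimal sketch, assuming a local stand-in for Drill's `TupleWriter.UndefinedColumnException` (the writer method and column names below are hypothetical); narrowing the catch means broad failures such as NPE or OOM still propagate:

```java
public class NarrowCatchSketch {

  // Local stand-in for TupleWriter.UndefinedColumnException.
  static class UndefinedColumnException extends RuntimeException {
    UndefinedColumnException(String column) { super(column); }
  }

  // Hypothetical writer: rejects column types the reader cannot project.
  static void writeColumn(String column) {
    if (column.startsWith("compound_map_")) {
      throw new UndefinedColumnException(column);
    }
  }

  public static void main(String[] args) {
    for (String column : new String[] {"int_field", "compound_map_field"}) {
      try {
        writeColumn(column);
        System.out.println("wrote " + column);
      } catch (UndefinedColumnException e) {
        // Skip just this column instead of failing the whole query.
        System.out.println("skipped " + e.getMessage());
      }
    }
  }
}
```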
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046107#comment-17046107
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384895214
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -92,6 +93,18 @@
 
   private static final String LONG_COLUMN_NAME = "long_data";
 
+  private static final String DATA_SIZE_COLUMN_NAME = "data_size";
+
+  private static final String ELEMENT_COUNT_NAME = "element_count";
+
+  private static final String DATASET_DATA_TYPE_NAME = "dataset_data_type";
+
+  private static final String DIMENSIONS_FIELD_NAME = "dimensions";
+
+  private static final int PREVIEW_ROW_LIMIT = 20;
+
+  private static final int MAX_DATASET_SIZE = 16777216; // 16MB
 
 Review comment:
   Fixed
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046106#comment-17046106
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384895142
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -628,8 +672,8 @@ private void writeIntListColumn(TupleWriter rowWriter, String name, int[] list)
     }
 
     ScalarWriter arrayWriter = rowWriter.column(index).array().scalar();
 -    for (int value : list) {
 -      arrayWriter.setInt(value);
 +    for (int i = 0; (i < list.length && i < PREVIEW_ROW_LIMIT); i++) {
 
 Review comment:
   Fixed
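The bounded loop in the diff above can be sketched in isolation. This is an illustrative sketch, not Drill's writer API: `preview` is a hypothetical method, and `PREVIEW_ROW_LIMIT = 20` comes from the constants hunk quoted elsewhere in the thread.

```java
import java.util.Arrays;

public class PreviewLimitSketch {
  static final int PREVIEW_ROW_LIMIT = 20;

  // Copies at most PREVIEW_ROW_LIMIT values, mirroring the loop condition
  // (i < list.length && i < PREVIEW_ROW_LIMIT) used in the fix.
  static int[] preview(int[] list) {
    int[] out = new int[Math.min(list.length, PREVIEW_ROW_LIMIT)];
    for (int i = 0; i < list.length && i < PREVIEW_ROW_LIMIT; i++) {
      out[i] = list[i];
    }
    return out;
  }

  public static void main(String[] args) {
    int[] big = new int[100];
    Arrays.fill(big, 7);
    System.out.println(preview(big).length);        // prints 20
    System.out.println(preview(new int[3]).length); // prints 3
  }
}
```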
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046103#comment-17046103
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384894022
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -946,7 +990,11 @@ private void writeAttributes(TupleWriter rowWriter, HDF5DrillMetadata record) {
           writeLongColumn(mapWriter, key, (Long) attrib.getValue());
           break;
         case INT:
 -          writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
 +          try {
 +            writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
 +          } catch (Exception e) {
 +            logger.warn("{} {}", key, attrib);
 
 Review comment:
   I'm not sure why this try/catch block was here but I removed it entirely.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046035#comment-17046035
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384866829
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -92,6 +93,18 @@
 
   private static final String LONG_COLUMN_NAME = "long_data";
 
+  private static final String DATA_SIZE_COLUMN_NAME = "data_size";
+
+  private static final String ELEMENT_COUNT_NAME = "element_count";
+
+  private static final String DATASET_DATA_TYPE_NAME = "dataset_data_type";
+
+  private static final String DIMENSIONS_FIELD_NAME = "dimensions";
+
+  private static final int PREVIEW_ROW_LIMIT = 20;
+
+  private static final int MAX_DATASET_SIZE = 16777216; // 16MB
 
 Review comment:
   Nit: this can be better expressed as `16 * 1024 * 1024`. Or, since you are 
mimicking the max buffer size, set your constant to 
`ValueVector.MAX_BUFFER_SIZE`.
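The reviewer's suggestion can be shown with a tiny sketch. The class name below is hypothetical, and `ValueVector.MAX_BUFFER_SIZE` (the Drill constant the review mentions) is not redefined here; this only demonstrates the readable spelling of the limit.

```java
public class DatasetSizeLimit {
  // Expressing the limit as a product makes the intent (16 MiB) obvious,
  // where the bare literal 16777216 needs a comment to explain itself.
  static final int MAX_DATASET_SIZE = 16 * 1024 * 1024;

  public static void main(String[] args) {
    // The compiler folds the expression, so both forms are identical at runtime.
    System.out.println(MAX_DATASET_SIZE == 16777216); // prints "true"
  }
}
```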
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046039#comment-17046039
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384868605
 
 

 ##
 File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1118,30 @@ private void getAndMapCompoundData(String path, List fieldNames, IHDF5Re
       for (int col = 0; col < values[row].length; col++) {
         assert fieldNames != null;
         currentFieldName = fieldNames.get(col);
 -        ArrayWriter innerWriter = listWriter.array(currentFieldName);
 -        if (values[row][col] instanceof Integer) {
 -          innerWriter.scalar().setInt((Integer) values[row][col]);
 -        } else if (values[row][col] instanceof Short) {
 -          innerWriter.scalar().setInt((Short) values[row][col]);
 -        } else if (values[row][col] instanceof Byte) {
 -          innerWriter.scalar().setInt((Byte) values[row][col]);
 -        } else if (values[row][col] instanceof Long) {
 -          innerWriter.scalar().setLong((Long) values[row][col]);
 -        } else if (values[row][col] instanceof Float) {
 -          innerWriter.scalar().setDouble((Float) values[row][col]);
 -        } else if (values[row][col] instanceof Double) {
 -          innerWriter.scalar().setDouble((Double) values[row][col]);
 -        } else if (values[row][col] instanceof BitSet || values[row][col] instanceof Boolean) {
 -          innerWriter.scalar().setBoolean((Boolean) values[row][col]);
 -        } else if (values[row][col] instanceof String) {
 -          innerWriter.scalar().setString((String) values[row][col]);
 -        }
 -        if (col == values[row].length) {
 -          innerWriter.save();
 +        try {
 +          ArrayWriter innerWriter = listWriter.array(currentFieldName);
 +          if (values[row][col] instanceof Integer) {
 +            innerWriter.scalar().setInt((Integer) values[row][col]);
 +          } else if (values[row][col] instanceof Short) {
 +            innerWriter.scalar().setInt((Short) values[row][col]);
 +          } else if (values[row][col] instanceof Byte) {
 +            innerWriter.scalar().setInt((Byte) values[row][col]);
 +          } else if (values[row][col] instanceof Long) {
 +            innerWriter.scalar().setLong((Long) values[row][col]);
 +          } else if (values[row][col] instanceof Float) {
 +            innerWriter.scalar().setDouble((Float) values[row][col]);
 +          } else if (values[row][col] instanceof Double) {
 +            innerWriter.scalar().setDouble((Double) values[row][col]);
 +          } else if (values[row][col] instanceof BitSet || values[row][col] instanceof Boolean) {
 +            innerWriter.scalar().setBoolean((Boolean) values[row][col]);
 +          } else if (values[row][col] instanceof String) {
 +            innerWriter.scalar().setString((String) values[row][col]);
 +          }
 +          if (col == values[row].length) {
 +            innerWriter.save();
 +          }
 +        } catch (Exception e) {
 +          logger.warn("Drill does not support maps and lists in HDF5 Compound fields. Skipping: {}/{}", resolvedPath, currentFieldName);
 
 Review comment:
   Are we guaranteed that the only exception is for this one use case? (No NPE, 
OOM, etc.?) Or, should we catch a more specific error?
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046040#comment-17046040
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384868442
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1118,30 @@ private void getAndMapCompoundData(String path, 
List<String> fieldNames, IHDF5Re
   for (int col = 0; col < values[row].length; col++) {
 assert fieldNames != null;
 currentFieldName = fieldNames.get(col);
-ArrayWriter innerWriter = listWriter.array(currentFieldName);
-if (values[row][col] instanceof Integer) {
-  innerWriter.scalar().setInt((Integer) values[row][col]);
-} else if (values[row][col] instanceof Short) {
-  innerWriter.scalar().setInt((Short) values[row][col]);
-} else if (values[row][col] instanceof Byte) {
-  innerWriter.scalar().setInt((Byte) values[row][col]);
-} else if (values[row][col] instanceof Long) {
-  innerWriter.scalar().setLong((Long) values[row][col]);
-} else if (values[row][col] instanceof Float) {
-  innerWriter.scalar().setDouble((Float) values[row][col]);
-} else if (values[row][col] instanceof Double) {
-  innerWriter.scalar().setDouble((Double) values[row][col]);
-} else if (values[row][col] instanceof BitSet || values[row][col] 
instanceof Boolean) {
-  innerWriter.scalar().setBoolean((Boolean) values[row][col]);
-} else if (values[row][col] instanceof String) {
-  innerWriter.scalar().setString((String) values[row][col]);
-}
-if (col == values[row].length) {
-  innerWriter.save();
+try {
+  ArrayWriter innerWriter = listWriter.array(currentFieldName);
+  if (values[row][col] instanceof Integer) {
+innerWriter.scalar().setInt((Integer) values[row][col]);
+  } else if (values[row][col] instanceof Short) {
+innerWriter.scalar().setInt((Short) values[row][col]);
+  } else if (values[row][col] instanceof Byte) {
+innerWriter.scalar().setInt((Byte) values[row][col]);
+  } else if (values[row][col] instanceof Long) {
+innerWriter.scalar().setLong((Long) values[row][col]);
+  } else if (values[row][col] instanceof Float) {
+innerWriter.scalar().setDouble((Float) values[row][col]);
+  } else if (values[row][col] instanceof Double) {
+innerWriter.scalar().setDouble((Double) values[row][col]);
+  } else if (values[row][col] instanceof BitSet || 
values[row][col] instanceof Boolean) {
+innerWriter.scalar().setBoolean((Boolean) values[row][col]);
+  } else if (values[row][col] instanceof String) {
+innerWriter.scalar().setString((String) values[row][col]);
+  }
 
 Review comment:
   What happens for the "none of the above" case? Should we throw an exception 
to be safe?
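[Editorial note] The "throw to be safe" option the reviewer raises can be sketched as a trailing `else` that fails fast. The method and message below are illustrative, not the PR's actual code:

```java
public class TypeDispatchSketch {

  /** Names the writer call for a value; the final else makes the
      "none of the above" case fail fast instead of silently writing nothing. */
  public static String dispatch(Object v) {
    if (v instanceof Integer || v instanceof Short || v instanceof Byte) {
      return "setInt";
    } else if (v instanceof Long) {
      return "setLong";
    } else if (v instanceof Float || v instanceof Double) {
      return "setDouble";
    } else if (v instanceof Boolean) {
      return "setBoolean";
    } else if (v instanceof String) {
      return "setString";
    } else {
      throw new IllegalStateException(
          "Unhandled compound member type: " + v.getClass().getName());
    }
  }

  public static void main(String[] args) {
    System.out.println(dispatch(3.14)); // setDouble
  }
}
```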
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046037#comment-17046037
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384867259
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -628,8 +672,8 @@ private void writeIntListColumn(TupleWriter rowWriter, 
String name, int[] list)
 }
 
 ScalarWriter arrayWriter = rowWriter.column(index).array().scalar();
-for (int value : list) {
-  arrayWriter.setInt(value);
+for (int i = 0; (i < list.length && i < PREVIEW_ROW_LIMIT); i++) {
 
 Review comment:
   Nit:
   
   ```
   int maxElements = Math.min(list.length, PREVIEW_ROW_LIMIT);
   for (int i = 0; i < maxElements; i++) {
   ```
   
   Here and below.
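[Editorial note] A runnable form of the hoisted-bound loop the reviewer suggests. `PREVIEW_ROW_LIMIT = 20` is an assumed value; the PR's actual limit is not quoted in this thread:

```java
import java.util.ArrayList;
import java.util.List;

public class PreviewLimitSketch {

  // Assumed value; the PR's actual PREVIEW_ROW_LIMIT is not quoted here.
  static final int PREVIEW_ROW_LIMIT = 20;

  /** Copies at most PREVIEW_ROW_LIMIT elements, with the bound hoisted
      out of the loop condition as the reviewer suggests. */
  public static List<Integer> preview(int[] list) {
    int maxElements = Math.min(list.length, PREVIEW_ROW_LIMIT);
    List<Integer> out = new ArrayList<>(maxElements);
    for (int i = 0; i < maxElements; i++) {
      out.add(list[i]);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(preview(new int[]{1, 2, 3}));   // [1, 2, 3]
    System.out.println(preview(new int[100]).size());  // 20
  }
}
```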
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046036#comment-17046036
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384868211
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1032,11 +1081,11 @@ private void getAndMapCompoundData(String path, 
List<String> fieldNames, IHDF5Re
 
   SchemaBuilder innerSchema = new SchemaBuilder();
   MapBuilder mapBuilder = innerSchema.addMap(COMPOUND_DATA_FIELD_NAME);
-
   for (HDF5CompoundMemberInformation info : infos) {
 fieldNames.add(info.getName());
+String compoundColumnDataType = 
info.getType().tryGetJavaType().getSimpleName();
 
-switch (info.getType().tryGetJavaType().getSimpleName()) {
+switch (compoundColumnDataType) {
 
 Review comment:
   Would be more reliable to switch on the type itself. Switch probably won't 
work, so would instead need a chain of `if` statements.
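[Editorial note] A classic Java `switch` cannot take a `Class` operand, hence the reviewer's suggested `if` chain. Assuming `tryGetJavaType()` returns a `Class<?>`, comparing `Class` references directly avoids collisions between same-named classes in different packages; the mapping below is a sketch, not the PR's actual table:

```java
public class ClassCompareSketch {

  /** Dispatches on the Class object itself rather than its simple name. */
  public static String drillType(Class<?> javaType) {
    if (javaType == Integer.class || javaType == int.class) {
      return "INT";
    } else if (javaType == Long.class || javaType == long.class) {
      return "BIGINT";
    } else if (javaType == Double.class || javaType == double.class) {
      return "FLOAT8";
    } else if (javaType == String.class) {
      return "VARCHAR";
    }
    return "UNKNOWN";
  }

  public static void main(String[] args) {
    System.out.println(drillType(int.class));    // INT
    System.out.println(drillType(String.class)); // VARCHAR
  }
}
```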
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046038#comment-17046038
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r384867611
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -946,7 +990,11 @@ private void writeAttributes(TupleWriter rowWriter, 
HDF5DrillMetadata record) {
   writeLongColumn(mapWriter, key, (Long) attrib.getValue());
   break;
 case INT:
-  writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  try {
+writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  } catch (Exception e) {
+logger.warn("{} {}", key, attrib);
 
 Review comment:
   Is this the right solution? What would cause the exception? If reading 100M 
rows, do we want to emit 100M warnings into the log? Will the user understand 
why the value in some column is null?
   
   Should we be more selective: catching the (one) exception we want to ignore 
and failing for all others?
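[Editorial note] The log-volume half of this concern (one warning per row over 100M rows) is often handled with a warn-once guard per reader instance. This is a generic sketch, not code from the PR:

```java
public class WarnOnceSketch {

  private boolean warned = false;
  public int warnCount = 0; // exposed for the demo; real code would just log

  /** Emits the conversion warning at most once per reader instance,
      so a 100M-row file produces one log line, not 100M. */
  public void warnOnce(String key) {
    if (!warned) {
      warned = true;
      warnCount++;
      // logger.warn("Could not convert attribute {}; writing null", key);
    }
  }

  public static void main(String[] args) {
    WarnOnceSketch reader = new WarnOnceSketch();
    for (int row = 0; row < 1_000; row++) {
      reader.warnOnce("int_attr");
    }
    System.out.println(reader.warnCount); // 1
  }
}
```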
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045540#comment-17045540
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-591427930
 
 
   @paul-rogers Do you approve of this PR?
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045517#comment-17045517
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on issue #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-591410602
 
 
   @cgivre don't see @paul-rogers approval, so we should wait for it first.
 





[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045512#comment-17045512
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-591408854
 
 
   @arina-ielchiieva Can we merge this PR now?  Thx
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044574#comment-17044574
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383950814
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -946,7 +990,11 @@ private void writeAttributes(TupleWriter rowWriter, 
HDF5DrillMetadata record) {
   writeLongColumn(mapWriter, key, (Long) attrib.getValue());
   break;
 case INT:
-  writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  try {
+writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  } catch (Exception e) {
+logger.warn("{} {}", key, attrib.toString());
 
 Review comment:
   Fixed
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044570#comment-17044570
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383948346
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -946,7 +990,11 @@ private void writeAttributes(TupleWriter rowWriter, 
HDF5DrillMetadata record) {
   writeLongColumn(mapWriter, key, (Long) attrib.getValue());
   break;
 case INT:
-  writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  try {
+writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  } catch (Exception e) {
+logger.warn("{} {}", key, attrib.toString());
 
 Review comment:
   This one.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044554#comment-17044554
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-590913914
 
 
   > Nit: `toString` method will be called anyway, so no need to indicate it.
   @arina-ielchiieva 
   Which line are you referring to?
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044534#comment-17044534
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383929237
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1117,30 @@ private void getAndMapCompoundData(String path, 
List<String> fieldNames, IHDF5Re
   for (int col = 0; col < values[row].length; col++) {
 assert fieldNames != null;
 currentFieldName = fieldNames.get(col);
-ArrayWriter innerWriter = listWriter.array(currentFieldName);
-if (values[row][col] instanceof Integer) {
-  innerWriter.scalar().setInt((Integer) values[row][col]);
-} else if (values[row][col] instanceof Short) {
-  innerWriter.scalar().setInt((Short) values[row][col]);
-} else if (values[row][col] instanceof Byte) {
-  innerWriter.scalar().setInt((Byte) values[row][col]);
-} else if (values[row][col] instanceof Long) {
-  innerWriter.scalar().setLong((Long) values[row][col]);
-} else if (values[row][col] instanceof Float) {
-  innerWriter.scalar().setDouble((Float) values[row][col]);
-} else if (values[row][col] instanceof Double) {
-  innerWriter.scalar().setDouble((Double) values[row][col]);
-} else if (values[row][col] instanceof BitSet || values[row][col] 
instanceof Boolean) {
-  innerWriter.scalar().setBoolean((Boolean) values[row][col]);
-} else if (values[row][col] instanceof String) {
-  innerWriter.scalar().setString((String) values[row][col]);
-}
-if (col == values[row].length) {
-  innerWriter.save();
+try {
+  ArrayWriter innerWriter = listWriter.array(currentFieldName);
+  if (values[row][col] instanceof Integer) {
+innerWriter.scalar().setInt((Integer) values[row][col]);
+  } else if (values[row][col] instanceof Short) {
+innerWriter.scalar().setInt((Short) values[row][col]);
+  } else if (values[row][col] instanceof Byte) {
+innerWriter.scalar().setInt((Byte) values[row][col]);
+  } else if (values[row][col] instanceof Long) {
+innerWriter.scalar().setLong((Long) values[row][col]);
+  } else if (values[row][col] instanceof Float) {
+innerWriter.scalar().setDouble((Float) values[row][col]);
+  } else if (values[row][col] instanceof Double) {
+innerWriter.scalar().setDouble((Double) values[row][col]);
+  } else if (values[row][col] instanceof BitSet || 
values[row][col] instanceof Boolean) {
+innerWriter.scalar().setBoolean((Boolean) values[row][col]);
+  } else if (values[row][col] instanceof String) {
+innerWriter.scalar().setString((String) values[row][col]);
+  }
+  if (col == values[row].length) {
+innerWriter.save();
+  }
+} catch (Exception e) {
+  logger.warn("Drill does not support maps and lists in HDF5 
Compound fields. Skipping: {}/{}", resolvedPath, currentFieldName);
 
 Review comment:
   Added to docs.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044535#comment-17044535
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383929404
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   Fixed here and elsewhere.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044537#comment-17044537
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383929709
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -467,6 +503,12 @@ private void projectDataset(RowSetLoader rowWriter, 
String datapath) {
 String fieldName = HDF5Utils.getNameFromPath(datapath);
 IHDF5Reader reader = hdf5Reader;
 HDF5DataSetInformation dsInfo = 
reader.object().getDataSetInformation(datapath);
+
+// If the dataset is larger than 16MB, do not project the dataset
+if (dsInfo.getSize() > 16777216) {
 
 Review comment:
   Added constant.
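[Editorial note] Replacing the 16777216 magic number with a named constant might look like the sketch below. The constant and method names are hypothetical; the PR's actual identifier is not quoted in this thread:

```java
public class DatasetLimits {

  // Hypothetical name; 16 * 1024 * 1024 = 16777216 bytes = 16 MiB.
  public static final long MAX_METADATA_DATASET_SIZE = 16L * 1024 * 1024;

  /** Datasets above the limit are listed in metadata queries but not projected. */
  public static boolean shouldProject(long datasetSizeBytes) {
    return datasetSizeBytes <= MAX_METADATA_DATASET_SIZE;
  }

  public static void main(String[] args) {
    System.out.println(MAX_METADATA_DATASET_SIZE);  // 16777216
    System.out.println(shouldProject(16_777_217));  // false
  }
}
```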
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044536#comment-17044536
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383929588
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -946,7 +988,11 @@ private void writeAttributes(TupleWriter rowWriter, 
HDF5DrillMetadata record) {
   writeLongColumn(mapWriter, key, (Long) attrib.getValue());
   break;
 case INT:
-  writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  try {
+writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  } catch (Exception e) {
+System.out.println(key +  " " + attrib.toString());
 
 Review comment:
   Crap... sorry I missed that.  Removed.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044539#comment-17044539
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383929781
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -458,7 +494,7 @@ private void projectMetadataRow(RowSetLoader rowWriter) {
 
   /**
* This function writes one row of data in a metadata query. The number of 
dimensions here is n+1. So if the actual dataset is a 1D column, it will be 
written as a list.
-   * This is function is only called in metadata queries as the schema is not 
known in advance.
+   * This is function is only called in metadata queries as the schema is not 
known in advance.  If the datasize is greater than 16MB, the function does not 
project the dataset
 
 Review comment:
   Fixed
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044532#comment-17044532
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383928859
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1027,16 +1074,17 @@ private void getAndMapCompoundData(String path, 
List<String> fieldNames, IHDF5Re
 }
   } else {
 int index = 0;
+List compoundDataTypes = new ArrayList<>();
 
 Review comment:
   Removed this variable.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044531#comment-17044531
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383928727
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -458,7 +494,7 @@ private void projectMetadataRow(RowSetLoader rowWriter) {
 
   /**
* This function writes one row of data in a metadata query. The number of 
dimensions here is n+1. So if the actual dataset is a 1D column, it will be 
written as a list.
-   * This is function is only called in metadata queries as the schema is not 
known in advance.
+   * This is function is only called in metadata queries as the schema is not 
known in advance.  If the datasize is greater than 16MB, the function does not 
project the dataset
 
 Review comment:
   Added to documentation.  
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044499#comment-17044499
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383908562
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -467,6 +503,12 @@ private void projectDataset(RowSetLoader rowWriter, 
String datapath) {
 String fieldName = HDF5Utils.getNameFromPath(datapath);
 IHDF5Reader reader = hdf5Reader;
 HDF5DataSetInformation dsInfo = 
reader.object().getDataSetInformation(datapath);
+
+// If the dataset is larger than 16MB, do not project the dataset
+if (dsInfo.getSize() > 16777216) {
 
 Review comment:
   Use constant please instead of number, this would be more readable.
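   As a hedged illustration of the reviewer's point, a named constant makes the 16 MB threshold self-documenting. The constant and method names below are assumptions for the sketch, not necessarily what the PR committed:

```java
// Sketch only: MAX_DATASET_READ_SIZE and shouldProjectDataset are
// hypothetical names illustrating the reviewer's suggestion.
public class DatasetSizeCheck {
  // 16 MB threshold above which metadata queries skip dataset projection
  private static final long MAX_DATASET_READ_SIZE = 16L * 1024 * 1024;

  static boolean shouldProjectDataset(long datasetSize) {
    return datasetSize <= MAX_DATASET_READ_SIZE;
  }

  public static void main(String[] args) {
    System.out.println(shouldProjectDataset(1_000L));       // true
    System.out.println(shouldProjectDataset(20_000_000L));  // false
  }
}
```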
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044504#comment-17044504
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383911343
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1027,16 +1074,17 @@ private void getAndMapCompoundData(String path, 
List<String> fieldNames, IHDF5Re
 }
   } else {
 int index = 0;
+List compoundDataTypes = new ArrayList<>();
 
 Review comment:
   Could you please point out where this list is used in code?
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044500#comment-17044500
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383908899
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -946,7 +988,11 @@ private void writeAttributes(TupleWriter rowWriter, 
HDF5DrillMetadata record) {
   writeLongColumn(mapWriter, key, (Long) attrib.getValue());
   break;
 case INT:
-  writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  try {
+writeIntColumn(mapWriter, key, (Integer) attrib.getValue());
+  } catch (Exception e) {
+System.out.println(key +  " " + attrib.toString());
 
 Review comment:
   Please remove and do proper error handling.
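   A minimal sketch of what that error handling might look like, assuming the intent is to log and skip the problematic attribute rather than print to stdout. The helper name is hypothetical, and java.util.logging is used only to keep the sketch self-contained (Drill itself logs through slf4j):

```java
import java.util.logging.Logger;

public class AttributeWriteSketch {
  private static final Logger logger =
      Logger.getLogger(AttributeWriteSketch.class.getName());

  // Stand-in for writeIntColumn: returns a description of what was written.
  static String writeIntAttribute(String key, Object value) {
    if (value instanceof Integer) {
      return "wrote int " + value + " to column " + key;
    }
    // Log at warn level and skip, instead of printing and swallowing the error
    logger.warning("Skipping attribute " + key + ": unexpected type "
        + (value == null ? "null" : value.getClass().getSimpleName()));
    return "skipped " + key;
  }

  public static void main(String[] args) {
    System.out.println(writeIntAttribute("rows", 42));
    System.out.println(writeIntAttribute("label", "not-an-int"));
  }
}
```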
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044501#comment-17044501
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383911924
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -458,7 +494,7 @@ private void projectMetadataRow(RowSetLoader rowWriter) {
 
   /**
* This function writes one row of data in a metadata query. The number of 
dimensions here is n+1. So if the actual dataset is a 1D column, it will be 
written as a list.
-   * This is function is only called in metadata queries as the schema is not 
known in advance.
+   * This is function is only called in metadata queries as the schema is not 
known in advance.  If the datasize is greater than 16MB, the function does not 
project the dataset
 
 Review comment:
   Please also add this to documentation.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044502#comment-17044502
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383909824
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   Better use unordered when you don't have order by clause in the query.
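   The reasoning behind the comment, sketched outside Drill's test framework: without an ORDER BY clause the engine may legally return rows in any order, so a baseline comparison should treat the result as a multiset rather than a sequence. The helper below is illustrative only, not Drill's testBuilder:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UnorderedBaseline {
  // Compare two row lists ignoring order, as an unordered baseline would.
  static boolean sameRowsAnyOrder(List<String> actual, List<String> expected) {
    List<String> a = new ArrayList<>(actual);
    List<String> e = new ArrayList<>(expected);
    Collections.sort(a);
    Collections.sort(e);
    return a.equals(e);
  }

  public static void main(String[] args) {
    // Rows arrive in a different order than the baseline, yet still match
    List<String> actual = List.of("/dset2", "/dset1");
    List<String> expected = List.of("/dset1", "/dset2");
    System.out.println(sameRowsAnyOrder(actual, expected)); // true
  }
}
```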
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044498#comment-17044498
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383908163
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -458,7 +494,7 @@ private void projectMetadataRow(RowSetLoader rowWriter) {
 
   /**
* This function writes one row of data in a metadata query. The number of 
dimensions here is n+1. So if the actual dataset is a 1D column, it will be 
written as a list.
-   * This is function is only called in metadata queries as the schema is not 
known in advance.
+   * This is function is only called in metadata queries as the schema is not 
known in advance.  If the datasize is greater than 16MB, the function does not 
project the dataset
 
 Review comment:
   ```suggestion
  * This is function is only called in metadata queries as the schema is 
not known in advance. If the datasize is greater than 16MB, the function does 
not project the dataset
   ```
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044503#comment-17044503
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

arina-ielchiieva commented on pull request #1978: DRILL-7578: HDF5 Metadata 
Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r383910324
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1117,30 @@ private void getAndMapCompoundData(String path, 
List<String> fieldNames, IHDF5Re
   for (int col = 0; col < values[row].length; col++) {
 assert fieldNames != null;
 currentFieldName = fieldNames.get(col);
-ArrayWriter innerWriter = listWriter.array(currentFieldName);
-if (values[row][col] instanceof Integer) {
-  innerWriter.scalar().setInt((Integer) values[row][col]);
-} else if (values[row][col] instanceof Short) {
-  innerWriter.scalar().setInt((Short) values[row][col]);
-} else if (values[row][col] instanceof Byte) {
-  innerWriter.scalar().setInt((Byte) values[row][col]);
-} else if (values[row][col] instanceof Long) {
-  innerWriter.scalar().setLong((Long) values[row][col]);
-} else if (values[row][col] instanceof Float) {
-  innerWriter.scalar().setDouble((Float) values[row][col]);
-} else if (values[row][col] instanceof Double) {
-  innerWriter.scalar().setDouble((Double) values[row][col]);
-} else if (values[row][col] instanceof BitSet || values[row][col] 
instanceof Boolean) {
-  innerWriter.scalar().setBoolean((Boolean) values[row][col]);
-} else if (values[row][col] instanceof String) {
-  innerWriter.scalar().setString((String) values[row][col]);
-}
-if (col == values[row].length) {
-  innerWriter.save();
+try {
+  ArrayWriter innerWriter = listWriter.array(currentFieldName);
+  if (values[row][col] instanceof Integer) {
+innerWriter.scalar().setInt((Integer) values[row][col]);
+  } else if (values[row][col] instanceof Short) {
+innerWriter.scalar().setInt((Short) values[row][col]);
+  } else if (values[row][col] instanceof Byte) {
+innerWriter.scalar().setInt((Byte) values[row][col]);
+  } else if (values[row][col] instanceof Long) {
+innerWriter.scalar().setLong((Long) values[row][col]);
+  } else if (values[row][col] instanceof Float) {
+innerWriter.scalar().setDouble((Float) values[row][col]);
+  } else if (values[row][col] instanceof Double) {
+innerWriter.scalar().setDouble((Double) values[row][col]);
+  } else if (values[row][col] instanceof BitSet || 
values[row][col] instanceof Boolean) {
+innerWriter.scalar().setBoolean((Boolean) values[row][col]);
+  } else if (values[row][col] instanceof String) {
+innerWriter.scalar().setString((String) values[row][col]);
+  }
+  if (col == values[row].length) {
+innerWriter.save();
+  }
+} catch (Exception e) {
+  logger.warn("Drill does not support maps and lists in HDF5 
Compound fields. Skipping: {}/{}", resolvedPath, currentFieldName);
 
 Review comment:
   Please ensure that `contrib/format-hdf5/README.md` contains this 
information, maybe in Limitations section.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042165#comment-17042165
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-589833387
 
 
   @paul-rogers 
   I believe I've addressed the issues and this should be ready for final 
review.  Thanks!
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042162#comment-17042162
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-589833387
 
 
   @paul-rogers 
   I think this is ready for the second pass at review.  Thx
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041939#comment-17041939
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r382629907
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -92,6 +93,20 @@
 
   private static final String LONG_COLUMN_NAME = "long_data";
 
+  private static final String DATA_SIZE_COLUMN_NAME = "data_size";
+
+  private static final String ELEMENT_COUNT_NAME = "element_count";
+
+  private static final String IS_TIMESTAMP_NAME = "is_timestamp";
 
 Review comment:
   @paul-rogers 
   Since there is a column reporting the actual HDF5 data type, I removed both 
these columns.  If there is demand, I'll implement the interval data type.  
Timestamp is already implemented.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040324#comment-17040324
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r381468457
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   I'm guessing that this isn't for this PR either? 
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039528#comment-17039528
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380982462
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   Agreed. The point is not so much to change what you have here. Rather, it is 
to learn what more Drill should provide. Some other random examples of nested 
schemas are 1) REST calls, 2) Zip archives, 3) Iceberg files, 4) the JDBC 
storage plugin.
   
   I'm sure these are all done either not at all (Zip, Iceberg) or via ad-hoc 
code (JDBC). The lesson is that Drill probably needs a way to support nested 
schemas; not all data sources are Parquet or JSON files on HDFS.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039432#comment-17039432
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380916913
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   In retrospect, the best way would probably be to do some sort of hybrid 
plugin that has access to the Calcite schema. That way you could add the HDF5 
data path after the file.  If there isn't a data path.. you get metadata only.  
However, I suspect that would be very difficult with the current architecture 
and would involve a lot of cut/paste and or extending from the file system 
plugin.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039429#comment-17039429
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380915865
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   @paul-rogers 
   Thanks as always for your responses.   I don't know that I'd use this plugin 
as an example of much, just because I don't think there are many other data 
formats like it. (Maybe I'm wrong about that).
   
   I do think we should look at the LTSV and other EVF conversions and make 
incremental improvements to an "Easy EVF" interface that will reduce cut/paste 
code in format plugins and make them easier to write.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039387#comment-17039387
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380885317
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -92,6 +93,20 @@
 
   private static final String LONG_COLUMN_NAME = "long_data";
 
+  private static final String DATA_SIZE_COLUMN_NAME = "data_size";
+
+  private static final String ELEMENT_COUNT_NAME = "element_count";
+
+  private static final String IS_TIMESTAMP_NAME = "is_timestamp";
 
 Review comment:
   @cgivre, maybe just add the `extended_type` column (or `hdf5_type`) to tell 
the user the actual data type, leaving the `data_type` to provide the converted 
Drill type. This will work whether the code actually supports time/interval 
conversion or not. 
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039386#comment-17039386
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380884441
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   @cgivre, you are right about my lack of understanding. This thing is a beast 
and I'm afraid I don't understand much of it. The config variable is also a 
hack: it means the user needs a separate config for each file and data set. The 
ease of use of that is perhaps not what it should be. Having a session option 
would be a bit better, but still not great. (Of course, Drill provides no way 
for a plugin to define a session option. Another good feature request.)
   
   I posted the question on the dev list and learned that we probably can 
modify Drill to allow you to specify the data set name as part of the table 
path. And, we can work out how to extend the `DESCRIBE` statement to query for 
schema rather than data.
   
   All of this is beyond the scope of this PR. If we use HDF5 as a prototype, 
we can see that we'd want to do similar things for other formats or plugins. 
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039106#comment-17039106
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380694846
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1125,30 @@ private void getAndMapCompoundData(String path, 
List<String> fieldNames, IHDF5Re
  for (int col = 0; col < values[row].length; col++) {
 assert fieldNames != null;
 currentFieldName = fieldNames.get(col);
-ArrayWriter innerWriter = listWriter.array(currentFieldName);
-if (values[row][col] instanceof Integer) {
-  innerWriter.scalar().setInt((Integer) values[row][col]);
-} else if (values[row][col] instanceof Short) {
-  innerWriter.scalar().setInt((Short) values[row][col]);
-} else if (values[row][col] instanceof Byte) {
-  innerWriter.scalar().setInt((Byte) values[row][col]);
-} else if (values[row][col] instanceof Long) {
-  innerWriter.scalar().setLong((Long) values[row][col]);
-} else if (values[row][col] instanceof Float) {
-  innerWriter.scalar().setDouble((Float) values[row][col]);
-} else if (values[row][col] instanceof Double) {
-  innerWriter.scalar().setDouble((Double) values[row][col]);
-} else if (values[row][col] instanceof BitSet || values[row][col] 
instanceof Boolean) {
-  innerWriter.scalar().setBoolean((Boolean) values[row][col]);
-} else if (values[row][col] instanceof String) {
-  innerWriter.scalar().setString((String) values[row][col]);
-}
-if (col == values[row].length) {
-  innerWriter.save();
+try {
+  ArrayWriter innerWriter = listWriter.array(currentFieldName);
+  if (values[row][col] instanceof Integer) {
 
 Review comment:
   Unfortunately, the HDF5 library doesn't provide a way to get primitives out 
of compound data types, however I do think there is a way to reduce the number 
of casts/boxing/unboxing in this section.  
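
   One way the cast chain could be reduced is a class-keyed dispatch table, so 
each value costs one map lookup instead of a run of `instanceof` tests. This is 
only an illustrative sketch, not Drill's actual writer API: `DemoWriter`, 
`WRITERS`, and `write()` are hypothetical names standing in for the plugin's 
`ScalarWriter` plumbing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch: dispatch boxed compound-field values through a
// class-keyed map. DemoWriter stands in for Drill's ScalarWriter and
// just records what it was given.
public class CompoundDispatch {

  static class DemoWriter {
    final StringBuilder out = new StringBuilder();
    void setInt(int v)       { out.append(v).append(';'); }
    void setLong(long v)     { out.append(v).append(';'); }
    void setDouble(double v) { out.append(v).append(';'); }
    void setString(String v) { out.append(v).append(';'); }
  }

  // One entry per supported boxed type, built once instead of per value.
  static final Map<Class<?>, BiConsumer<DemoWriter, Object>> WRITERS = new HashMap<>();
  static {
    WRITERS.put(Integer.class, (w, v) -> w.setInt((Integer) v));
    WRITERS.put(Short.class,   (w, v) -> w.setInt((Short) v));
    WRITERS.put(Byte.class,    (w, v) -> w.setInt((Byte) v));
    WRITERS.put(Long.class,    (w, v) -> w.setLong((Long) v));
    WRITERS.put(Float.class,   (w, v) -> w.setDouble((Float) v));
    WRITERS.put(Double.class,  (w, v) -> w.setDouble((Double) v));
    WRITERS.put(String.class,  (w, v) -> w.setString((String) v));
  }

  static void write(DemoWriter w, Object value) {
    BiConsumer<DemoWriter, Object> fn = WRITERS.get(value.getClass());
    if (fn != null) {
      fn.accept(w, value);   // values are still boxed, but only one lookup each
    }
  }

  public static void main(String[] args) {
    DemoWriter w = new DemoWriter();
    write(w, 5);
    write(w, 2.5);
    write(w, "x");
    System.out.println(w.out);  // 5;2.5;x;
  }
}
```

This does not remove the boxing itself (the values arrive boxed from the 
library), but it does collapse the branching to a single table lookup.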
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039100#comment-17039100
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380683693
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -92,6 +93,20 @@
 
   private static final String LONG_COLUMN_NAME = "long_data";
 
+  private static final String DATA_SIZE_COLUMN_NAME = "data_size";
+
+  private static final String ELEMENT_COUNT_NAME = "element_count";
+
+  private static final String IS_TIMESTAMP_NAME = "is_timestamp";
 
 Review comment:
   @paul-rogers 
   I debated adding these.  At the moment, Drill/HDF5 does support reading 
`TIMESTAMP` columns as `TIMESTAMP`s.  I wasn't aware of intervals as an HDF5 
data type, so the current implementation doesn't support that. 
   
   I can create a PR to support Intervals and work on it if there is 
usage/demand for it.  
   Do you think I should remove these columns from the metadata view?
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039096#comment-17039096
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380681160
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   Hey @paul-rogers 
   I'd agree that this is a bit of a hack, but I think you misunderstood how 
this plugin works.  There is a config variable called `defaultPath` which when 
`null` returns metadata.  If this variable is set to an HDF5 path, you get the 
data (and no metadata). 
   
   The issue is the rather unique nature of HDF5 as a file system within a 
file.  I do like the idea of treating the file as a directory; however, the 
metadata is really useful to the user in that some file paths are shortcuts, 
some are groups, etc.   Also, datasets have attributes which can be useful. 
   
   IMHO, this really blends the lines between storage and format plugin, so 
it's quite challenging to design this.  I am curious as to whether this 
actually gets used in the scientific community.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038855#comment-17038855
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-587322871
 
 
   @cgivre, one more design-level comment about this particular file format. 
You've mentioned several times that HDF5 is "a file system within a file." It 
finally clicked: we need to treat this file as a directory, not a file. 
This means adding a layer of schema in Calcite planning:
   
   ```
   SELECT * FROM `dfs`.`some/path/myFile.hdf5`.`dataSet1`
   ```
   
   This would let the reader load only data from `dataSet1`, using only the 
schema from that data set.
   
   (Can't use slashes; that is a notation for the Hadoop file system.)
   
   Fortunately, Calcite seems to allow any number of schema levels. It is why 
we can have plugins, workspaces, etc. The challenge is to provide some way for 
a format plugin to influence the planner and say, "hey, if you do a query 
against me, ask me to resolve all path elements below my file name."
   
   Again, not something for this PR. But, it is something we can think about as 
we try to improve our storage plugin API.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038834#comment-17038834
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380490693
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   This might be the place to ask the question about schema. We have two 
distinct views of a data set. The general rule of the wildcard (`*`) is to 
return all available columns. Here, we special-case wildcard to mean "return 
metadata." This is, unfortunately, very non-standard.
   
   We need some way to express two views of the file. The same problem occurs 
for any database. We could even use it for JSON, CSV, and other file formats.
   
   The challenge is, how do we tell the query we want metadata and not data? In 
a normal DB, we query system tables. Perhaps we could jimmy up something in 
Drill:
   
   ```
   SELECT * FROM sys.schema.dfs.`hdf5/dset.h5`
   ```
   
   Or, maybe think of the table as a namespace, and have an optional `.schema` 
tail:
   
   ```
   SELECT * FROM dfs.`hdf5/dset.h5`.schema
   ```
   
   The point is not for you to implement this, or even to design the solution. 
Rather, the point is that the current solution is a hack, and that we need a 
better solution.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038835#comment-17038835
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380488669
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1125,30 @@ private void getAndMapCompoundData(String path, 
List<String> fieldNames, IHDF5Re
   for (int col = 0; col < values[row].length; col++) {
 assert fieldNames != null;
 currentFieldName = fieldNames.get(col);
-ArrayWriter innerWriter = listWriter.array(currentFieldName);
-if (values[row][col] instanceof Integer) {
-  innerWriter.scalar().setInt((Integer) values[row][col]);
-} else if (values[row][col] instanceof Short) {
-  innerWriter.scalar().setInt((Short) values[row][col]);
-} else if (values[row][col] instanceof Byte) {
-  innerWriter.scalar().setInt((Byte) values[row][col]);
-} else if (values[row][col] instanceof Long) {
-  innerWriter.scalar().setLong((Long) values[row][col]);
-} else if (values[row][col] instanceof Float) {
-  innerWriter.scalar().setDouble((Float) values[row][col]);
-} else if (values[row][col] instanceof Double) {
-  innerWriter.scalar().setDouble((Double) values[row][col]);
-} else if (values[row][col] instanceof BitSet || values[row][col] 
instanceof Boolean) {
-  innerWriter.scalar().setBoolean((Boolean) values[row][col]);
-} else if (values[row][col] instanceof String) {
-  innerWriter.scalar().setString((String) values[row][col]);
-}
-if (col == values[row].length) {
-  innerWriter.save();
+try {
+  ArrayWriter innerWriter = listWriter.array(currentFieldName);
+  if (values[row][col] instanceof Integer) {
 
 Review comment:
   I realize that this is existing code, but boxing and comparing each value 
will be slow and will thrash the heap. Far better if we can use "shims" that 
can read the data as the Java primitive type and write it directly to the 
corresponding `set()` method without boxing.
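
   The "shim" idea might look roughly like the sketch below: one small adapter 
per primitive column type that holds the primitive array and forwards values to 
a typed setter, keeping boxing and type tests out of the row loop. All names 
here (`ColumnShim`, `IntColumnShim`, `DoubleColumnShim`) are hypothetical, not 
Drill or JHDF5 APIs.

```java
import java.util.function.DoubleConsumer;
import java.util.function.IntConsumer;

// Illustrative per-type "shims": the type decision is made once, when the
// shim is built, so the row loop does no instanceof checks and no boxing.
public class ShimSketch {

  interface ColumnShim {
    void writeRow(int row);
  }

  static class IntColumnShim implements ColumnShim {
    private final int[] data;           // primitive column from the reader
    private final IntConsumer setter;   // stands in for ScalarWriter.setInt
    IntColumnShim(int[] data, IntConsumer setter) { this.data = data; this.setter = setter; }
    @Override public void writeRow(int row) { setter.accept(data[row]); }
  }

  static class DoubleColumnShim implements ColumnShim {
    private final double[] data;
    private final DoubleConsumer setter; // stands in for ScalarWriter.setDouble
    DoubleColumnShim(double[] data, DoubleConsumer setter) { this.data = data; this.setter = setter; }
    @Override public void writeRow(int row) { setter.accept(data[row]); }
  }

  public static void main(String[] args) {
    StringBuilder out = new StringBuilder();
    ColumnShim[] shims = {
      new IntColumnShim(new int[] {1, 2}, v -> out.append(v).append(' ')),
      new DoubleColumnShim(new double[] {0.5, 1.5}, v -> out.append(v).append(' '))
    };
    for (int row = 0; row < 2; row++) {   // row loop is type-free
      for (ColumnShim shim : shims) {
        shim.writeRow(row);
      }
    }
    System.out.println(out);  // 1 0.5 2 1.5
  }
}
```

Whether this is feasible depends on the HDF5 library exposing primitive 
arrays for compound members, which, per the comment above, it may not.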
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038836#comment-17038836
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380487624
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -92,6 +93,20 @@
 
   private static final String LONG_COLUMN_NAME = "long_data";
 
+  private static final String DATA_SIZE_COLUMN_NAME = "data_size";
+
+  private static final String ELEMENT_COUNT_NAME = "element_count";
+
+  private static final String IS_TIMESTAMP_NAME = "is_timestamp";
 
 Review comment:
   The two `is` columns appear mutually exclusive. I wonder, does it make sense 
to define an `extended_type` column if `data_type` is the Drill type? That is, 
for most columns, `extended_type` would be null. For these two it would be, say 
`TIMESTAMP` or `TIME_DURATION`. Though, truth be told, Drill has `TIMESTAMP` 
and `INTERVAL` columns, so if we mapped the HDF5 type to these Drill types, we 
would not need the extended type (or these two Boolean columns).
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038762#comment-17038762
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-587259759
 
 
   @paul-rogers @vvysotskyi 
   See above comment.  I removed the config option and added logger warnings if 
the data is truncated.  Again, this is just for "preview" mode so real data 
queries are not affected.  
   In doing this PR, I discovered that the HDF5 format allows for arrays within 
compound fields. 
   
   This functionality is not supported by Drill, so I added a warning for that.  
In the future, or if anyone asks for it, I may add it, but for now I'm leaving 
that alone.  
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036389#comment-17036389
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585893378
 
 
   @cgivre, sounds like a good plan.
   
   By the way, it occurred to me that the original "preview" idea may have 
another issue. Drill is SQL-based; clients work best when a column has a 
specific type. On the other hand, if HDF5 is a file system, and we want to 
preview files, each file may have a different kind of data: records in one, 
strings in another, a matrix in a third. If we try to write each of these into 
a "preview" column, not only do we have a size issue, we also have a type 
issue: all of these examples are different types.
   
   On your laptop, the OS gives you a preview of each file. The common 
denominator is the graphic tile which might be a tiny version of an image or a 
video frame, might be an app icon, or whatever. Point is, the OS converts all 
the many file types into a common format: the preview tile.
   
   If the "metadata" view scans all files, the preview can be huge (the entire 
HDF5 content, perhaps). A preview should be small. Rendering a tile may not be 
super helpful. But, perhaps a brief text representation. For your bit column, 
maybe: "[[123456.78, 98745.43, ...".
   
   Your proposed solution of using HDF5-provided metadata is also good, it 
makes your metadata query more like an "ls -l" equivalent. 
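
   The "brief text representation" suggested above could be as simple as 
formatting the first few elements and eliding the rest. A minimal sketch, 
assuming an arbitrary element cap; `previewText` is an illustrative name, not 
part of the plugin:

```java
// Sketch of a short textual preview for a numeric dataset, along the
// lines of "[123456.78, 98745.43, ...]". The element cap is arbitrary.
public class TextPreview {

  static String previewText(double[] values, int max) {
    StringBuilder sb = new StringBuilder("[");
    int n = Math.min(values.length, max);
    for (int i = 0; i < n; i++) {
      if (i > 0) sb.append(", ");
      sb.append(values[i]);
    }
    if (values.length > max) {
      sb.append(", ...");   // signal that the dataset was elided
    }
    return sb.append(']').toString();
  }

  public static void main(String[] args) {
    System.out.println(previewText(new double[] {123456.78, 98745.43, 7.0, 8.0}, 2));
    // [123456.78, 98745.43, ...]
  }
}
```

Like a file manager's preview tile, this gives every dataset a single common 
representation (a short string) regardless of its underlying type or size.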
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036175#comment-17036175
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585729825
 
 
   @paul-rogers , @vvysotskyi 
   I was looking at the HDF5 library and it turns out that there is metadata 
available for the datasets.  Therefore I think I'm going to do three things 
after removing the config option:
   
   1.  Add new columns to the metadata view with the dataset information (size, 
data type, dimensions)
   2.  Check the data size prior to projection and truncate it if it is too 
big. 
   3.  Update the documentation. 
   
   Does this sound like a good plan?
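
   Step 2 of the plan (check the data size prior to projection and truncate if 
too big) could be sketched roughly as below. `PREVIEW_LIMIT` is an assumed cap, 
not an actual Drill or plugin option:

```java
import java.util.Arrays;

// Rough sketch of a pre-projection size check for the metadata preview:
// datasets above an assumed element cap are truncated before being
// written to the vector. PREVIEW_LIMIT and the warning are illustrative.
public class PreviewTruncate {

  static final int PREVIEW_LIMIT = 16;

  static int[] previewOf(int[] dataset) {
    if (dataset.length <= PREVIEW_LIMIT) {
      return dataset;   // small enough: project as-is
    }
    // The real reader would log a warning that the preview was truncated.
    return Arrays.copyOf(dataset, PREVIEW_LIMIT);
  }

  public static void main(String[] args) {
    System.out.println(previewOf(new int[1000]).length);       // 16
    System.out.println(previewOf(new int[] {1, 2, 3}).length); // 3
  }
}
```

The same check would apply per dataset, so one oversized dataset no longer 
fails the whole metadata query.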
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035932#comment-17035932
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585561327
 
 
   As it turns out, you'll get the same error if you try to do a normal query 
on the data. So, we need a rule for what to do when a column is too large even 
at runtime: you can fail the query (as now), truncate the data, or split the 
data into multiple rows.
   
   Here is where it would be super-helpful if Drill supported normal SQL 
warnings: then we could truncate the data when it is optional (as here), and 
issue a warning. Sigh.
   
   Working around a data size limitation is not a good use for a plugin config 
option. The user would need a different config for each file.
   
   Can you, instead, support some special column that is the preview? Don't 
include the preview column in SELECT *. (I think there is an EVF trick for 
that; I'll have to refresh my memory. I think I added it for one of your other 
plugins.) If the user wants the preview, they can ask `SELECT name, preview 
FROM ...`.
   
   Otherwise, the error is correct: the data is too large for a Drill column.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035900#comment-17035900
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585540955
 
 
   @paul-rogers 
   I don't really think it's useful to return that much data, and that is why I 
added the config option to disable it.  I do think having a data preview can be 
useful, but with a large dataset the utility diminishes pretty quickly.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035898#comment-17035898
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585540050
 
 
   @cgivre, thanks for the stack traces. One of the annoying aspects of Drill 
errors are the first two chunks of traces: they will always be the same as they 
are on the client side.
   
   The 16MB error is basically saying that we are writing more than 16MB for a 
single column in a single row. This is generally Not a Good Thing.
   
   I guess I'm surprised that you are able to get that error, however. You said 
you are writing `INT` values. A batch can have at most 32K values. 32K * 4 = 
128K. If you have `BIGINT`, it is 256K. The same is true if you write a 
`DOUBLE`.
   
   Looks like your code may be writing an array, given the "doubleMatrixHelper" 
name. If so, then you can get to 16 MB if you write a square matrix with more 
than sqrt(16M/8) ~= 1400 elements per side. Are you?
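For what it's worth, that back-of-the-envelope estimate can be checked directly. This is an illustrative sketch (the constant and method names here are mine, not Drill's API); it computes the smallest square matrix of 8-byte doubles whose single-value encoding exceeds 16 MB:

```java
public class VectorLimitCheck {
    // Drill's per-value limit from the error message: 16 MB.
    public static final long MAX_VECTOR_BYTES = 16L * 1024 * 1024;

    // Bytes needed to store an n x n matrix of 8-byte doubles as one column value.
    public static long squareMatrixBytes(int n) {
        return (long) n * n * 8;
    }

    // Smallest side length n for which an n x n double matrix exceeds the limit.
    public static int smallestOverflowingSide() {
        int n = 1;
        while (squareMatrixBytes(n) <= MAX_VECTOR_BYTES) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // sqrt(16 MB / 8 bytes) = sqrt(2,097,152) ~= 1448, so overflow starts at 1449.
        System.out.println(smallestOverflowingSide());
    }
}
```

So a square double matrix around 1449 x 1449 or larger trips the limit, consistent with the sqrt(16M/8) ~= 1400 figure above.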
   
   The problems with such large arrays are several. First, clients can't consume 
them; they have to be flattened, and you will get 16M / 8 = 2M rows as a result. 
Second, allocating buffers larger than 16 MB will fragment memory.
   
   We can provide an option to disable the 16MB limit. But, this is using a 
table saw without the guard: it will work sometimes and cause mayhem other 
times. It also does not help with the other issue: that an xDBC client can't 
really consume that volume of data.
   
   What is the use case here to a) show such large volumes of data in a schema 
view, and b) retrieve that volume of data even in a data query?
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035595#comment-17035595
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585353181
 
 
   Here are the stack traces:
   ```
   apache drill> select *
   2..semicolon> from dfs.test.`eFitOut.h5`;
   Error: RESOURCE ERROR: One or more nodes ran out of memory while executing 
the query.
   
   A single column value is larger than the maximum allowed size of 16 MB
   Fragment 0:0
   
   [Error Id: 8f722576-9dfc-4462-8acd-dcba76815f37 on localhost:31010] 
(state=,code=0)
   java.sql.SQLException: RESOURCE ERROR: One or more nodes ran out of memory 
while executing the query.
   
   A single column value is larger than the maximum allowed size of 16 MB
   Fragment 0:0
   
   [Error Id: 8f722576-9dfc-4462-8acd-dcba76815f37 on localhost:31010]
   
   
    at org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:537)
    at org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:609)
    at org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1278)
    at org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:58)
    at org.apache.calcite.avatica.AvaticaConnection$1.execute(AvaticaConnection.java:667)
    at org.apache.drill.jdbc.impl.DrillMetaImpl.prepareAndExecute(DrillMetaImpl.java:1102)
    at org.apache.drill.jdbc.impl.DrillMetaImpl.prepareAndExecute(DrillMetaImpl.java:1113)
    at org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:675)
    at org.apache.drill.jdbc.impl.DrillConnectionImpl.prepareAndExecuteInternal(DrillConnectionImpl.java:200)
    at org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:156)
    at org.apache.calcite.avatica.AvaticaStatement.execute(AvaticaStatement.java:217)
    at sqlline.Commands.executeSingleQuery(Commands.java:1054)
    at sqlline.Commands.execute(Commands.java:1003)
    at sqlline.Commands.sql(Commands.java:967)
    at sqlline.SqlLine.dispatch(SqlLine.java:734)
    at sqlline.SqlLine.begin(SqlLine.java:541)
    at sqlline.SqlLine.start(SqlLine.java:267)
    at sqlline.SqlLine.main(SqlLine.java:206)
   Caused by: org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
   
   A single column value is larger than the maximum allowed size of 16 MB
   Fragment 0:0
   
   [Error Id: 8f722576-9dfc-4462-8acd-dcba76815f37 on localhost:31010]
    at org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:125)
    at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:422)
    at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:96)
    at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:273)
    at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:243)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at
[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035574#comment-17035574
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585347032
 
 
   @paul-rogers 
   Here's what's happening:
   ```
   apache drill> select *
   2..semicolon> from dfs.test.`eFitOut.h5`;
   Error: RESOURCE ERROR: One or more nodes ran out of memory while executing 
the query.
   
   A single column value is larger than the maximum allowed size of 16 MB
   Fragment 0:0
   
   [Error Id: 1d3f37ca-e7e3-48f5-9c7d-cb24bd494a67 on localhost:31010] 
(state=,code=0)
   ```
   When I dug into this, I found that it was one of the dataset columns that 
has a single-cell value greater than 16 MB.  This PR stops the reader from 
attempting to retrieve the datasets, so we avoid the whole issue.  
   
   What also made me do this is that even if you only select other columns, you 
still get the error: 
   ```
   apache drill> select path, data_type
   2..semicolon> from dfs.test.`eFitOut.h5`;
   Error: RESOURCE ERROR: One or more nodes ran out of memory while executing 
the query.
   
   A single column value is larger than the maximum allowed size of 16 MB
   Fragment 0:0
   
   [Error Id: 1af14cb0-9bce-488a-9d2d-aca5736670e3 on localhost:31010] 
(state=,code=0)
   apache drill>
   ```
   
   So, to conclude:
   1. There may be a bug in the EVF projection with large fields.  (I don't know...)
   2. This PR fixes the issue for HDF5 by removing the datasets from the metadata view.
   
   I should note that when the datasets are not projected in the metadata view, 
the queries execute without issues.
   

   
   
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035566#comment-17035566
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585341741
 
 
   I'm a bit confused by the crash-at-16MB part. The problem description is vague. 
Is there a stack trace somewhere?
   
   EVF is designed to limit individual vectors to 16MB. Once you hit that size, 
EVF does an "overflow" move: it copies the last record (the one that does not 
fit) into a new batch, then tells you to return the now-full batch.
   
   If you are seeing a crash, it could be that there is a bug in the overflow 
logic. (That logic is quite complex.) The proper fix, then, would be for me to 
find and fix that bug.
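To make the overflow move concrete, here is a toy model of the mechanism described above; the names, the budget constant, and the list-of-batches shape are illustrative stand-ins, not EVF's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class OverflowSketch {
    // Stand-in for the 16 MB vector limit.
    public static final int BATCH_BUDGET = 16;

    // Packs fixed-size values into batches. A value that would overflow the
    // current batch starts a fresh batch, and the now-full batch is "returned"
    // (here: added to the result list), mimicking EVF's overflow move.
    public static List<List<Integer>> pack(int[] values, int valueSize) {
        if (valueSize > BATCH_BUDGET) {
            // The one case overflow cannot rescue: a single value larger than the
            // budget, analogous to Drill's "single column value is larger than the
            // maximum allowed size of 16 MB" error.
            throw new IllegalStateException("single value exceeds the batch budget");
        }
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int used = 0;
        for (int v : values) {
            if (used + valueSize > BATCH_BUDGET) {
                batches.add(current);        // ship the now-full batch downstream
                current = new ArrayList<>(); // overflow: value goes into a new batch
                used = 0;
            }
            current.add(v);
            used += valueSize;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Five values of size 6 against a budget of 16: two fit per batch.
        System.out.println(pack(new int[]{1, 2, 3, 4, 5}, 6)); // prints [[1, 2], [3, 4], [5]]
    }
}
```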
   
   Regarding projection: yes, EVF handles projection. You can ask for writers 
for all your columns, EVF gives you a "dummy" writer for those that are not 
projected. While top-level columns can be handled by a plugin easily (just set 
some flags, say), nested columns are very hard to implement in the plugin. EVF 
provides a uniform way to handle projection at all levels. And, for top level 
arrays such as `column`, EVF also handles per-element projection.
   
   As a result, the only difference between EVF-based projection and 
roll-your-own is that, with EVF, the easiest path is to read the data, give it 
to the column writer, and let the column writer throw it away. This works well for 
sequential formats such as JSON and CSV.
   
   If your format is random-access (you have to request each column, as in 
Parquet), then it is better to ask if the column is projected. But, if your 
data structure is nested, you have to do this at each level.
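A minimal illustration of the two styles described above; `ColumnWriter`, `RealWriter`, and `DummyWriter` here are toy types of mine, not EVF's actual classes:

```java
import java.util.Set;

public class ProjectionSketch {
    public interface ColumnWriter { void write(int v); }

    // Stores the value: the column is projected.
    public static final class RealWriter implements ColumnWriter {
        public int last;
        public void write(int v) { last = v; }
    }

    // Silently discards the value: the column is not projected.
    public static final class DummyWriter implements ColumnWriter {
        public void write(int v) { }
    }

    // The framework hands the plugin a writer for every requested column,
    // real or dummy, so the plugin never has to check projection itself.
    public static ColumnWriter writerFor(String col, Set<String> projected) {
        return projected.contains(col) ? new RealWriter() : new DummyWriter();
    }

    public static void main(String[] args) {
        Set<String> projected = Set.of("path");

        // Sequential style (JSON, CSV): always read, let a dummy writer discard.
        ColumnWriter path = writerFor("path", projected);     // real writer
        ColumnWriter data = writerFor("int_data", projected); // dummy writer
        path.write(1);
        data.write(2); // read anyway; silently thrown away

        // Random-access style (Parquet-like): ask first, skip the read entirely.
        if (projected.contains("int_data")) {
            // only then pay the cost of fetching the dataset
        }
    }
}
```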
   
   So, with that explanation out of the way, what about EVF projection is not 
working the way roll-your-own did? Let's figure that out and fix it.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035472#comment-17035472
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585272602
 
 
   @vvysotskyi 
   Thanks for your responses and feedback.   Maybe it would help if you saw how 
this worked:
   
   ```
   apache drill> select *
   2..semicolon> from dfs.test.`dset.h5`;
   
   +-------+-----------+-----------+--------------------------------------------------------------------------+
   | path  | data_type | file_name | int_data                                                                 |
   +-------+-----------+-----------+--------------------------------------------------------------------------+
   | /dset | DATASET   | dset.h5   | [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] |
   +-------+-----------+-----------+--------------------------------------------------------------------------+
   ```
   The column in question is the `int_data` column.  What's happening 
is that if that column is greater than 16 MB, Drill runs out of memory.  If a user is 
actually interested in analyzing that data, they should use a data query, 
which looks like this:
   ```
   apache drill> SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', 
defaultPath => '/dset'));
   +-----------+-----------+-----------+-----------+-----------+-----------+
   | int_col_0 | int_col_1 | int_col_2 | int_col_3 | int_col_4 | int_col_5 |
   +-----------+-----------+-----------+-----------+-----------+-----------+
   | 1         | 2         | 3         | 4         | 5         | 6         |
   | 7         | 8         | 9         | 10        | 11        | 12        |
   | 13        | 14        | 15        | 16        | 17        | 18        |
   | 19        | 20        | 21        | 22        | 23        | 24        |
   +-----------+-----------+-----------+-----------+-----------+-----------+
   4 rows selected (0.172 seconds)
   ```
   My point here is that there's really no practical purpose for using this 
data in a query other than perhaps to see if it is populated at all.  A user 
certainly wouldn't want to aggregate it.
   
   All that is happening in this PR is changing the default behavior in the 
metadata view.  I set the default to `false` so that a user would explicitly 
have to enable this capability. 
   
   If you don't like this option, what would you suggest as an alternative?  
I don't think the current situation, where Drill runs out of memory and 
crashes, is ideal...
   
 





[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035445#comment-17035445
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

vvysotskyi commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585260482
 
 
   @cgivre, in your previous comment, you have mentioned that when plugin 
itself handled projection pushdown, this issue wasn't observed, so I assumed 
that it wasn't done by EVF (at least in this case).
   
   Regarding user experience, I doubt that introducing this config option would 
improve it; it just confuses the user: `SELECT *` should return all the records, 
but it will ignore a specific column. What behavior is expected if the user 
specifies this column explicitly, adds a filter on it, or even an aggregation?
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035426#comment-17035426
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585250445
 
 
   @vvysotskyi 
   I was under the impression that the EVF handled projection pushdown 
automatically.  @paul-rogers can you comment?
   
   Regardless, if I were to add projection pushdown to this plugin, `SELECT *` 
metadata queries would still fail for large files.  I think the config option 
gives a better user experience: when enabled, `SELECT *` queries work. 

 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035362#comment-17035362
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

vvysotskyi commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585212808
 
 
   @cgivre, thanks for the explanation. The easiest way is not always the best 
way.
   In this case, I think we should add support for projection pushdown. As 
an example, you may look at the Avro format plugin: it also uses EVF but supports 
projection pushdown.
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035330#comment-17035330
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585184636
 
 
   @vvysotskyi Let me give you some context. 
   This plugin has two ways of interacting with HDF5 files: metadata queries 
and dataset queries.  HDF5 is like a filesystem within a file, so it can 
contain many datasets.  The dataset query looks at a specific dataset and 
projects the columns and rows as you would expect.
   
   Metadata queries are intended to explore the HDF5 file itself rather than an 
individual dataset.  As currently implemented, in metadata queries the plugin 
will return the filename, paths, and dataset types from the HDF5 file.  Here's 
where the problem arose: the metadata query also maps each dataset to a cell 
in each row.  This is useful because the user gets a preview of the data that 
is actually in each dataset; however, if that dataset is larger than 16 MB, Drill 
crashes.  When I originally implemented this (before EVF) this wasn't an issue 
because the plugin itself handled projection pushdown, and therefore all the 
user had to do was exclude the dataset from the query.  However, with EVF it 
doesn't work that way. 
   
   Therefore the options are:
   1. Remove this preview functionality entirely.
   2. Select some small amount from each dataset and project that in a metadata query.
   3. Add a config option to not generate the preview columns in metadata queries.
   4. Convert the preview to a string and truncate it at the size limit.
   
   Of these options, option 3 felt the easiest and most useful to me as it 
preserved the functionality and gave the users a way to make it work. 
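For comparison, option 4 above could be sketched as follows; the method name and the (deliberately small) character cap are hypothetical, not part of this PR:

```java
public class PreviewTruncation {
    // Illustrative cap, far below the 16 MB per-value limit.
    public static final int MAX_PREVIEW_CHARS = 1024;

    // Returns at most MAX_PREVIEW_CHARS of the rendered dataset preview,
    // appending an ellipsis marker when the value was cut off.
    public static String truncatePreview(String rendered) {
        if (rendered.length() <= MAX_PREVIEW_CHARS) {
            return rendered;
        }
        return rendered.substring(0, MAX_PREVIEW_CHARS) + "...";
    }

    public static void main(String[] args) {
        System.out.println(truncatePreview("[[1,2,3],[4,5,6]]"));       // short: unchanged
        System.out.println(truncatePreview("x".repeat(5000)).length()); // capped: 1027
    }
}
```

This trades fidelity for safety: the preview always fits in a vector, but it is no longer the full dataset value.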
   
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035231#comment-17035231
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

vvysotskyi commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585138433
 
 
   I don't think that it is a good idea to add a format config option to ignore a 
specific column for star queries. If the user wants to see specific columns, he 
should specify them in the query.
   
   By the way, if the data is too large, is it possible to read it in batches?
 



[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

2020-02-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034610#comment-17034610
 ] 

ASF GitHub Bot commented on DRILL-7578:
---

cgivre commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978
 
 
   # [DRILL-7578](https://issues.apache.org/jira/browse/DRILL-7578): HDF5 
Metadata Queries Fail with Large Files
   
   ## Description
   HDF5 metadata queries were failing for files with large datasets.  This PR 
adds a configuration option to avoid projecting the entire dataset in metadata 
queries and hence solves this issue.
   
   ## Documentation
   Added this to the HDF5 documentation:
   
   * `projectDatasetInMetadataQuery`:  When `true`, the plugin projects the 
dataset contents in metadata queries.  Should be `false` for large files.
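For context, a format config using the new option might look like the following; aside from `projectDatasetInMetadataQuery`, the field names are only my guess at the existing HDF5 plugin config shape, so treat this as an illustrative sketch:

```json
"hdf5": {
  "type": "hdf5",
  "extensions": ["h5"],
  "projectDatasetInMetadataQuery": false
}
```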
   
   ## Testing
   Added new unit test to cover this configuration.
   
 
