[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135267#comment-17135267
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre merged pull request #2069:
URL: https://github.com/apache/drill/pull/2069


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135197#comment-17135197
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r439841135



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   @vvysotskyi 
   Enum refactored and commits squashed.  Are we good to go?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135165#comment-17135165
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

vvysotskyi commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r439831855



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   It is ok to have such a column name, my objection was only related to 
the enum element names.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135162#comment-17135162
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r439830508



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   @vvysotskyi 
   Ok.. I see what you're getting at.  I'll refactor the `enum` with a value.  
I misunderstood what you were asking.  In any event, do you have any objection 
to the actual column name starting with an underscore?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135095#comment-17135095
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

vvysotskyi commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r439809024



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   Adding a field to enum wouldn't make things too complex, using enums in 
such way is a common practice. Regarding naming fields in other projects, the 
specific project has its own code-style rules.
   
   If you have a strong objection to this, please merge the PR as it is, I've 
just expressed my opinion about this.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-06-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134318#comment-17134318
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r439499328



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   @vvysotskyi 
   To clarify, the fields here are metadata fields and I wanted to make sure 
that they do not conflict with fields that might actually be in the users' 
data.  It is a fairly standard practice to begin metadata fields with an `_`.  
I've seen this done in many other platforms.  
   
   In any event, how would you like me to proceed with this?
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117357#comment-17117357
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

vvysotskyi commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r430079644



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   Please use naming according to JCC.

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {

Review comment:
   ```suggestion
 private enum IMPLICIT_STRING_COLUMN {
   ```

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -420,15 +510,57 @@ private void addColumnToArray(TupleWriter rowWriter, 
String name, TypeProtos.Min
 }
   }
 
+  private void addMetadataWriters() {
+for (String colName : IMPLICIT_STRING_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.VARCHAR);
+}
+for (String colName : IMPLICIT_TIMESTAMP_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.TIMESTAMP);
+}
+  }
+
+  private void addMetadataColumnsToArray(TupleWriter rowWriter, String name, 
MinorType type) {
+int index = rowWriter.tupleSchema().index(name);
+if (index == -1) {
+  ColumnMetadata colSchema = MetadataUtils.newScalar(name, type, 
TypeProtos.DataMode.OPTIONAL);
+  colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+  index = rowWriter.addColumn(colSchema);
+} else {
+  return;
+}
+metadataColumnWriters.add(rowWriter.scalar(index));
+  }
+
+  private void writeMetadata() {
+for (int i = 0; i < metadataColumnWriters.size(); i++) {

Review comment:
   I meant something like this:
   ```
   for (IMPLICIT_STRING_COLUMNS column : IMPLICIT_STRING_COLUMNS.values()) {
 String value = stringMetadata.get(column);
 int index = column.ordinal();
 if (value == null) {
   metadataColumnWriters.get(index).setNull();
 } else {
   metadataColumnWriters.get(index).setString(value);
 }
   }
   
   for (IMPLICIT_TIMESTAMP_COLUMNS column : 
IMPLICIT_TIMESTAMP_COLUMNS.values()) {
 Date timeValue = dateMetadata.get(column);
 int index = column.ordinal() + IMPLICIT_STRING_COLUMNS.values().length;
 if (timeValue == null) {
   metadataColumnWriters.get(index).setNull();
 } else {
   metadataColumnWriters.get(index).setTimestamp(new 
Instant(timeValue));
 }
   }
   ```

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,
+_content_status,
+_content_type,
+_creator,
+_description,
+_identifier,
+_keywords,
+_last_modified_by_user,
+_revision,
+_subject,
+_title
+  }
+
+  private enum IMPLICIT_TIMESTAMP_COLUMNS {

Review comment:
   ```suggestion
 private enum IMPLICIT_TIMESTAMP_COLUMN {
   ```

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   

[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117318#comment-17117318
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r430090264



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   @vvysotskyi 
   Are you referring to the field names?

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   Will do. Thanks for the review :-)

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   Actually, a question about that... Since these are fields that could 
appear in a SQL query, shouldn't they use snake case?

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {

Review comment:
   Fixed

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,
+_content_status,
+_content_type,
+_creator,
+_description,
+_identifier,
+_keywords,
+_last_modified_by_user,
+_revision,
+_subject,
+_title
+  }
+
+  private enum IMPLICIT_TIMESTAMP_COLUMNS {

Review comment:
   Fixed

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -56,60 +61,70 @@
   private static final Logger logger = 
org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class);
 
   private static final String SAFE_WILDCARD = "_$";
-
   private static final String SAFE_SEPARATOR = "_";
-
   private static final String PARSER_WILDCARD = ".*";
-
   private static final String HEADER_NEW_LINE_REPLACEMENT = "__";
-
   private static final String MISSING_FIELD_NAME_HEADER = "field_";
 
-  private static final int ROW_CACHE_SIZE = 100;
+  private enum IMPLICIT_STRING_COLUMNS {
+_category,

Review comment:
   @vvysotskyi 
   The point of those `enum` was to be a list of the 

[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115644#comment-17115644
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429710656



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -420,15 +510,57 @@ private void addColumnToArray(TupleWriter rowWriter, 
String name, TypeProtos.Min
 }
   }
 
+  private void addMetadataWriters() {
+for (String colName : IMPLICIT_STRING_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.VARCHAR);
+}
+for (String colName : IMPLICIT_TIMESTAMP_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.TIMESTAMP);
+}
+  }
+
+  private void addMetadataColumnsToArray(TupleWriter rowWriter, String name, 
MinorType type) {
+int index = rowWriter.tupleSchema().index(name);
+if (index == -1) {
+  ColumnMetadata colSchema = MetadataUtils.newScalar(name, type, 
TypeProtos.DataMode.OPTIONAL);
+  colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+  index = rowWriter.addColumn(colSchema);
+} else {
+  return;
+}
+metadataColumnWriters.add(rowWriter.scalar(index));
+  }
+
+  private void writeMetadata() {
+for (int i = 0; i < metadataColumnWriters.size(); i++) {

Review comment:
   @vvysotskyi 
   I refactored this method considerably now that it uses enums.  I also 
followed @Sanel's advice and used a much simpler counter. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115642#comment-17115642
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429710349



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -420,15 +510,57 @@ private void addColumnToArray(TupleWriter rowWriter, 
String name, TypeProtos.Min
 }
   }
 
+  private void addMetadataWriters() {
+for (String colName : IMPLICIT_STRING_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.VARCHAR);
+}
+for (String colName : IMPLICIT_TIMESTAMP_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.TIMESTAMP);
+}
+  }
+
+  private void addMetadataColumnsToArray(TupleWriter rowWriter, String name, 
MinorType type) {
+int index = rowWriter.tupleSchema().index(name);
+if (index == -1) {
+  ColumnMetadata colSchema = MetadataUtils.newScalar(name, type, 
TypeProtos.DataMode.OPTIONAL);
+  colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+  index = rowWriter.addColumn(colSchema);
+} else {
+  return;
+}
+metadataColumnWriters.add(rowWriter.scalar(index));
+  }
+
+  private void writeMetadata() {
+for (int i = 0; i < metadataColumnWriters.size(); i++) {
+  if (i < IMPLICIT_STRING_COLUMN_NAMES.size()) {

Review comment:
   @Sanel, I like that.  Done!
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115483#comment-17115483
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429675715



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;
+
+  private CoreProperties fileMetadata;

Review comment:
   Refactored with `fileMetadata` as local variable.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115480#comment-17115480
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429675286



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;

Review comment:
   The `getCoreProperties()` unfortunately is not in the `Workbook` class.  
I'm remembering now that was why I had to make that cast. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115476#comment-17115476
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429675060



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -348,6 +405,27 @@ private boolean nextLine(RowSetLoader rowWriter) {
 }
   }
 
+  private void populateMetadata() {
+stringMetadata = new HashMap<>();
+dateMetadata = new HashMap<>();
+
+stringMetadata.put(IMPLICIT_STRING_COLUMN_NAMES.get(0), 
fileMetadata.getCategory());

Review comment:
   Refactored with enum.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115472#comment-17115472
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

vvysotskyi commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429674200



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;

Review comment:
   This class uses only methods from the `Workbook` interface, so I don't 
see any reason for narrowing down the type to `StreamingWorkbook` here and 
adding explicit casts for this type.

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;
+
+  private CoreProperties fileMetadata;

Review comment:
   It is called in a single method only. May be declared a local variable 
in this method instead of adding the class field.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115463#comment-17115463
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429672215



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -147,10 +185,15 @@ private void 
openFile(FileScanFramework.FileSchemaNegotiator negotiator) {
   fsStream = 
negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
 
   // Open streaming reader
-  workbook = StreamingReader.builder()
+  Workbook workbook = StreamingReader.builder()
 .rowCacheSize(ROW_CACHE_SIZE)
 .bufferSize(BUFFER_SIZE)
+.setReadCoreProperties(true)
 .open(fsStream);
+
+  swb =(StreamingWorkbook)workbook;

Review comment:
   Fixed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115461#comment-17115461
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429672084



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;

Review comment:
   Renamed `streamingWorkbook`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115456#comment-17115456
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429672042



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -420,15 +510,57 @@ private void addColumnToArray(TupleWriter rowWriter, 
String name, TypeProtos.Min
 }
   }
 
+  private void addMetadataWriters() {
+for (String colName : IMPLICIT_STRING_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.VARCHAR);
+}
+for (String colName : IMPLICIT_TIMESTAMP_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.TIMESTAMP);
+}
+  }
+
+  private void addMetadataColumnsToArray(TupleWriter rowWriter, String name, 
MinorType type) {
+int index = rowWriter.tupleSchema().index(name);
+if (index == -1) {
+  ColumnMetadata colSchema = MetadataUtils.newScalar(name, type, 
TypeProtos.DataMode.OPTIONAL);
+  colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+  index = rowWriter.addColumn(colSchema);
+} else {
+  return;
+}

Review comment:
   I consolidated these two methods. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115454#comment-17115454
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429671379



##
File path: 
contrib/format-excel/src/test/java/org/apache/drill/exec/store/excel/TestExcelFormat.java
##
@@ -112,6 +112,42 @@ public void testExplicitAllQuery() throws RpcException {
 new RowSetComparison(expected).verifyAndClearAll(results);
   }
 
+
+  @Test
+  public void testExplicitMetadataQuery() throws RpcException {
+String sql =
+  "SELECT _category, _content_status, _content_type, _creator, 
_description, _identifier, _keywords, _last_modified_by_user, _revision, 
_subject, _title, _created," +
+"_last_printed, _modified FROM cp.`excel/test_data.xlsx` LIMIT 1";
+
+QueryBuilder q = client.queryBuilder().sql(sql);
+RowSet results = q.rowSet();
+results.print();

Review comment:
   Fixed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115444#comment-17115444
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429670873



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -199,7 +245,7 @@ private void getColumnHeaders(SchemaBuilder builder) {
 i++;
   }
   columnWriters = new ArrayList<>(columnCount);
-
+  metadataColumnWriters = new ArrayList<>();

Review comment:
   Moved initialization to top of method.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115442#comment-17115442
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429670719



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;
+
+  private CoreProperties fileMetadata;

Review comment:
   Yeah, so both the workbook and the `fileMetadata` fields are called a 
few times later in the reader.  It seemed to make sense to create class 
variables for them.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115439#comment-17115439
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429670630



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;

Review comment:
   @vvysotskyi 
   The way the underlying library is written, in order to access the metadata, 
we need the workbook to be a `StreamingWorkbook` type, but for whatever reason, 
the reader only returns the `Workbook` type.  





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114939#comment-17114939
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

sanel commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429577212



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -420,15 +510,57 @@ private void addColumnToArray(TupleWriter rowWriter, 
String name, TypeProtos.Min
 }
   }
 
+  private void addMetadataWriters() {
+for (String colName : IMPLICIT_STRING_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.VARCHAR);
+}
+for (String colName : IMPLICIT_TIMESTAMP_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.TIMESTAMP);
+}
+  }
+
+  private void addMetadataColumnsToArray(TupleWriter rowWriter, String name, 
MinorType type) {
+int index = rowWriter.tupleSchema().index(name);
+if (index == -1) {
+  ColumnMetadata colSchema = MetadataUtils.newScalar(name, type, 
TypeProtos.DataMode.OPTIONAL);
+  colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+  index = rowWriter.addColumn(colSchema);
+} else {
+  return;
+}
+metadataColumnWriters.add(rowWriter.scalar(index));
+  }
+
+  private void writeMetadata() {
+for (int i = 0; i < metadataColumnWriters.size(); i++) {
+  if (i < IMPLICIT_STRING_COLUMN_NAMES.size()) {

Review comment:
   Why just don't say something like:
   
   ```java
   int n = i - IMPLICIT_STRING_COLUMN_NAMES.size();
   if (n < 0) {
...
   } else {
...
... = dateMetadata.get(IMPLICIT_TIMESTAMP_COLUMN_NAMES.get(n);
   ...
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114922#comment-17114922
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

vvysotskyi commented on a change in pull request #2069:
URL: https://github.com/apache/drill/pull/2069#discussion_r429570614



##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;

Review comment:
   Is there any reason for using specific implementation instead of the 
interface here?

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;
+
+  private CoreProperties fileMetadata;

Review comment:
   This field is obtained from the previous one. Are there any reasons for 
adding it?

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -147,10 +185,15 @@ private void 
openFile(FileScanFramework.FileSchemaNegotiator negotiator) {
   fsStream = 
negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
 
   // Open streaming reader
-  workbook = StreamingReader.builder()
+  Workbook workbook = StreamingReader.builder()
 .rowCacheSize(ROW_CACHE_SIZE)
 .bufferSize(BUFFER_SIZE)
+.setReadCoreProperties(true)
 .open(fsStream);
+
+  swb =(StreamingWorkbook)workbook;

Review comment:
   ```suggestion
 swb = (StreamingWorkbook) workbook;
   ```

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -334,9 +382,18 @@ private boolean nextLine(RowSetLoader rowWriter) {
   colPosition++;
 }
 
+if (firstLine) {

Review comment:
   This if may be combined with the next one.

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -199,7 +245,7 @@ private void getColumnHeaders(SchemaBuilder builder) {
 i++;
   }
   columnWriters = new ArrayList<>(columnCount);
-
+  metadataColumnWriters = new ArrayList<>();

Review comment:
   This field is initialized here, and after the if-else block, it is 
initialized with the same value. And `columnWriters` value also.

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -348,6 +405,27 @@ private boolean nextLine(RowSetLoader rowWriter) {
 }
   }
 
+  private void populateMetadata() {
+stringMetadata = new HashMap<>();
+dateMetadata = new HashMap<>();
+
+stringMetadata.put(IMPLICIT_STRING_COLUMN_NAMES.get(0), 
fileMetadata.getCategory());

Review comment:
   Please do not rely on the index of the specific column. You may define 
an enum with implicit excel field names and use its elements here and iterate 
through all its elements where required instead of using the list.

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -420,15 +510,57 @@ private void addColumnToArray(TupleWriter rowWriter, 
String name, TypeProtos.Min
 }
   }
 
+  private void addMetadataWriters() {
+for (String colName : IMPLICIT_STRING_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.VARCHAR);
+}
+for (String colName : IMPLICIT_TIMESTAMP_COLUMN_NAMES) {
+  addMetadataColumnsToArray(rowWriter, colName, MinorType.TIMESTAMP);
+}
+  }
+
+  private void addMetadataColumnsToArray(TupleWriter rowWriter, String name, 
MinorType type) {
+int index = rowWriter.tupleSchema().index(name);
+if (index == -1) {
+  ColumnMetadata colSchema = MetadataUtils.newScalar(name, type, 
TypeProtos.DataMode.OPTIONAL);
+  colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+  index = rowWriter.addColumn(colSchema);
+} else {
+  return;
+}
+metadataColumnWriters.add(rowWriter.scalar(index));
+  }
+
+  private void writeMetadata() {
+for (int i = 0; i < metadataColumnWriters.size(); i++) {

Review comment:
   `metadataColumnWriters` is populated using 
`IMPLICIT_STRING_COLUMN_NAMES` and `IMPLICIT_TIMESTAMP_COLUMN_NAMES` lists. Is 
it possible instead of iterating through `metadataColumnWriters` values and 
adding a check for the index, iterate through values of the lists only?

##
File path: 
contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##
@@ -75,7 +83,9 @@
 
   private Row currentRow;
 
-  private Workbook workbook;
+  private StreamingWorkbook swb;


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114335#comment-17114335
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

vvysotskyi commented on pull request #2069:
URL: https://github.com/apache/drill/pull/2069#issuecomment-632861361


   @cgivre, I'll take a look this weekend.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114308#comment-17114308
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on pull request #2069:
URL: https://github.com/apache/drill/pull/2069#issuecomment-632839507


   @vvysotskyi 
   Any comments on this or is it good to go?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109265#comment-17109265
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre commented on pull request #2069:
URL: https://github.com/apache/drill/pull/2069#issuecomment-629728318


   @vvysotskyi 
   That is correct.  This follows the model for EVF readers where data is 
written to the writers and the EVF framework determines which columns to 
actually project.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-05-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108733#comment-17108733
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

vvysotskyi commented on pull request #2069:
URL: https://github.com/apache/drill/pull/2069#issuecomment-629542378


   @cgivre, will these "implicit" columns values be populated (but not 
returned) every time the file is read even they aren't specified in the query?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7723) Add Excel Metadata as Implicit Fields

2020-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096034#comment-17096034
 ] 

ASF GitHub Bot commented on DRILL-7723:
---

cgivre opened a new pull request #2069:
URL: https://github.com/apache/drill/pull/2069


   # [DRILL-7723](https://issues.apache.org/jira/browse/DRILL-7723): Add Excel 
Metadata as Implicit Fields
   
   ## Description
   
   Excel files contain a significant number of metadata fields in them. This PR 
adds the ability to access these fields via implicit columns. The columns are:
   
   ```
   _category
   _content_status
   _content_type;
   _creator
   _description
   _identifier
   _keywords
   _last_modified_by_user
   _revision
   _subject
   _title
   _created
   _last_printed
   _modified
   ```
   These fields are not projected in star queries so there is no effect on 
existing queries.
   
   ## Documentation
   Updated the `README.md` file with the implicit field information.
   
   ## Testing
   Added a unit test for the metadata fields.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Excel Metadata as Implicit Fields
> -
>
> Key: DRILL-7723
> URL: https://issues.apache.org/jira/browse/DRILL-7723
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Text  CSV
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> Excel files contain a significant number of metadata fields in them. This PR 
> adds the ability to access these fields via implicit columns. The columns are:
> _category
> _content_status
> _content_type;
> _creator
> _description
> _identifier
> _keywords
> _last_modified_by_user
> _revision
> _subject
> _title
> _created
> _last_printed
> _modified
> These fields are not projected in star queries so there is no effect on 
> existing queries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)