[GitHub] [drill] cgivre opened a new pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader

2020-03-12 Thread GitBox
cgivre opened a new pull request #2024: DRILL-7641 Convert Excel Reader to use 
Streaming Reader
URL: https://github.com/apache/drill/pull/2024
 
 
   # [DRILL-7641](https://issues.apache.org/jira/browse/DRILL-7641): Convert 
Excel Reader to use Streaming Reader
   
   ## Description
   The current implementation of the Excel reader uses the Apache POI reader, 
which uses excessive amounts of memory. As a result, attempting to read large 
Excel files will cause out of memory errors. 
   This PR converts the format plugin to use a streaming reader, based still on 
the POI library.  The documentation for the streaming reader can be found here. 
[1]. This library was billed as a drop in replacement for the POI reader, 
however I had to make some minor changes to the batch reader to get this to 
work.  Minor code cleanup as well. 
   
   [1]: https://github.com/pjfanning/excel-streaming-reader
   
   ## Documentation
   No user visible changes.
   
   ## Testing
   All unit tests from the original plugin pass.  Additionally, I tested this 
with large Excel files on my local machine and Drill was able to query them 
whereas before this PR, Drill would run out of memory.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (DRILL-7641) Convert Excel Reader to Use Streaming Reader

2020-03-12 Thread Charles Givre (Jira)
Charles Givre created DRILL-7641:


 Summary: Convert Excel Reader to Use Streaming Reader
 Key: DRILL-7641
 URL: https://issues.apache.org/jira/browse/DRILL-7641
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Text  CSV
Affects Versions: 1.17.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.18.0


The current implementation of the Excel reader uses the Apache POI reader, 
which uses excessive amounts of memory. As a result, attempting to read large 
Excel files will cause out of memory errors. 

This PR converts the format plugin to use a streaming reader, based still on 
the POI library.  The documentation for the streaming reader can be found here. 
[1]

All unit tests pass and I tested the plugin with some large Excel files on my 
computer.

[1]: [https://github.com/pjfanning/excel-streaming-reader]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [drill] paul-rogers opened a new pull request #2023: DRILL-7640: EVF-based JSON Loader

2020-03-12 Thread GitBox
paul-rogers opened a new pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023
 
 
   # [DRILL-7640](https://issues.apache.org/jira/browse/DRILL-7640): EVF-based 
JSON Loader
   
   ## Description
   
   Builds on the JSON structure parser and several other PRs to provide an 
enhanced, robust mechanism to read JSON data into value vectors via the EVF. 
This is not the JSON reader, rather it is the "V2" version of the JsonProcessor 
which does the actual JSON parsing/loading work.
   
   This PR is a partial do-over of an earlier PR, DRILL-6953. This PR contains 
only the lower-level JSON loader. A new PR for DRILL-6953 will follow which 
will add the JSON reader and fix any compatibility issues. Doing the PR 
separately reduces the size of each PR, making them easier to review and manage.
   
   The concept is that we already have the JSON structure parser (thanks to 
those that reviewed those PRs!). The structure parser emits *events* to 
*listeners*. This PR implements the listeners which use EVF to write values to 
value vectors.
   
   Some new features provided by the JSON loader include:
   
   * Built in conversion from (almost) any JSON type to (almost) any other 
type. If a field starts a String, and shifts to Number, Drill will continue to 
write the value as `VARCHAR`.
   * Allow runs of nulls (`null`) and/or empty arrays (`[ ]`). Defer type 
selection until the first actual value appears. (Or, force selection of 
`Nullable VARCHAR` if the batch ends without seeing any type.)
   * An array with null entries `[ null ]` forces type selection to `Nullable 
VARCHAR` since we must count the null entries.
   * Support for a provided schema. If the input is ambiguous, or inconsistent, 
the provided schema states the desired column type; the JSON loader converts 
data to that type.
   * The provided schema allows "text mode" selection per-column. The JSON 
loader supports the "all text mode" setting from before, but now allows "text 
mode" per column for only those columns which are a problem.
   * The provided schema also allows a new "JSON mode": the field, and all its 
children, are read as JSON text. That is, if the JSON is `{a: {b: ["foo", 10, 
true]}}`, and column `a` is read as JSON, then the value of `a` is `"{b: 
["foo", 10, true]}"`.
   
   We can expect a number of tweaks and adjustments to be needed in a later PR 
when we get the existing tests to pass with the new JSON format plugin. The 
goal here is to simply get the bulk of the work reviewed separately.
   
   To be clear, in this PR, nothing other than unit tests uses this new code.
   
   This PR contains a few changes which duplicate those in the PR for 
DRILL-7633. Once DRILL-7633 is merged, this PR will be rebased and the 
overlapping changes will disappear.
   
   ## Documentation
   
   Once the JSON format plugin PR is submitted, users can use the provided 
column properties mentioned above. This PR does not yet expose them, so no 
documentation is needed yet.
   
   ## Testing
   
   Added unit tests for all of the cases and features described above. Also 
tests failure cases (such as JSON with inconsistent types which the JSON loader 
cannot handle.)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] dobesv closed pull request #1997: DRILL-7604: Allow session options to be set in HTTP queries

2020-03-12 Thread GitBox
dobesv closed pull request #1997: DRILL-7604: Allow session options to be set 
in HTTP queries
URL: https://github.com/apache/drill/pull/1997
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] dobesv commented on issue #1997: DRILL-7604: Allow session options to be set in HTTP queries

2020-03-12 Thread GitBox
dobesv commented on issue #1997: DRILL-7604: Allow session options to be set in 
HTTP queries
URL: https://github.com/apache/drill/pull/1997#issuecomment-598359346
 
 
   OK, I'll just set this aside for now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (DRILL-7640) EVF-based JSON Loader

2020-03-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7640:
--

 Summary: EVF-based JSON Loader
 Key: DRILL-7640
 URL: https://issues.apache.org/jira/browse/DRILL-7640
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Builds on the JSON structure parser and several other PRs to provide an 
enhanced, robust mechanism to read JSON data into value vectors via the EVF. 
This is not the JSON reader, rather it is the "V2" version of the 
\{{JsonProcessor}} which does the actual JSON parsing/loading work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [drill] paul-rogers commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues

2020-03-12 Thread GitBox
paul-rogers commented on a change in pull request #2010: DRILL-7620: Fix plugin 
mutability issues
URL: https://github.com/apache/drill/pull/2010#discussion_r391761129
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/StoragePluginRegistryImpl.java
 ##
 @@ -505,14 +530,19 @@ public StoragePluginConfig 
copyConfig(StoragePluginConfig orig) {
 
   @Override
   public StoragePluginConfig getDefinedConfig(String name) {
+try {
+  name = validateName(name);
+} catch (PluginException e) {
 
 Review comment:
   Added comment. The name is not valid so no config matches the name.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] asfgit closed pull request #2020: DRILL-7634: Rollup of code cleanup changes

2020-03-12 Thread GitBox
asfgit closed pull request #2020: DRILL-7634: Rollup of code cleanup changes
URL: https://github.com/apache/drill/pull/2020
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] asfgit closed pull request #2016: DRILL-7631: Updates to the Json Structure Parser

2020-03-12 Thread GitBox
asfgit closed pull request #2016: DRILL-7631: Updates to the Json Structure 
Parser
URL: https://github.com/apache/drill/pull/2016
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (DRILL-7639) Replace DBCP2 with HikariCP in RDBMS (JDBC) plugin

2020-03-12 Thread Arina Ielchiieva (Jira)
Arina Ielchiieva created DRILL-7639:
---

 Summary: Replace DBCP2 with HikariCP in RDBMS (JDBC) plugin
 Key: DRILL-7639
 URL: https://issues.apache.org/jira/browse/DRILL-7639
 Project: Apache Drill
  Issue Type: Task
Affects Versions: 1.17.0
Reporter: Arina Ielchiieva
Assignee: Arina Ielchiieva


Hikari is much faster and reliable than DBCP2.
See comparison and benchmarks:
https://beansroasted.wordpress.com/2017/07/29/connection-pool-analysis/
https://github.com/brettwooldridge/HikariCP





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [drill] agozhiy opened a new pull request #2022: DRILL-7637: Add an option to retrieve MapR SSL truststore/keystore cr…

2020-03-12 Thread GitBox
agozhiy opened a new pull request #2022: DRILL-7637: Add an option to retrieve 
MapR SSL truststore/keystore cr…
URL: https://github.com/apache/drill/pull/2022
 
 
   …edentials using MapR Web Security Manager
   
   # [DRILL-7637](https://issues.apache.org/jira/browse/DRILL-7637): Add an 
option to retrieve MapR SSL truststore/keystore credentials using MapR Web 
Security Manager
   
   ## Description
   An option added to provide MapR truststore/keystore credentials using MapR 
Web Security Manager instead of setting passwords openly in the config.  
   
   ## Documentation
   To use this feature user need to:
   - Install Drill on MapR platform.
   - Enable the corresponding option in the Drill config:
   `drill.exec.ssl.useMapRSSLConfig: true`
   - Pass this option with the connection string using sqlline: 
   `./bin/sqlline -u 
"jdbc:drill:drillbit=node1.cluster.com:31010;enableTLS=true;useMapRSSLConfig=true"`
   
   ## Testing
   - Unit tests passed w/ and w/o mapr profile.
   - Functional/Advanced tests passed.
   - Manually tested Web UI and sqlline with the feature enabled/disabled.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] arina-ielchiieva commented on issue #1997: DRILL-7604: Allow session options to be set in HTTP queries

2020-03-12 Thread GitBox
arina-ielchiieva commented on issue #1997: DRILL-7604: Allow session options to 
be set in HTTP queries
URL: https://github.com/apache/drill/pull/1997#issuecomment-598079843
 
 
   @dobesv the problem is that I don't really like parsing of the options 
introduced in this PR (its partially duplicates work done in Drill plus does 
parsing in slightly different way which will lead to discrepancies in future), 
there is much cleaner way to parse options, you need to pass option name, its 
kind and value. Please refer to my previous comment 
https://github.com/apache/drill/pull/1997/files#r391122758 so this PR needs 
re-working.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

2020-03-12 Thread GitBox
arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes 
for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r391479369
 
 

 ##
 File path: 
exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##
 @@ -96,11 +134,13 @@ public StructureType structureType() {
   public boolean isVariant() { return true; }
 
   @Override
-  public boolean isArray() { return type() == MinorType.LIST; }
+  public boolean isArray() {
+return super.isArray() || type() == MinorType.LIST;
+  }
 
   @Override
   public ColumnMetadata cloneEmpty() {
-return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty());
+return new VariantColumnMetadata(name, type, mode, new VariantSchema());
 
 Review comment:
   Well, it turned out the `typeString` is also used during column metadata 
serialization / deserialization. Type is serialized in schematic type to 
consume less space, plus it's also stored in the same way in Metastore (see 
org.apache.drill.exec.record.metadata.AbstractColumnMetadata#createColumnMetadata).
 So currently ser / de for `VariantColumnMetadata` will fail. Since you are 
planning to use it in JSON, we need to ensure this class can be properly 
serialized and deserialized. To ensure this we need to do two things:
   1. update `typeString` (I propose we do this in this PR).
   2. update schema parser (I can do it in separate PR).
   
   Back to `typeString`:
   1. If `VariantColumnMetadata` is based on `UNION`, we can describe it as 
`UNION` without types but we'll lose all types defined in `VariantSchema` or we 
can describe it as UNION and preserve types. The question is if 
we need types during ser / de?
   2. Similar case when `VariantColumnMetadata` is based on `LIST`,  we can 
describe it as ARRAY without types or with types ARRAY>.
   Since you know more about unions, I'll leave to you the final decision on 
how schematically union types should be represented, afterwards I'll adjust the 
schema parser.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues

2020-03-12 Thread GitBox
arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix 
plugin mutability issues
URL: https://github.com/apache/drill/pull/2010#discussion_r391471772
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/StoragePluginRegistry.java
 ##
 @@ -96,6 +96,23 @@ public PluginEncodingException(String msg, Exception e) {
*/
   void validatedPut(String name, StoragePluginConfig config) throws 
PluginException;
 
+  /**
+   * Set the plugin to the requested enabled state. Does nothing if the plugin
+   * is already in the requested state. If a formerly enabled plugin is
+   * disabled, moves the plugin from the in-memory cache to the ephemeral
+   * store. If a formerly disabled plugin is enabled, verifies that the plugin
+   * can be instantiated as for {@link #verifiedPut()}.
+   * 
+   * Use this method when changing state. Do not obtain the config and change
+   * the state directly, doing so will make the plugin config inconsistent
+   * with the internal state.
+   *
+   * @param name name of the plugin
+   * @param enabled {@code true} to enable the plugin, {@code false} to disable
+   * @throws PluginNotFoundException it the plugin name is not valid
 
 Review comment:
   Can't this exception be thrown is plugin is missing? Why only invalid name?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues

2020-03-12 Thread GitBox
arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix 
plugin mutability issues
URL: https://github.com/apache/drill/pull/2010#discussion_r391470413
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/sequencefile/SequenceFileFormatConfig.java
 ##
 @@ -20,35 +20,42 @@
 import com.fasterxml.jackson.annotation.JsonInclude;
 import com.fasterxml.jackson.annotation.JsonTypeName;
 import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
-
+import org.apache.drill.common.PlanStringBuilder;
 import org.apache.drill.common.logical.FormatPluginConfig;
 
 import java.util.List;
+import java.util.Objects;
 
 @JsonTypeName("sequencefile") @JsonInclude(JsonInclude.Include.NON_DEFAULT)
 public class SequenceFileFormatConfig implements FormatPluginConfig {
 
   public List extensions = ImmutableList.of();
 
-  @Override
-  public int hashCode() {
-return (extensions == null)? 0 : extensions.hashCode();
-  }
-
   public List getExtensions() {
 return extensions;
   }
 
+  @Override
+  public int hashCode() {
+return Objects.hash(extensions);
+  }
+
   @Override
   public boolean equals(Object obj) {
 if (this == obj) {
   return true;
-} else if (obj == null) {
+}
+if (obj == null || getClass() != obj.getClass()) {
   return false;
-} else if (getClass() == obj.getClass()) {
-  SequenceFileFormatConfig other = (SequenceFileFormatConfig) obj;
-  return (extensions == null)? (other.extensions == null) : 
extensions.equals(other.extensions);
 }
-return false;
+SequenceFileFormatConfig other = (SequenceFileFormatConfig) obj;
+return Objects.equals(extensions, other.extensions);
+  }
+
+  @Override
+  public String toString() {
+return new PlanStringBuilder(this)
+.field("extensions", extensions)
+.toString();
   }
 }
 
 Review comment:
   Please add new line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues

2020-03-12 Thread GitBox
arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix 
plugin mutability issues
URL: https://github.com/apache/drill/pull/2010#discussion_r391470057
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/StoragePluginRegistryImpl.java
 ##
 @@ -505,14 +530,19 @@ public StoragePluginConfig 
copyConfig(StoragePluginConfig orig) {
 
   @Override
   public StoragePluginConfig getDefinedConfig(String name) {
+try {
+  name = validateName(name);
+} catch (PluginException e) {
 
 Review comment:
   Should we log it somewhere or just ignore?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues

2020-03-12 Thread GitBox
arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix 
plugin mutability issues
URL: https://github.com/apache/drill/pull/2010#discussion_r391469634
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/StoragePluginRegistryImpl.java
 ##
 @@ -233,6 +233,17 @@ private void loadIntrinsicPlugins() throws 
PluginException {
 continue;
   }
   for (StoragePlugin sysPlugin : intrinsicPlugins) {
+// Enforce lower case names. Since the name of a system plugin
+// is "hard coded", we can't adjust it if it is not already
+// lower case. All we can do is fail to tell the developer that
+// something is wrong.
+String origName = sysPlugin.getName();
+String lcName = sysPlugin.getName().toLowerCase();
+if (!origName.equals(lcName)) {
+  throw new IllegalStateException(String.format(
+  "Plugin names must be lower case. System plugin name is not 
lower case: %s",
 
 Review comment:
   ```suggestion
 "Plugin names must be in lower case. System plugin name is not 
lower case: %s",
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

2020-03-12 Thread GitBox
paul-rogers commented on a change in pull request #2018: DRILL-7633: Fixes for 
union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r391431754
 
 

 ##
 File path: 
exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##
 @@ -96,11 +134,13 @@ public StructureType structureType() {
   public boolean isVariant() { return true; }
 
   @Override
-  public boolean isArray() { return type() == MinorType.LIST; }
+  public boolean isArray() {
+return super.isArray() || type() == MinorType.LIST;
+  }
 
   @Override
   public ColumnMetadata cloneEmpty() {
-return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty());
+return new VariantColumnMetadata(name, type, mode, new VariantSchema());
 
 Review comment:
   Nicely done! We discussed the problem with `UNION` a while back and realized 
it was a mess. Now I gotta deal with it...
   
   The `typeString()` version is used, I presume, in a provided schema. Correct?
   
   If so, then then the base class implementation is fine, the output will be 
just `UNION`. If we think of Drill's `UNION` as like a Visual Basic or 
PostgreSQL variant (or a Python variable), then the `UNION` type is sufficient: 
a `UNION` can hold anything; no need to restrict what it might or might not 
hold.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on issue #2016: DRILL-7631: Updates to the Json Structure Parser

2020-03-12 Thread GitBox
paul-rogers commented on issue #2016: DRILL-7631: Updates to the Json Structure 
Parser
URL: https://github.com/apache/drill/pull/2016#issuecomment-598033124
 
 
   @vvysotskyi, thanks much for your review! Addressed the review comments. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services