[GitHub] [drill] cgivre opened a new pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
cgivre opened a new pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader URL: https://github.com/apache/drill/pull/2024 # [DRILL-7641](https://issues.apache.org/jira/browse/DRILL-7641): Convert Excel Reader to use Streaming Reader ## Description The current implementation of the Excel reader uses the Apache POI reader, which uses excessive amounts of memory. As a result, attempting to read large Excel files will cause out of memory errors. This PR converts the format plugin to use a streaming reader, based still on the POI library. The documentation for the streaming reader can be found here. [1]. This library was billed as a drop in replacement for the POI reader, however I had to make some minor changes to the batch reader to get this to work. Minor code cleanup as well. [1]: https://github.com/pjfanning/excel-streaming-reader ## Documentation No user visible changes. ## Testing All unit tests from the original plugin pass. Additionally, I tested this with large Excel files on my local machine and Drill was able to query them whereas before this PR, Drill would run out of memory. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Created] (DRILL-7641) Convert Excel Reader to Use Streaming Reader
Charles Givre created DRILL-7641: Summary: Convert Excel Reader to Use Streaming Reader Key: DRILL-7641 URL: https://issues.apache.org/jira/browse/DRILL-7641 Project: Apache Drill Issue Type: Improvement Components: Storage - Text CSV Affects Versions: 1.17.0 Reporter: Charles Givre Assignee: Charles Givre Fix For: 1.18.0 The current implementation of the Excel reader uses the Apache POI reader, which uses excessive amounts of memory. As a result, attempting to read large Excel files will cause out of memory errors. This PR converts the format plugin to use a streaming reader, based still on the POI library. The documentation for the streaming reader can be found here. [1] All unit tests pass and I tested the plugin with some large Excel files on my computer. [1]: [https://github.com/pjfanning/excel-streaming-reader] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [drill] paul-rogers opened a new pull request #2023: DRILL-7640: EVF-based JSON Loader
paul-rogers opened a new pull request #2023: DRILL-7640: EVF-based JSON Loader URL: https://github.com/apache/drill/pull/2023 # [DRILL-7640](https://issues.apache.org/jira/browse/DRILL-7640): EVF-based JSON Loader ## Description Builds on the JSON structure parser and several other PRs to provide an enhanced, robust mechanism to read JSON data into value vectors via the EVF. This is not the JSON reader, rather it is the "V2" version of the JsonProcessor which does the actual JSON parsing/loading work. This PR is a partial do-over of an earlier PR, DRILL-6953. This PR contains only the lower-level JSON loader. A new PR for DRILL-6953 will follow which will add the JSON reader and fix any compatibility issues. Doing the PR separately reduces the size of each PR, making them easier to review and manage. The concept is that we already have the JSON structure parser (thanks to those that reviewed those PRs!). The structure parser emits *events* to *listeners*. This PR implements the listeners which use EVF to write values to value vectors. Some new features provided by the JSON loader include: * Built in conversion from (almost) any JSON type to (almost) any other type. If a field starts a String, and shifts to Number, Drill will continue to write the value as `VARCHAR`. * Allow runs of nulls (`null`) and/or empty arrays (`[ ]`). Defer type selection until the first actual value appears. (Or, force selection of `Nullable VARCHAR` if the batch ends without seeing any type.) * An array with null entries `[ null ]` forces type selection to `Nullable VARCHAR` since we must count the null entries. * Support for a provided schema. If the input is ambiguous, or inconsistent, the provided schema states the desired column type; the JSON loader converts data to that type. * The provided schema allows "text mode" selection per-column. The JSON loader supports the "all text mode" setting from before, but now allows "text mode" per column for only those columns which are a problem. * The provided schema also allows a new "JSON mode": the field, and all its children, are read as JSON text. That is, if the JSON is `{a: {b: ["foo", 10, true]}}`, and column `a` is read as JSON, then the value of `a` is `"{b: ["foo", 10, true]}"`. We can expect a number of tweaks and adjustments to be needed in a later PR when we get the existing tests to pass with the new JSON format plugin. The goal here is to simply get the bulk of the work reviewed separately. To be clear, in this PR, nothing other than unit tests uses this new code. This PR contains a few changes which duplicate those in the PR for DRILL-7633. Once DRILL-7633 is merged, this PR will be rebased and the overlapping changes will disappear. ## Documentation Once the JSON format plugin PR is submitted, users can use the provided column properties mentioned above. This PR does not yet expose them, so no documentation is needed yet. ## Testing Added unit tests for all of the cases and features described above. Also tests failure cases (such as JSON with inconsistent types which the JSON loader cannot handle.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] dobesv closed pull request #1997: DRILL-7604: Allow session options to be set in HTTP queries
dobesv closed pull request #1997: DRILL-7604: Allow session options to be set in HTTP queries URL: https://github.com/apache/drill/pull/1997 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] dobesv commented on issue #1997: DRILL-7604: Allow session options to be set in HTTP queries
dobesv commented on issue #1997: DRILL-7604: Allow session options to be set in HTTP queries URL: https://github.com/apache/drill/pull/1997#issuecomment-598359346 OK, I'll just set this aside for now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Created] (DRILL-7640) EVF-based JSON Loader
Paul Rogers created DRILL-7640: -- Summary: EVF-based JSON Loader Key: DRILL-7640 URL: https://issues.apache.org/jira/browse/DRILL-7640 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.17.0 Reporter: Paul Rogers Assignee: Paul Rogers Fix For: 1.18.0 Builds on the JSON structure parser and several other PRs to provide an enhanced, robust mechanism to read JSON data into value vectors via the EVF. This is not the JSON reader, rather it is the "V2" version of the \{{JsonProcessor}} which does the actual JSON parsing/loading work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [drill] paul-rogers commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues
paul-rogers commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues URL: https://github.com/apache/drill/pull/2010#discussion_r391761129 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/StoragePluginRegistryImpl.java ## @@ -505,14 +530,19 @@ public StoragePluginConfig copyConfig(StoragePluginConfig orig) { @Override public StoragePluginConfig getDefinedConfig(String name) { +try { + name = validateName(name); +} catch (PluginException e) { Review comment: Added comment. The name is not valid so no config matches the name. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] asfgit closed pull request #2020: DRILL-7634: Rollup of code cleanup changes
asfgit closed pull request #2020: DRILL-7634: Rollup of code cleanup changes URL: https://github.com/apache/drill/pull/2020 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] asfgit closed pull request #2016: DRILL-7631: Updates to the Json Structure Parser
asfgit closed pull request #2016: DRILL-7631: Updates to the Json Structure Parser URL: https://github.com/apache/drill/pull/2016 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Created] (DRILL-7639) Replace DBCP2 with HikariCP in RDBMS (JDBC) plugin
Arina Ielchiieva created DRILL-7639: --- Summary: Replace DBCP2 with HikariCP in RDBMS (JDBC) plugin Key: DRILL-7639 URL: https://issues.apache.org/jira/browse/DRILL-7639 Project: Apache Drill Issue Type: Task Affects Versions: 1.17.0 Reporter: Arina Ielchiieva Assignee: Arina Ielchiieva Hikari is much faster and reliable than DBCP2. See comparison and benchmarks: https://beansroasted.wordpress.com/2017/07/29/connection-pool-analysis/ https://github.com/brettwooldridge/HikariCP -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [drill] agozhiy opened a new pull request #2022: DRILL-7637: Add an option to retrieve MapR SSL truststore/keystore cr…
agozhiy opened a new pull request #2022: DRILL-7637: Add an option to retrieve MapR SSL truststore/keystore cr… URL: https://github.com/apache/drill/pull/2022 …edentials using MapR Web Security Manager # [DRILL-7637](https://issues.apache.org/jira/browse/DRILL-7637): Add an option to retrieve MapR SSL truststore/keystore credentials using MapR Web Security Manager ## Description An option added to provide MapR truststore/keystore credentials using MapR Web Security Manager instead of setting passwords openly in the config. ## Documentation To use this feature user need to: - Install Drill on MapR platform. - Enable the corresponding option in the Drill config: `drill.exec.ssl.useMapRSSLConfig: true` - Pass this option with the connection string using sqlline: `./bin/sqlline -u "jdbc:drill:drillbit=node1.cluster.com:31010;enableTLS=true;useMapRSSLConfig=true"` ## Testing - Unit tests passed w/ and w/o mapr profile. - Functional/Advanced tests passed. - Manually tested Web UI and sqlline with the feature enabled/disabled. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] arina-ielchiieva commented on issue #1997: DRILL-7604: Allow session options to be set in HTTP queries
arina-ielchiieva commented on issue #1997: DRILL-7604: Allow session options to be set in HTTP queries URL: https://github.com/apache/drill/pull/1997#issuecomment-598079843 @dobesv the problem is that I don't really like parsing of the options introduced in this PR (its partially duplicates work done in Drill plus does parsing in slightly different way which will lead to discrepancies in future), there is much cleaner way to parse options, you need to pass option name, its kind and value. Please refer to my previous comment https://github.com/apache/drill/pull/1997/files#r391122758 so this PR needs re-working. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors URL: https://github.com/apache/drill/pull/2018#discussion_r391479369 ## File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java ## @@ -96,11 +134,13 @@ public StructureType structureType() { public boolean isVariant() { return true; } @Override - public boolean isArray() { return type() == MinorType.LIST; } + public boolean isArray() { +return super.isArray() || type() == MinorType.LIST; + } @Override public ColumnMetadata cloneEmpty() { -return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty()); +return new VariantColumnMetadata(name, type, mode, new VariantSchema()); Review comment: Well, it turned out the `typeString` is also used during column metadata serialization / deserialization. Type is serialized in schematic type to consume less space, plus it's also stored in the same way in Metastore (see org.apache.drill.exec.record.metadata.AbstractColumnMetadata#createColumnMetadata). So currently ser / de for `VariantColumnMetadata` will fail. Since you are planning to use it in JSON, we need to ensure this class can be properly serialized and deserialized. To ensure this we need to do two things: 1. update `typeString` (I propose we do this in this PR). 2. update schema parser (I can do it in separate PR). Back to `typeString`: 1. If `VariantColumnMetadata` is based on `UNION`, we can describe it as `UNION` without types but we'll lose all types defined in `VariantSchema` or we can describe it as UNION and preserve types. The question is if we need types during ser / de? 2. Similar case when `VariantColumnMetadata` is based on `LIST`, we can describe it as ARRAY without types or with types ARRAY>. Since you know more about unions, I'll leave to you the final decision on how schematically union types should be represented, afterwards I'll adjust the schema parser. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues
arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues URL: https://github.com/apache/drill/pull/2010#discussion_r391471772 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/StoragePluginRegistry.java ## @@ -96,6 +96,23 @@ public PluginEncodingException(String msg, Exception e) { */ void validatedPut(String name, StoragePluginConfig config) throws PluginException; + /** + * Set the plugin to the requested enabled state. Does nothing if the plugin + * is already in the requested state. If a formerly enabled plugin is + * disabled, moves the plugin from the in-memory cache to the ephemeral + * store. If a formerly disabled plugin is enabled, verifies that the plugin + * can be instantiated as for {@link #verifiedPut()}. + * + * Use this method when changing state. Do not obtain the config and change + * the state directly, doing so will make the plugin config inconsistent + * with the internal state. + * + * @param name name of the plugin + * @param enabled {@code true} to enable the plugin, {@code false} to disable + * @throws PluginNotFoundException it the plugin name is not valid Review comment: Can't this exception be thrown is plugin is missing? Why only invalid name? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues
arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues URL: https://github.com/apache/drill/pull/2010#discussion_r391470413 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/sequencefile/SequenceFileFormatConfig.java ## @@ -20,35 +20,42 @@ import com.fasterxml.jackson.annotation.JsonInclude; import com.fasterxml.jackson.annotation.JsonTypeName; import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList; - +import org.apache.drill.common.PlanStringBuilder; import org.apache.drill.common.logical.FormatPluginConfig; import java.util.List; +import java.util.Objects; @JsonTypeName("sequencefile") @JsonInclude(JsonInclude.Include.NON_DEFAULT) public class SequenceFileFormatConfig implements FormatPluginConfig { public List extensions = ImmutableList.of(); - @Override - public int hashCode() { -return (extensions == null)? 0 : extensions.hashCode(); - } - public List getExtensions() { return extensions; } + @Override + public int hashCode() { +return Objects.hash(extensions); + } + @Override public boolean equals(Object obj) { if (this == obj) { return true; -} else if (obj == null) { +} +if (obj == null || getClass() != obj.getClass()) { return false; -} else if (getClass() == obj.getClass()) { - SequenceFileFormatConfig other = (SequenceFileFormatConfig) obj; - return (extensions == null)? (other.extensions == null) : extensions.equals(other.extensions); } -return false; +SequenceFileFormatConfig other = (SequenceFileFormatConfig) obj; +return Objects.equals(extensions, other.extensions); + } + + @Override + public String toString() { +return new PlanStringBuilder(this) +.field("extensions", extensions) +.toString(); } } Review comment: Please add new line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues
arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues URL: https://github.com/apache/drill/pull/2010#discussion_r391470057 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/StoragePluginRegistryImpl.java ## @@ -505,14 +530,19 @@ public StoragePluginConfig copyConfig(StoragePluginConfig orig) { @Override public StoragePluginConfig getDefinedConfig(String name) { +try { + name = validateName(name); +} catch (PluginException e) { Review comment: Should we log it somewhere or just ignore? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues
arina-ielchiieva commented on a change in pull request #2010: DRILL-7620: Fix plugin mutability issues URL: https://github.com/apache/drill/pull/2010#discussion_r391469634 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/StoragePluginRegistryImpl.java ## @@ -233,6 +233,17 @@ private void loadIntrinsicPlugins() throws PluginException { continue; } for (StoragePlugin sysPlugin : intrinsicPlugins) { +// Enforce lower case names. Since the name of a system plugin +// is "hard coded", we can't adjust it if it is not already +// lower case. All we can do is fail to tell the developer that +// something is wrong. +String origName = sysPlugin.getName(); +String lcName = sysPlugin.getName().toLowerCase(); +if (!origName.equals(lcName)) { + throw new IllegalStateException(String.format( + "Plugin names must be lower case. System plugin name is not lower case: %s", Review comment: ```suggestion "Plugin names must be in lower case. System plugin name is not lower case: %s", ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
paul-rogers commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors URL: https://github.com/apache/drill/pull/2018#discussion_r391431754 ## File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java ## @@ -96,11 +134,13 @@ public StructureType structureType() { public boolean isVariant() { return true; } @Override - public boolean isArray() { return type() == MinorType.LIST; } + public boolean isArray() { +return super.isArray() || type() == MinorType.LIST; + } @Override public ColumnMetadata cloneEmpty() { -return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty()); +return new VariantColumnMetadata(name, type, mode, new VariantSchema()); Review comment: Nicely done! We discussed the problem with `UNION` a while back and realized it was a mess. Now I gotta deal with it... The `typeString()` version is used, I presume, in a provided schema. Correct? If so, then then the base class implementation is fine, the output will be just `UNION`. If we think of Drill's `UNION` as like a Visual Basic or PostgreSQL variant (or a Python variable), then the `UNION` type is sufficient: a `UNION` can hold anything; no need to restrict what it might or might not hold. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] paul-rogers commented on issue #2016: DRILL-7631: Updates to the Json Structure Parser
paul-rogers commented on issue #2016: DRILL-7631: Updates to the Json Structure Parser URL: https://github.com/apache/drill/pull/2016#issuecomment-598033124 @vvysotskyi, thanks much for your review! Addressed the review comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services