[jira] [Commented] (DRILL-8136) Overhaul implict type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602355#comment-17602355 ]

ASF GitHub Bot commented on DRILL-8136:
---

cgivre commented on PR #2638:
URL: https://github.com/apache/drill/pull/2638#issuecomment-1241993355

@jnturton Did you see my question about boolean values?

> Overhaul implict type casting logic
> ---
>
> Key: DRILL-8136
> URL: https://issues.apache.org/jira/browse/DRILL-8136
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Esther Buchwalter
> Assignee: James Turton
> Priority: Minor
>
> The existing implicit casting system is built on a simplistic total ordering of
> data types [1] that yields oddities such as TINYINT being regarded as the
> closest numeric type to VARCHAR, or DATE as the closest type to FLOAT8. This, in
> turn, hurts the range of data types with which SQL functions can be used.
> E.g. `select sqrt('3.1415926')` works in many RDBMSes but not in Drill while,
> confusingly, `select '123' + 456` does work in Drill. In addition, the
> limitations of the existing type precedence list mean that it has been
> supplemented with ad hoc secondary casting rules that go in the opposite
> direction.
> This issue proposes a new, more flexible definition of casting distance based
> on a weighted directed graph built over the Drill data types.
>
> [1] https://drill.apache.org/docs/supported-data-types/#implicit-casting-precedence-of-data-types

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8136) Overhaul implict type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602354#comment-17602354 ]

ASF GitHub Bot commented on DRILL-8136:
---

cgivre commented on PR #2638:
URL: https://github.com/apache/drill/pull/2638#issuecomment-1241992856

> I wonder whether ResolverTypePrecedence should cache computed casting costs, or whether that would be premature optimisation.

Maybe make that a separate PR. In theory the whole graph could be pre-computed, right?
[jira] [Commented] (DRILL-8136) Overhaul implict type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602352#comment-17602352 ]

ASF GitHub Bot commented on DRILL-8136:
---

jnturton commented on PR #2638:
URL: https://github.com/apache/drill/pull/2638#issuecomment-1241991578

I wonder whether ResolverTypePrecedence should cache computed casting costs, or whether that would be premature optimisation.
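Such a cache could sit directly in front of the shortest-path computation. A minimal sketch (the class and method names here are hypothetical, not Drill's actual ResolverTypePrecedence API): casting cost as a Dijkstra shortest-path query over a weighted directed graph of type names, memoised per (source, target) pair so each cost is computed at most once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.concurrent.ConcurrentHashMap;

public class CastingCostCache {
  // Adjacency map: edges.get(from).get(to) is the cost of one implicit cast step.
  private final Map<String, Map<String, Double>> edges = new HashMap<>();
  // Memoised (source, target) -> total casting cost, filled lazily.
  private final Map<String, Double> cache = new ConcurrentHashMap<>();

  public void addEdge(String from, String to, double cost) {
    edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, cost);
  }

  public double cost(String from, String to) {
    return cache.computeIfAbsent(from + "->" + to, k -> dijkstra(from, to));
  }

  private double dijkstra(String from, String to) {
    Map<String, Double> dist = new HashMap<>();
    PriorityQueue<Map.Entry<String, Double>> pq =
        new PriorityQueue<>(Map.Entry.comparingByValue());
    dist.put(from, 0.0);
    pq.add(Map.entry(from, 0.0));
    while (!pq.isEmpty()) {
      Map.Entry<String, Double> cur = pq.poll();
      String node = cur.getKey();
      double d = cur.getValue();
      if (d > dist.getOrDefault(node, Double.POSITIVE_INFINITY)) {
        continue; // stale queue entry, a shorter path was already found
      }
      if (node.equals(to)) {
        return d;
      }
      for (Map.Entry<String, Double> e : edges.getOrDefault(node, Map.of()).entrySet()) {
        double nd = d + e.getValue();
        if (nd < dist.getOrDefault(e.getKey(), Double.POSITIVE_INFINITY)) {
          dist.put(e.getKey(), nd);
          pq.add(Map.entry(e.getKey(), nd));
        }
      }
    }
    return Double.POSITIVE_INFINITY; // no implicit cast path exists
  }
}
```

Since the type graph is small and immutable at runtime, pre-computing all pairs up front (as suggested above) would also be cheap; the lazy cache just avoids paying for pairs that never occur in queries.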
[jira] [Commented] (DRILL-8293) Add a docker-compose file to run Drill in cluster mode
[ https://issues.apache.org/jira/browse/DRILL-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602118#comment-17602118 ]

ASF GitHub Bot commented on DRILL-8293:
---

jnturton merged PR #2640:
URL: https://github.com/apache/drill/pull/2640

> Add a docker-compose file to run Drill in cluster mode
> ---
>
> Key: DRILL-8293
> URL: https://issues.apache.org/jira/browse/DRILL-8293
> Project: Apache Drill
> Issue Type: Improvement
> Components: Server
> Affects Versions: 1.20.2
> Reporter: James Turton
> Priority: Minor
> Fix For: 1.20.3
>
> Add a docker-compose file based on the official Docker images but overriding
> the ENTRYPOINT to launch Drill in cluster mode and including a ZooKeeper
> container. This can be used to experiment with cluster mode on a single
> machine, or to run a real cluster on platforms that work with docker-compose,
> like Docker Swarm or ECS.
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602019#comment-17602019 ]

ASF GitHub Bot commented on DRILL-8295:
---

cgivre merged PR #2641:
URL: https://github.com/apache/drill/pull/2641

> Probable resource leak in the HTTP storage plugin
> ---
>
> Key: DRILL-8295
> URL: https://issues.apache.org/jira/browse/DRILL-8295
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - HTTP
> Affects Versions: 1.20.2
> Reporter: James Turton
> Assignee: James Turton
> Priority: Major
> Fix For: 1.20.3
>
> It looks to me like SimpleHttp does not always close objects created using
> OkHttp, e.g. line 378.
[jira] [Commented] (DRILL-8136) Overhaul implict type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601847#comment-17601847 ]

ASF GitHub Bot commented on DRILL-8136:
---

cgivre commented on PR #2638:
URL: https://github.com/apache/drill/pull/2638#issuecomment-1240788876

This is really a MAJOR usability improvement. Will it also be able to cast `"true"` and `"false"` as boolean values? Likewise for:
* True
* TRUE
* TrUe

etc.
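For illustration, one way such a cast could normalise case is sketched below. This is a hypothetical helper, not Drill's actual castBIT implementation; it accepts "true"/"false" in any letter case and rejects other strings rather than silently mapping them to false.

```java
public class BooleanCast {
  // Case-insensitive VARCHAR -> boolean cast sketch. Unlike
  // Boolean.parseBoolean, anything other than "true"/"false" is an error.
  public static boolean castToBoolean(String s) {
    String t = s.trim();
    if (t.equalsIgnoreCase("true")) {
      return true;
    }
    if (t.equalsIgnoreCase("false")) {
      return false;
    }
    throw new IllegalArgumentException("Cannot cast '" + s + "' to boolean");
  }
}
```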
[jira] [Commented] (DRILL-8136) Overhaul implict type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601846#comment-17601846 ]

ASF GitHub Bot commented on DRILL-8136:
---

jnturton commented on PR #2638:
URL: https://github.com/apache/drill/pull/2638#issuecomment-1240784918

> That brings a tear to me eye!

A piece I haven't added is a cast function implementation going from BIT to INT using the normal 0 = false, 1 = true correspondence, to enable little conveniences like taking the sum or average of a boolean. I enjoyed using tricks like that in Impala, IIRC. But the new casting logic here does provide for this; all that's missing is the cast function itself:

```
Error: Missing function implementation: [castINT(BIT-OPTIONAL)]
```
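The convenience being described — aggregating booleans once BIT casts to INT — can be sketched in miniature (hypothetical helper names, not Drill's generated cast function):

```java
import java.util.List;

public class BitToInt {
  // The missing castINT(BIT) correspondence: 0 = false, 1 = true.
  public static int castIntFromBit(boolean b) {
    return b ? 1 : 0;
  }

  // Once the cast exists, SUM/AVG over a boolean column come for free,
  // e.g. the fraction of true values is just the average of the casts.
  public static double avg(List<Boolean> bits) {
    return bits.stream().mapToInt(BitToInt::castIntFromBit).average().orElse(Double.NaN);
  }
}
```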
[jira] [Commented] (DRILL-8136) Overhaul implict type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601843#comment-17601843 ]

ASF GitHub Bot commented on DRILL-8136:
---

cgivre commented on PR #2638:
URL: https://github.com/apache/drill/pull/2638#issuecomment-1240779567

That brings a tear to me eye!
[jira] [Commented] (DRILL-8136) Overhaul implict type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601842#comment-17601842 ]

ASF GitHub Bot commented on DRILL-8136:
---

jnturton commented on PR #2638:
URL: https://github.com/apache/drill/pull/2638#issuecomment-1240777691

> Queries like that will work in MySQL and other RDBMS. In Drill I think they won't fail, but the results are not what people expect. For cases like this, would '2020-01-01' be automatically cast to a date?

Would the same thing happen in situations like...

```
apache drill> select date_diff('2022-09-08', '1970-01-01');
EXPR$0
19243 days 0:00:00
1 row selected (0.157 seconds)

apache drill> select sqrt('5');
EXPR$0
2.23606797749979
1 row selected (0.119 seconds)

apache drill> select substring(current_date, 1, 4);
EXPR$0
2022
1 row selected (0.146 seconds)

apache drill> select now() > '2022-09-08';
EXPR$0
true
```
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601796#comment-17601796 ]

ASF GitHub Bot commented on DRILL-8295:
---

cgivre commented on code in PR #2641:
URL: https://github.com/apache/drill/pull/2641#discussion_r965897296

## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/udfs/HttpHelperFunctions.java:

```
@@ -189,6 +191,8 @@ public void eval() {
   rowWriter.start();
   if (jsonLoader.parser().next()) {
     rowWriter.save();
+} else {
```

Review Comment: That works for me :-)
[jira] [Commented] (DRILL-8301) Standardise on UTF-8 encoding for char to byte (and vice versa) conversions
[ https://issues.apache.org/jira/browse/DRILL-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601758#comment-17601758 ]

ASF GitHub Bot commented on DRILL-8301:
---

pjfanning commented on PR #2637:
URL: https://github.com/apache/drill/pull/2637#issuecomment-1240555174

With Jackson: the JSON spec (https://www.ietf.org/rfc/rfc4627.txt) mandates Unicode, with UTF-8 as the default. XML mandates UTF-8 as the default. In my experience it is quite rare to see other Unicode charsets used. UTF-8 encoding should use fewer bytes for Latin-alphabet text and numeric data. Java strings can now use UTF-16 internally; I'm not sure if there is a performance impact from using UTF-16 instead of UTF-8 (https://www.dariawan.com/tutorials/java/java-9-compact-string-and-string-new-methods/).

My main concern is correctness and testability as opposed to performance. Choosing one encoding for externally facing data and another internally would introduce a lot of extra complexity, and possibly confusion as to which to choose in certain scenarios, and possibly lower performance, as you would often need to convert between the two encodings.

> Standardise on UTF-8 encoding for char to byte (and vice versa) conversions
> ---
>
> Key: DRILL-8301
> URL: https://issues.apache.org/jira/browse/DRILL-8301
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: PJ Fanning
> Priority: Major
>
> Lots of Drill code uses UTF-8 explicitly. Lots more Drill code does not set
> an explicit encoding, which means it relies on the JVM default (which differs
> by JVM install).
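The byte-count difference mentioned above is easy to demonstrate with the standard library (a small illustration, not Drill code). Note that Java's "UTF-16" charset writes a 2-byte byte-order mark when encoding, plus 2 bytes per BMP character, while UTF-8 uses a single byte for each ASCII character.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
  // One byte per ASCII char in UTF-8.
  public static int utf8Len(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // Two bytes per BMP char in UTF-16, plus a 2-byte BOM from this charset.
  public static int utf16Len(String s) {
    return s.getBytes(StandardCharsets.UTF_16).length;
  }
}
```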
[jira] [Commented] (DRILL-8301) Standardise on UTF-8 encoding for char to byte (and vice versa) conversions
[ https://issues.apache.org/jira/browse/DRILL-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601754#comment-17601754 ]

ASF GitHub Bot commented on DRILL-8301:
---

jnturton commented on PR #2637:
URL: https://github.com/apache/drill/pull/2637#issuecomment-1240540040

I guess there are two different classes of character data.

1. Internal-use character data, where we can use whatever encoding we like and perhaps would choose based on performance (would that suggest UTF-16?).
2. Interchange character data that we share with the outside world, e.g. a JSON file that Drill wants to query. It feels like it would be nice if we can accept different encodings here. I wonder what Jackson and friends do w.r.t. character encodings.
[jira] [Commented] (DRILL-8302) tidy up some char conversions
[ https://issues.apache.org/jira/browse/DRILL-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601749#comment-17601749 ]

ASF GitHub Bot commented on DRILL-8302:
---

pjfanning opened a new pull request, #2645:
URL: https://github.com/apache/drill/pull/2645

## Description
Code tidy-up.

## Documentation
(Please describe user-visible changes similar to what should appear in the Drill documentation.)

## Testing
(Please describe how this PR has been tested.)

> tidy up some char conversions
> ---
>
> Key: DRILL-8302
> URL: https://issues.apache.org/jira/browse/DRILL-8302
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: PJ Fanning
> Priority: Major
>
> As part of DRILL-8301, I spotted code that could be tidied up. The aim of
> this issue is to reduce the size of DRILL-8301 without introducing changes to
> the char encodings.
> * uses of a pattern like `new String("")` - IntelliJ and other tools
> highlight this as unnecessary
> * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use
> `new String(bytes, StandardCharsets.UTF_8)`
> * use Base64 `encodeToString` instead of cases where we encode to bytes and
> then do our own encoding of those bytes to a String
> * replace existing code using `Charset.forName("UTF-8")` with
> `StandardCharsets.UTF_8`
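The middle two bullet points can be sketched as the "after" patterns the tidy-up moves to (illustrative helper names, not code from the PR). Passing a `Charset` object instead of its name avoids the checked `UnsupportedEncodingException`, and `Base64.getEncoder().encodeToString` goes straight from bytes to a `String` with no intermediate step.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TidyCharConversions {
  // Preferred over new String(bytes, StandardCharsets.UTF_8.name()):
  // no UnsupportedEncodingException to catch or declare.
  public static String bytesToString(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Preferred over encoding to bytes and hand-rolling the byte -> String step.
  public static String toBase64(byte[] bytes) {
    return Base64.getEncoder().encodeToString(bytes);
  }
}
```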
[jira] [Commented] (DRILL-8117) Clean up deprecated Apache code in Drill
[ https://issues.apache.org/jira/browse/DRILL-8117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601733#comment-17601733 ]

ASF GitHub Bot commented on DRILL-8117:
---

jnturton commented on code in PR #2499:
URL: https://github.com/apache/drill/pull/2499#discussion_r965700029

## exec/java-exec/src/test/java/org/apache/drill/exec/impersonation/TestInboundImpersonation.java:

```
@@ -156,22 +159,25 @@ public void unauthorizedTarget() throws Exception {
   @Test
   public void invalidPolicy() throws Exception {
-    thrownException.expect(new UserExceptionMatcher(UserBitShared.DrillPBError.ErrorType.VALIDATION,
-        "Invalid impersonation policies."));
+    String query = "ALTER SYSTEM SET `%s`='%s'";
```

Review Comment: Did you try the following here?

```
client.alterSystem(...);
try {
  // run test
} finally {
  client.resetSystem(...);
}
```

> Clean up deprecated Apache code in Drill
> ---
>
> Key: DRILL-8117
> URL: https://issues.apache.org/jira/browse/DRILL-8117
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.20.1
> Reporter: Jingchuan Hu
> Priority: Major
> Fix For: 2.0.0
>
> Clean up and upgrade deprecated Apache code, e.g. the class PathChildrenCache
> used in ZookeeperClient, and the class StringEscapeUtils used in
> PlanStringBuilder.
[jira] [Commented] (DRILL-8117) Clean up deprecated Apache code in Drill
[ https://issues.apache.org/jira/browse/DRILL-8117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601732#comment-17601732 ]

ASF GitHub Bot commented on DRILL-8117:
---

jnturton commented on PR #2499:
URL: https://github.com/apache/drill/pull/2499#issuecomment-1240441921

Hi @kingswanwho, everything here looks good to me; let's just see if we can replace the `ALTER SYSTEM`s with `client.alterSystem`s.
[jira] [Commented] (DRILL-8300) Upgrade to snakeyaml 1.31 due to cve
[ https://issues.apache.org/jira/browse/DRILL-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601709#comment-17601709 ]

ASF GitHub Bot commented on DRILL-8300:
---

jnturton merged PR #2643:
URL: https://github.com/apache/drill/pull/2643

> Upgrade to snakeyaml 1.31 due to cve
> ---
>
> Key: DRILL-8300
> URL: https://issues.apache.org/jira/browse/DRILL-8300
> Project: Apache Drill
> Issue Type: Bug
> Reporter: PJ Fanning
> Priority: Major
>
> https://github.com/advisories/GHSA-3mc7-4q67-w48m
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601636#comment-17601636 ]

ASF GitHub Bot commented on DRILL-8295:
---

jnturton commented on code in PR #2641:
URL: https://github.com/apache/drill/pull/2641#discussion_r965517773

## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/udfs/HttpHelperFunctions.java:

```
@@ -189,6 +191,8 @@ public void eval() {
   rowWriter.start();
   if (jsonLoader.parser().next()) {
     rowWriter.save();
+} else {
```

Review Comment: I fed the http_get function a string containing 50 million little JSON objects from the sequence {"foo": 1} {"foo": 2} {"foo": 3}... and it got through it (it took about 45s). I just don't know if that answers the right question.
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601630#comment-17601630 ]

ASF GitHub Bot commented on DRILL-8283:
---

jnturton merged PR #2632:
URL: https://github.com/apache/drill/pull/2632

> Add a configurable recursive file listing size limit
> ---
>
> Key: DRILL-8283
> URL: https://issues.apache.org/jira/browse/DRILL-8283
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.20.2
> Reporter: James Turton
> Assignee: James Turton
> Priority: Minor
> Fix For: 1.20.3
>
> Currently a malicious or merely unwitting user can crash their Drill foreman
> by sending
> {code:java}
> select * from dfs.huge_workspace limit 10
> {code}
> causing the query planner to recurse over every file in huge_workspace and
> culminating in
> {code:java}
> 2022-08-09 15:13:22,251 [1d0da29f-e50c-fd51-43d9-8a5086d52c4e:foreman] ERROR
> o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred,
> exiting. Information message: Unable to handle out of memory condition in
> Foreman. java.lang.OutOfMemoryError: null
> {code}
> if there are enough files in huge_workspace. A SHOW FILES command can produce
> the same effect. This issue proposes a new BOOT option named
> drill.exec.storage.file.recursive_listing_max_size with a default value of,
> say, 10 000. If a file listing task exceeds this limit then the initiating
> operation is terminated with a UserException, preventing runaway resource
> usage.
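The proposed guard can be sketched in miniature (hypothetical names; the real control is the BOOT option described above, and Drill raises a UserException rather than this plain exception): count entries as they stream past and abort the listing as soon as the configured maximum is exceeded, instead of accumulating an unbounded list on the foreman's heap.

```java
import java.util.List;

public class FileListingLimit {
  // Walks a stream of file names, failing fast once more than maxSize
  // entries have been seen, so memory use stays bounded by the limit.
  public static int listWithLimit(Iterable<String> files, int maxSize) {
    int count = 0;
    for (String ignored : files) {
      if (++count > maxSize) {
        throw new IllegalStateException(
            "File listing exceeded the configured limit of " + maxSize + " files");
      }
    }
    return count;
  }
}
```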
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601341#comment-17601341 ]

ASF GitHub Bot commented on DRILL-8295:
---

jnturton commented on code in PR #2641:
URL: https://github.com/apache/drill/pull/2641#discussion_r964935897

## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/udfs/HttpHelperFunctions.java:

```
@@ -189,6 +191,8 @@ public void eval() {
   rowWriter.start();
   if (jsonLoader.parser().next()) {
     rowWriter.save();
+} else {
```

Review Comment: @cgivre yes, you're right. I tried a couple of things. First I provided http_get with a JSON response that would normally produce 64k+1 rows if queried, but it looked to me like it was being handled in a single batch since, I guess, the row count of a query based on VALUES(1) is still 1. I then wrote a query to `SELECT http_get(some simple JSON)` from a mock table containing 64k+1 rows. This overwhelms the okhttp3 mock server and fails with a timeout. I'm not sure if there is some other test to try here?
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601288#comment-17601288 ]

ASF GitHub Bot commented on DRILL-8295:
---

cgivre commented on code in PR #2641:
URL: https://github.com/apache/drill/pull/2641#discussion_r964788631

## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/udfs/HttpHelperFunctions.java:

```
@@ -189,6 +191,8 @@ public void eval() {
   rowWriter.start();
   if (jsonLoader.parser().next()) {
     rowWriter.save();
+} else {
```

Review Comment: From my recollection, this function does handle multiple batches. It was `convert_fromJSON` that @vdiravka was working on.
[jira] [Commented] (DRILL-8300) Upgrade to snakeyaml 1.31 due to cve
[ https://issues.apache.org/jira/browse/DRILL-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601286#comment-17601286 ]

ASF GitHub Bot commented on DRILL-8300:
---

pjfanning opened a new pull request, #2643:
URL: https://github.com/apache/drill/pull/2643

## Description
Snakeyaml has a CVE.
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601275#comment-17601275 ]

ASF GitHub Bot commented on DRILL-8295:
---

jnturton commented on code in PR #2641:
URL: https://github.com/apache/drill/pull/2641#discussion_r964747549

## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/udfs/HttpHelperFunctions.java:

```
@@ -189,6 +191,8 @@ public void eval() {
   rowWriter.start();
   if (jsonLoader.parser().next()) {
     rowWriter.save();
+} else {
```

Review Comment: @cgivre
1. The JsonLoader closes the input streams it's been working off of when it is closed, so I don't think so.
2. Multiple-batch datasets do not work with these UDFs yet, from what I recall? I think @vdiravka continues to work on that; perhaps he can comment on the closing of the JsonLoader here.
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601259#comment-17601259 ] ASF GitHub Bot commented on DRILL-8295: --- cgivre commented on code in PR #2641: URL: https://github.com/apache/drill/pull/2641#discussion_r964729134 ## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/udfs/HttpHelperFunctions.java: ## @@ -189,6 +191,8 @@ public void eval() { rowWriter.start(); if (jsonLoader.parser().next()) { rowWriter.save(); +} else { Review Comment: Should we explicitly close the `results` `InputStream` here as well? Would mind testing this on a query that produces multiple batches? > Probable resource leak in the HTTP storage plugin > - > > Key: DRILL-8295 > URL: https://issues.apache.org/jira/browse/DRILL-8295 > Project: Apache Drill > Issue Type: Bug > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Major > Fix For: 1.20.3 > > > It looks to me like SimpleHttp does not always close objects created using > OkHttp, e.g. line 378. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601245#comment-17601245 ] ASF GitHub Bot commented on DRILL-8295: --- jnturton commented on PR #2641: URL: https://github.com/apache/drill/pull/2641#issuecomment-1239237242 > @jnturton there are also similar issues in > > * org.apache.drill.exec.store.http.util.SimpleHttp > > * org.apache.drill.exec.store.http.oauth.OAuthUtils @pjfanning thanks I picked up a couple of extra instances. > Probable resource leak in the HTTP storage plugin > - > > Key: DRILL-8295 > URL: https://issues.apache.org/jira/browse/DRILL-8295 > Project: Apache Drill > Issue Type: Bug > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Major > Fix For: 1.20.3 > > > It looks to me like SimpleHttp does not always close objects created using > OkHttp, e.g. line 378. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601236#comment-17601236 ] ASF GitHub Bot commented on DRILL-8295: --- pjfanning commented on PR #2641: URL: https://github.com/apache/drill/pull/2641#issuecomment-1239212973 @jnturton there are also similar issues in * org.apache.drill.exec.store.http.util.SimpleHttp * org.apache.drill.exec.store.http.oauth.OAuthUtils > Probable resource leak in the HTTP storage plugin > - > > Key: DRILL-8295 > URL: https://issues.apache.org/jira/browse/DRILL-8295 > Project: Apache Drill > Issue Type: Bug > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Major > Fix For: 1.20.3 > > > It looks to me like SimpleHttp does not always close objects created using > OkHttp, e.g. line 378. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8295) Probable resource leak in the HTTP storage plugin
[ https://issues.apache.org/jira/browse/DRILL-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601204#comment-17601204 ] ASF GitHub Bot commented on DRILL-8295: --- jnturton opened a new pull request, #2641: URL: https://github.com/apache/drill/pull/2641 # [DRILL-8295](https://issues.apache.org/jira/browse/DRILL-8295): Probable resource leak in the HTTP storage plugin ## Description Adds close() calls in a number of places where HTTP requests are made in the HTTP storage plugin. ## Documentation N/A ## Testing Existing unit tests. > Probable resource leak in the HTTP storage plugin > - > > Key: DRILL-8295 > URL: https://issues.apache.org/jira/browse/DRILL-8295 > Project: Apache Drill > Issue Type: Bug > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Major > Fix For: 1.20.3 > > > It looks to me like SimpleHttp does not always close objects created using > OkHttp, e.g. line 378. -- This message was sent by Atlassian Jira (v8.20.10#820010)
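The close() pattern this PR adds can be illustrated with try-with-resources: OkHttp's Response is Closeable, and try-with-resources guarantees close() runs on every code path. The sketch below substitutes a hypothetical TrackedResource class for the OkHttp types so it runs standalone; it is an illustration of the fix pattern, not the PR's actual code.

```java
// Sketch: closing a response on both the success and error paths.
// TrackedResource is a stand-in for an OkHttp Response (hypothetical class).
public class CloseDemo {
  static class TrackedResource implements AutoCloseable {
    boolean closed = false;
    String body() { return "{}"; }
    @Override public void close() { closed = true; }
  }

  // Returns the resource so a caller can verify close() ran even on failure.
  public static TrackedResource readBody(boolean fail) {
    TrackedResource r = new TrackedResource();
    try (TrackedResource resource = r) {
      if (fail) {
        throw new IllegalStateException("HTTP error");
      }
      resource.body();
    } catch (IllegalStateException ignored) {
      // error path: try-with-resources has already closed the resource
    }
    return r;
  }
}
```

Without the try-with-resources block, the error path would leave the response (and its underlying connection) open, which is the leak DRILL-8295 describes.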
[jira] [Commented] (DRILL-8136) Overhaul implicit type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600953#comment-17600953 ] ASF GitHub Bot commented on DRILL-8136: --- jnturton commented on PR #2638: URL: https://github.com/apache/drill/pull/2638#issuecomment-1238561477 > @jnturton Do you think this should be included in the backport to stable? My own thought is probably not since it changes the function matching process and there isn't any clear bug that it fixes. > Overhaul implict type casting logic > --- > > Key: DRILL-8136 > URL: https://issues.apache.org/jira/browse/DRILL-8136 > Project: Apache Drill > Issue Type: Improvement >Reporter: Esther Buchwalter >Assignee: James Turton >Priority: Minor > > The existing implicit casting system is built on simplistic total ordering of > data types[1] that yields oddities such as TINYINT being regarded as the > closest numeric type to VARCHAR or DATE the closest type to FLOAT8. This, in > turn, hurts the range of data types with which SQL functions can be used. > E.g. `select sqrt('3.1415926')` works in many RDBMSes but not in Drill while, > confusingly, `select '123' + 456` does work in Drill. In addition the > limitations of the existing type precedence list mean that it has been > supplmented with ad hoc secondary casting rules that go in the opposite > direction. > This Issue proposes a new, more flexible definition of casting distance based > on a weighted directed graph built over the Drill data types. > [1] > [https://drill.apache.org/docs/supported-data-types/#implicit-casting-precedence-of-data-types] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8136) Overhaul implicit type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600951#comment-17600951 ] ASF GitHub Bot commented on DRILL-8136: --- cgivre commented on PR #2638: URL: https://github.com/apache/drill/pull/2638#issuecomment-1238556786 @jnturton Do you think this should be included in the backport to stable? > Overhaul implict type casting logic > --- > > Key: DRILL-8136 > URL: https://issues.apache.org/jira/browse/DRILL-8136 > Project: Apache Drill > Issue Type: Improvement >Reporter: Esther Buchwalter >Assignee: James Turton >Priority: Minor > > The existing implicit casting system is built on simplistic total ordering of > data types[1] that yields oddities such as TINYINT being regarded as the > closest numeric type to VARCHAR or DATE the closest type to FLOAT8. This, in > turn, hurts the range of data types with which SQL functions can be used. > E.g. `select sqrt('3.1415926')` works in many RDBMSes but not in Drill while, > confusingly, `select '123' + 456` does work in Drill. In addition the > limitations of the existing type precedence list mean that it has been > supplmented with ad hoc secondary casting rules that go in the opposite > direction. > This Issue proposes a new, more flexible definition of casting distance based > on a weighted directed graph built over the Drill data types. > [1] > [https://drill.apache.org/docs/supported-data-types/#implicit-casting-precedence-of-data-types] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8293) Add a docker-compose file to run Drill in cluster mode
[ https://issues.apache.org/jira/browse/DRILL-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600726#comment-17600726 ] ASF GitHub Bot commented on DRILL-8293: --- jnturton opened a new pull request, #2640: URL: https://github.com/apache/drill/pull/2640 # [DRILL-8293](https://issues.apache.org/jira/browse/DRILL-8293): Add a docker-compose file to run Drill in cluster mode ## Description This directory contains source code artifacts to launch Drill in cluster mode along with a ZooKeeper. The Drill image is based on a minor customisation of the official Drill image that switches it from an embedded to a cluster mode launch. Logging is redirected to stdout. In the docker-cluster-mode directory: 1. docker build -t apache/drill-cluster-mode 2. docker-compose up Then access the web UI at http://localhost:8047 or connect a JDBC client to jdbc:drill:drillbit=localhost or jdbc:drill:zk=localhost but note that you will need to make the drillbit container hostnames resolvable from the host to use a ZooKeeper JDBC URL. To launch a cluster of 3 Drillbits 3. docker-compose up --scale drillbit=3 but first note that to use docker-compose's "scale" feature to run multiple Drillbit containers on a single host you will need to remove the host port mappings from the compose file to prevent collisions (see the comments on the relevant lines in that file). Once the Drillbits are launched run `docker-compose ps` to list the ephemeral ports that have been allocated on the host. ## Documentation Add the above discussion to the Drill in Docker doc page. ## Testing Launch Drill using the provided commands and run queries. 
> Add a docker-compose file to run Drill in cluster mode > -- > > Key: DRILL-8293 > URL: https://issues.apache.org/jira/browse/DRILL-8293 > Project: Apache Drill > Issue Type: Improvement > Components: Server >Affects Versions: 1.20.2 >Reporter: James Turton >Priority: Minor > Fix For: 1.20.3 > > > Add a docker-compose file based on the official Docker images but overriding > the ENTRYPOINT to launch Drill in cluster mode and including a ZooKeeper > container. This can be used to experiment with cluster mode on a single > machine, or to run a real cluster on platforms that work with docker-compose > like Docker Swarm or ECS. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8291) Allow case sensitive Filters in HTTP Plugin
[ https://issues.apache.org/jira/browse/DRILL-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600342#comment-17600342 ] ASF GitHub Bot commented on DRILL-8291: --- cgivre merged PR #2639: URL: https://github.com/apache/drill/pull/2639 > Allow case sensitive Filters in HTTP Plugin > --- > > Key: DRILL-8291 > URL: https://issues.apache.org/jira/browse/DRILL-8291 > Project: Apache Drill > Issue Type: Bug > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 1.20.3 > > > Some APIs will reject filter pushdowns if they are not in the correct case. > This PR adds a config option `caseSensitiveFilters` to the API config and > when set to true, preserves the case of the filters pushed down. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8275) Prevent the JDBC Client from creating spurious paths in Zookeeper
[ https://issues.apache.org/jira/browse/DRILL-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600211#comment-17600211 ] ASF GitHub Bot commented on DRILL-8275: --- jnturton merged PR #2617: URL: https://github.com/apache/drill/pull/2617 > Prevent the JDBC Client from creating spurious paths in Zookeeper > - > > Key: DRILL-8275 > URL: https://issues.apache.org/jira/browse/DRILL-8275 > Project: Apache Drill > Issue Type: Improvement > Components: Client - JDBC >Reporter: Cong Luo >Assignee: Cong Luo >Priority: Major > Fix For: 2.0.0 > > > Use the ZK style on the connection string and the zkRoot does not match the > actual path of the cluster, then the client always creates a spurious path > (as a permanent) in the Zookeeper. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8275) Prevent the JDBC Client from creating error paths in Zookeeper
[ https://issues.apache.org/jira/browse/DRILL-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600210#comment-17600210 ] ASF GitHub Bot commented on DRILL-8275: --- jnturton commented on PR #2617: URL: https://github.com/apache/drill/pull/2617#issuecomment-1236561756 > @jnturton Do you think we should backport this PR to stable? @luocooong Do you have any opinion on that? I'm not sure if this qualifies as a bug fix or an improvement that should wait for Drill 2.0. This fix looks good for stable to me +1. > Prevent the JDBC Client to create error path in Zookeeper > - > > Key: DRILL-8275 > URL: https://issues.apache.org/jira/browse/DRILL-8275 > Project: Apache Drill > Issue Type: Improvement > Components: Client - JDBC >Reporter: Cong Luo >Assignee: Cong Luo >Priority: Major > Fix For: 2.0.0 > > > Use the ZK style on the connection string and the zkRoot does not match the > actual path of the cluster, then the client always creates the error path (as > a permanent) in the Zookeeper. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8275) Prevent the JDBC Client from creating error paths in Zookeeper
[ https://issues.apache.org/jira/browse/DRILL-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600067#comment-17600067 ] ASF GitHub Bot commented on DRILL-8275: --- cgivre commented on PR #2617: URL: https://github.com/apache/drill/pull/2617#issuecomment-1236355392 @jnturton Do you think we should backport this PR to stable? @luocooong Do you have any opinion on that? I'm not sure if this qualifies as a bug fix or an improvement that should wait for Drill 2.0. > Prevent the JDBC Client to create error path in Zookeeper > - > > Key: DRILL-8275 > URL: https://issues.apache.org/jira/browse/DRILL-8275 > Project: Apache Drill > Issue Type: Improvement > Components: Client - JDBC >Reporter: Cong Luo >Assignee: Cong Luo >Priority: Major > Fix For: 2.0.0 > > > Use the ZK style on the connection string and the zkRoot does not match the > actual path of the cluster, then the client always creates the error path (as > a permanent) in the Zookeeper. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8136) Overhaul implicit type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600062#comment-17600062 ] ASF GitHub Bot commented on DRILL-8136: --- cgivre commented on PR #2638: URL: https://github.com/apache/drill/pull/2638#issuecomment-1236342717 @jnturton Thanks for this. IMHO this will be a MAJOR improvement in usability. I have a question about date conversions. Let's say we have a query like this: ```sql SELECT... FROM ... WHERE dateField > '2020-01-01' ``` Queries like that will work in MySQL and other RDBMS. In Drill I think they won't fail, but the results are not what people expect. For cases like this, would `'2020-01-01'` be automatically cast to a date? Would the same thing happen in situations like: ``` DATE_DIFF('2020-01-01', '2021-01-01') ``` > Overhaul implict type casting logic > --- > > Key: DRILL-8136 > URL: https://issues.apache.org/jira/browse/DRILL-8136 > Project: Apache Drill > Issue Type: Improvement >Reporter: Esther Buchwalter >Assignee: James Turton >Priority: Minor > > The existing implicit casting system is built on simplistic total ordering of > data types[1] that yields oddities such as TINYINT being regarded as the > closest numeric type to VARCHAR or DATE the closest type to FLOAT8. This, in > turn, hurts the range of data types with which SQL functions can be used. > E.g. `select sqrt('3.1415926')` works in many RDBMSes but not in Drill while, > confusingly, `select '123' + 456` does work in Drill. In addition the > limitations of the existing type precedence list mean that it has been > supplmented with ad hoc secondary casting rules that go in the opposite > direction. > This Issue proposes a new, more flexible definition of casting distance based > on a weighted directed graph built over the Drill data types. > [1] > [https://drill.apache.org/docs/supported-data-types/#implicit-casting-precedence-of-data-types] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8291) Allow case sensitive Filters in HTTP Plugin
[ https://issues.apache.org/jira/browse/DRILL-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600010#comment-17600010 ] ASF GitHub Bot commented on DRILL-8291: --- cgivre opened a new pull request, #2639: URL: https://github.com/apache/drill/pull/2639 # [DRILL-8291](https://issues.apache.org/jira/browse/DRILL-8291): PR Title ## Description Some APIs will reject filter pushdowns if they are not in the correct case. This PR adds a config option `caseSensitiveFilters` to the API config and when set to `true`, preserves the case of the filters pushed down. ## Documentation See above. ## Testing Manually tested > Allow case sensitive Filters in HTTP Plugin > --- > > Key: DRILL-8291 > URL: https://issues.apache.org/jira/browse/DRILL-8291 > Project: Apache Drill > Issue Type: Bug > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 1.20.3 > > > Some APIs will reject filter pushdowns if they are not in the correct case. > This PR adds a config option `caseSensitiveFilters` to the API config and > when set to true, preserves the case of the filters pushed down. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8136) Overhaul implicit type casting logic
[ https://issues.apache.org/jira/browse/DRILL-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599848#comment-17599848 ] ASF GitHub Bot commented on DRILL-8136: --- jnturton opened a new pull request, #2638: URL: https://github.com/apache/drill/pull/2638 # [DRILL-8136](https://issues.apache.org/jira/browse/DRILL-8136): Overhaul implicit type casting logic ## Description The existing implicit casting system is built on a simplistic total ordering of data types[1] that yields oddities such as TINYINT being regarded as the closest numeric type to VARCHAR, or DATE the closest type to FLOAT8. This, in turn, hurts the range of data types with which SQL functions can be used. E.g. `select sqrt('3.1415926')` works in many RDBMSes but not in Drill while, confusingly, `select '123' + 456` does work in Drill. In addition, the limitations of the existing type precedence list mean that it has been supplemented with ad hoc secondary casting rules that go in the opposite direction. This PR introduces a new, more flexible definition of casting distance based on a weighted directed graph built over the Drill data types. ## Documentation Update [the description of implicit casting precedence](https://drill.apache.org/docs/supported-data-types/#implicit-casting-precedence-of-data-types). ## Testing Existing implicit cast unit tests plus new additions. > Overhaul implicit type casting logic > --- > > Key: DRILL-8136 > URL: https://issues.apache.org/jira/browse/DRILL-8136 > Project: Apache Drill > Issue Type: Improvement >Reporter: Esther Buchwalter >Assignee: James Turton >Priority: Minor > > The existing implicit casting system is built on a simplistic total ordering of > data types[1] that yields oddities such as TINYINT being regarded as the > closest numeric type to VARCHAR or DATE the closest type to FLOAT8. This, in > turn, hurts the range of data types with which SQL functions can be used. > E.g. 
`select sqrt('3.1415926')` works in many RDBMSes but not in Drill while, > confusingly, `select '123' + 456` does work in Drill. In addition, the > limitations of the existing type precedence list mean that it has been > supplemented with ad hoc secondary casting rules that go in the opposite > direction. > This Issue proposes a new, more flexible definition of casting distance based > on a weighted directed graph built over the Drill data types. > [1] > [https://drill.apache.org/docs/supported-data-types/#implicit-casting-precedence-of-data-types] -- This message was sent by Atlassian Jira (v8.20.10#820010)
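The casting-distance idea described in the PR — the cheapest path through a weighted directed graph of data types — can be sketched with Dijkstra's algorithm. The type names and edge weights below are illustrative only, not Drill's actual cost table or ResolverTypePrecedence API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch: implicit casting distance as a shortest path over a weighted
// directed graph of data types (names and weights are made up here).
public class CastingDistance {
  private final Map<String, Map<String, Float>> edges = new HashMap<>();

  public void addEdge(String from, String to, float cost) {
    edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, cost);
  }

  // Dijkstra with lazy deletion: cheapest cumulative cast cost, or +inf
  // when no implicit cast path exists (cf. the directed nature of the graph).
  public float distance(String from, String to) {
    Map<String, Float> dist = new HashMap<>();
    PriorityQueue<Object[]> pq =
        new PriorityQueue<>((a, b) -> Float.compare((float) a[1], (float) b[1]));
    dist.put(from, 0f);
    pq.add(new Object[]{from, 0f});
    while (!pq.isEmpty()) {
      Object[] cur = pq.poll();
      String u = (String) cur[0];
      float d = (float) cur[1];
      if (d > dist.getOrDefault(u, Float.POSITIVE_INFINITY)) continue; // stale entry
      if (u.equals(to)) return d;
      for (Map.Entry<String, Float> e : edges.getOrDefault(u, Map.of()).entrySet()) {
        float nd = d + e.getValue();
        if (nd < dist.getOrDefault(e.getKey(), Float.POSITIVE_INFINITY)) {
          dist.put(e.getKey(), nd);
          pq.add(new Object[]{e.getKey(), nd});
        }
      }
    }
    return Float.POSITIVE_INFINITY;
  }
}
```

Because edges are directed, a cheap VARCHAR→FLOAT8 path can coexist with an expensive (or absent) FLOAT8→VARCHAR path, which is what lets one graph replace both the precedence list and the ad hoc secondary rules.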
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599125#comment-17599125 ] ASF GitHub Bot commented on DRILL-8289: --- cgivre merged PR #2634: URL: https://github.com/apache/drill/pull/2634 > Add Threat Hunting Functions > > > Key: DRILL-8289 > URL: https://issues.apache.org/jira/browse/DRILL-8289 > Project: Apache Drill > Issue Type: New Feature > Components: Functions - Drill >Affects Versions: 2.0.0 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 2.0.0 > > > # Threat Hunting Functions > These functions are useful for doing threat hunting with Apache Drill. These > were inspired by huntlib.[1] > The functions are: > * `punctuation_pattern()`: Extracts the pattern of punctuation in > text. > * `entropy()`: This function calculates the Shannon Entropy of a > given string of text. > * `entropyPerByte()`: This function calculates the Shannon Entropy of > a given string of text, normed for the string length. > [1]: https://github.com/target/huntlib -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599123#comment-17599123 ] ASF GitHub Bot commented on DRILL-8289: --- pjfanning commented on PR #2634: URL: https://github.com/apache/drill/pull/2634#issuecomment-1234687747 @cgivre lgtm > Add Threat Hunting Functions > > > Key: DRILL-8289 > URL: https://issues.apache.org/jira/browse/DRILL-8289 > Project: Apache Drill > Issue Type: New Feature > Components: Functions - Drill >Affects Versions: 2.0.0 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 2.0.0 > > > # Threat Hunting Functions > These functions are useful for doing threat hunting with Apache Drill. These > were inspired by huntlib.[1] > The functions are: > * `punctuation_pattern()`: Extracts the pattern of punctuation in > text. > * `entropy()`: This function calculates the Shannon Entropy of a > given string of text. > * `entropyPerByte()`: This function calculates the Shannon Entropy of > a given string of text, normed for the string length. > [1]: https://github.com/target/huntlib -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599121#comment-17599121 ] ASF GitHub Bot commented on DRILL-8289: --- cgivre commented on PR #2634: URL: https://github.com/apache/drill/pull/2634#issuecomment-1234683441 @pjfanning are we ready to merge this? > Add Threat Hunting Functions > > > Key: DRILL-8289 > URL: https://issues.apache.org/jira/browse/DRILL-8289 > Project: Apache Drill > Issue Type: New Feature > Components: Functions - Drill >Affects Versions: 2.0.0 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 2.0.0 > > > # Threat Hunting Functions > These functions are useful for doing threat hunting with Apache Drill. These > were inspired by huntlib.[1] > The functions are: > * `punctuation_pattern()`: Extracts the pattern of punctuation in > text. > * `entropy()`: This function calculates the Shannon Entropy of a > given string of text. > * `entropyPerByte()`: This function calculates the Shannon Entropy of > a given string of text, normed for the string length. > [1]: https://github.com/target/huntlib -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8290) Short cut recursive file listings for LIMIT 0 queries
[ https://issues.apache.org/jira/browse/DRILL-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598995#comment-17598995 ] ASF GitHub Bot commented on DRILL-8290: --- jnturton commented on PR #2636: URL: https://github.com/apache/drill/pull/2636#issuecomment-1234361083 @vvysotskyi I did spot one [other recursive file listing](https://github.com/jnturton/drill/blob/65fb7ddc144ecae5330c9325af63010748f74cdf/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/metadata/Metadata.java#L376) that could possibly use the short cut in this PR if I propagate a `limit0` flag down to it. It appears to be invoked only if there are Parquet files present at the top level of the queried path which I don't think should be too common for big datasets since data files are generally only present at the leaves of the directory tree. So I thought I'd ask if you think it's worth trying to implement the single file short cut here too, or we just leave it alone? > Short cut recursive file listings for LIMIT 0 queries > - > > Key: DRILL-8290 > URL: https://issues.apache.org/jira/browse/DRILL-8290 > Project: Apache Drill > Issue Type: Improvement > Components: Query Planning Optimization >Affects Versions: 1.20.2 >Reporter: James Turton >Priority: Minor > Fix For: 2.0.0 > > > The existing LIMIT 0 query optimisations do not prevent a query run against > the top of a deep DFS directory tree from recursively listing FileStatuses > for everything within it using a pool of worker threads. This Issue proposes > a new optimisation whereby such queries will recurse into the directory tree > on a single thread that returns as soon as any single FileStatus has been > obtained. -- This message was sent by Atlassian Jira (v8.20.10#820010)
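The proposed LIMIT 0 short cut — a single-threaded recursion that returns as soon as one file is found — can be sketched over an in-memory directory tree. The Map-based tree and method names here are illustrative, not Drill's FileSystemUtil API:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch: stop listing at the first file, for LIMIT 0 queries where only
// a single sample is needed. Tree model: a name that appears as a map key
// is a directory; any other name is a file.
public class FirstFileShortCut {
  public static Optional<String> firstFile(Map<String, List<String>> tree, String dir) {
    for (String child : tree.getOrDefault(dir, List.of())) {
      if (tree.containsKey(child)) {          // subdirectory: recurse into it
        Optional<String> found = firstFile(tree, child);
        if (found.isPresent()) {
          return found;                       // short-circuit all remaining listing work
        }
      } else {
        return Optional.of(child);            // a file: done immediately
      }
    }
    return Optional.empty();
  }
}
```

The contrast with the existing behaviour is the return type: one status (or none) obtained on the calling thread, instead of every FileStatus in the subtree gathered by a worker pool.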
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598896#comment-17598896 ] ASF GitHub Bot commented on DRILL-8283: --- jnturton commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r960531148 ## exec/java-exec/src/main/java/org/apache/drill/exec/util/FileSystemUtil.java: ## @@ -42,6 +47,12 @@ public class FileSystemUtil { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemUtil.class); + private static int recursiveListingMaxSize; + + static { +recursiveListingMaxSize = DrillConfig.create().getInt(ExecConstants.RECURSIVE_FILE_LISTING_MAX_SIZE); + } Review Comment: It's gone now. > Add a configurable recursive file listing size limit > > > Key: DRILL-8283 > URL: https://issues.apache.org/jira/browse/DRILL-8283 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Other >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > > Currently a malicious or merely unwitting user can crash their Drill foreman > by sending > {code:java} > select * from dfs.huge_workspace limit 10 > {code} > causing the query planner to recurse over every file in huge_workspace and > culminating in > {code:java} > 2022-08-09 15:13:22,251 [1d0da29f-e50c-fd51-43d9-8a5086d52c4e:foreman] ERROR > o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, > exiting. Information message: Unable to handle out of memory condition in > Foreman.java.lang.OutOfMemoryError: null {code} > if there are enough files in huge_workspace. A SHOW FILES command can produce > the same effect. This issue proposes a new BOOT option named > drill.exec.storage.file.recursive_listing_max_size with a default value of, > say 10 000. If a file listing task exceeds this limit then the initiating > operation is terminated with a UserException preventing runaway resource > usage. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598882#comment-17598882 ] ASF GitHub Bot commented on DRILL-8283: --- jnturton commented on PR #2632: URL: https://github.com/apache/drill/pull/2632#issuecomment-1234106371 I just went through an exercise to replace the `boolean recursive` parameter with a new `RecursionOpts recurOpts` that allows the specification of the max listing size, but that change rippled and in some cases callers _also_ don't have access to the Drill config. I now think the only reasonable way for this limit to reach the file listing utility classes is by it becoming an env var. > Add a configurable recursive file listing size limit > > > Key: DRILL-8283 > URL: https://issues.apache.org/jira/browse/DRILL-8283 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Other >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > > Currently a malicious or merely unwitting user can crash their Drill foreman > by sending > {code:java} > select * from dfs.huge_workspace limit 10 > {code} > causing the query planner to recurse over every file in huge_workspace and > culminating in > {code:java} > 2022-08-09 15:13:22,251 [1d0da29f-e50c-fd51-43d9-8a5086d52c4e:foreman] ERROR > o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, > exiting. Information message: Unable to handle out of memory condition in > Foreman.java.lang.OutOfMemoryError: null {code} > if there are enough files in huge_workspace. A SHOW FILES command can produce > the same effect. This issue proposes a new BOOT option named > drill.exec.storage.file.recursive_listing_max_size with a default value of, > say 10 000. If a file listing task exceeds this limit then the initiating > operation is terminated with a UserException preventing runaway resource > usage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
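The listing size limit this issue proposes amounts to a counter checked during recursion that aborts the whole operation once exceeded. The sketch below again uses an illustrative in-memory tree rather than Drill's FileSystemUtil, and the exception type stands in for Drill's UserException:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: recursive listing that fails fast once a configurable size limit
// is hit, instead of accumulating FileStatuses until the foreman runs out
// of heap. Tree model: name -> child names (a name absent from the map is a leaf).
public class BoundedListing {
  public static List<String> listRecursive(Map<String, List<String>> tree,
                                           String root, int maxSize) {
    List<String> out = new ArrayList<>();
    walk(tree, root, maxSize, out);
    return out;
  }

  private static void walk(Map<String, List<String>> tree, String node,
                           int maxSize, List<String> out) {
    for (String child : tree.getOrDefault(node, List.of())) {
      out.add(child);
      if (out.size() > maxSize) {
        // Drill would raise a UserException here; IllegalStateException is a stand-in.
        throw new IllegalStateException("File listing size limit of " + maxSize + " exceeded");
      }
      walk(tree, child, maxSize, out);
    }
  }
}
```

Checking the bound inside the recursion (rather than on the final result) is what keeps the memory footprint proportional to the limit rather than to the workspace size.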
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598720#comment-17598720 ] ASF GitHub Bot commented on DRILL-8289: --- cgivre commented on code in PR #2634: URL: https://github.com/apache/drill/pull/2634#discussion_r960184271 ## contrib/udfs/src/main/java/org/apache/drill/exec/udfs/ThreatHuntingFunctions.java: ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.udfs; + +import io.netty.buffer.DrillBuf; +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.holders.Float8Holder; +import org.apache.drill.exec.expr.holders.VarCharHolder; + +import javax.inject.Inject; + +public class ThreatHuntingFunctions { + /** + * Punctuation pattern is useful for comparing log entries. It extracts the all the punctuation and returns + * that pattern. Spaces are replaced with an underscore. + * + * Usage: SELECT punctuation_pattern( string ) FROM... 
+ */ + @FunctionTemplate(names = {"punctuation_pattern", "punctuationPattern"}, +scope = FunctionTemplate.FunctionScope.SIMPLE, +nulls = FunctionTemplate.NullHandling.NULL_IF_NULL) + public static class PunctuationPatternFunction implements DrillSimpleFunc { + +@Param +VarCharHolder rawInput; + +@Output +VarCharHolder out; + +@Inject +DrillBuf buffer; + +@Override +public void setup() { +} + +@Override +public void eval() { + + String input = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(rawInput.start, rawInput.end, rawInput.buffer); + + String punctuationPattern = input.replaceAll("[a-zA-Z0-9]", ""); + punctuationPattern = punctuationPattern.replaceAll(" ", "_"); + + out.buffer = buffer; + out.start = 0; + out.end = punctuationPattern.getBytes().length; Review Comment: Fixed > Add Threat Hunting Functions > > > Key: DRILL-8289 > URL: https://issues.apache.org/jira/browse/DRILL-8289 > Project: Apache Drill > Issue Type: New Feature > Components: Functions - Drill >Affects Versions: 2.0.0 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 2.0.0 > > > # Threat Hunting Functions > These functions are useful for doing threat hunting with Apache Drill. These > were inspired by huntlib.[1] > The functions are: > * `punctuation_pattern()`: Extracts the pattern of punctuation in > text. > * `entropy()`: This function calculates the Shannon Entropy of a > given string of text. > * `entropyPerByte()`: This function calculates the Shannon Entropy of > a given string of text, normed for the string length. > [1]: https://github.com/target/huntlib -- This message was sent by Atlassian Jira (v8.20.10#820010)
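The core transform in the PunctuationPatternFunction reviewed above can be sketched as plain Java outside the Drill UDF machinery. This is an illustrative standalone version; the class and method names here are not part of Drill's API.

```java
// Standalone sketch of the punctuation_pattern() logic: strip
// alphanumerics, then replace the remaining spaces with underscores.
public class PunctuationPatternSketch {
  public static String punctuationPattern(String input) {
    String pattern = input.replaceAll("[a-zA-Z0-9]", "");
    return pattern.replace(" ", "_");
  }

  public static void main(String[] args) {
    // e.g. "GET /index.html HTTP/1.1" -> "_/._/."
    System.out.println(punctuationPattern("GET /index.html HTTP/1.1"));
  }
}
```

Log lines with the same structure collapse to the same pattern, which is what makes this useful for spotting outlier entries during a hunt.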
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598721#comment-17598721 ] ASF GitHub Bot commented on DRILL-8289: --- cgivre commented on PR #2634: URL: https://github.com/apache/drill/pull/2634#issuecomment-1233689516 Thanks @pjfanning for the review. I addressed your review comments.
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598719#comment-17598719 ] ASF GitHub Bot commented on DRILL-8289: --- cgivre commented on code in PR #2634: URL: https://github.com/apache/drill/pull/2634#discussion_r960183139 ## contrib/udfs/src/main/java/org/apache/drill/exec/udfs/ThreatHuntingFunctions.java: ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.udfs; + +import io.netty.buffer.DrillBuf; +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.holders.Float8Holder; +import org.apache.drill.exec.expr.holders.VarCharHolder; + +import javax.inject.Inject; + +public class ThreatHuntingFunctions { + /** + * Punctuation pattern is useful for comparing log entries. 
It extracts the all the punctuation and returns Review Comment: Fixed
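The `entropy()` and `entropyPerByte()` functions listed in DRILL-8289 compute Shannon entropy over a string's character distribution. A minimal standalone sketch of that calculation follows; the names are illustrative, not Drill's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative Shannon entropy over a string's character frequencies,
// in bits; entropyPerByte norms the result by the string length.
public class EntropySketch {
  public static double entropy(String s) {
    Map<Character, Integer> counts = new HashMap<>();
    for (char c : s.toCharArray()) {
      counts.merge(c, 1, Integer::sum);
    }
    double result = 0.0;
    for (int count : counts.values()) {
      double p = (double) count / s.length();
      result -= p * (Math.log(p) / Math.log(2));  // log base 2
    }
    return result;
  }

  public static double entropyPerByte(String s) {
    return entropy(s) / s.length();
  }
}
```

High entropy in a hostname or URL column is a common indicator of machine-generated (e.g. DGA) names, which is why a threat-hunting UDF pack includes it.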
[jira] [Commented] (DRILL-8290) Short circuit recursive file listings for LIMIT 0 queries
[ https://issues.apache.org/jira/browse/DRILL-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598408#comment-17598408 ] ASF GitHub Bot commented on DRILL-8290: --- jnturton opened a new pull request, #2636: URL: https://github.com/apache/drill/pull/2636 # [DRILL-8290](https://issues.apache.org/jira/browse/DRILL-8290): Short circuit recursive file listings for LIMIT 0 queries ## Description The existing LIMIT 0 query optimisations do not prevent a query run against the top of a deep DFS directory tree from recursively listing FileStatuses for everything within it using a pool of worker threads. This PR adds a new optimisation whereby such queries will recurse into the directory tree on a single thread that returns as soon as any single FileStatus has been obtained. ## Documentation Mention in the docs on LIMIT 0 optimisations. ## Testing TODO > Short circuit recursive file listings for LIMIT 0 queries > - > > Key: DRILL-8290 > URL: https://issues.apache.org/jira/browse/DRILL-8290 > Project: Apache Drill > Issue Type: Improvement > Components: Query Planning Optimization >Affects Versions: 1.20.2 >Reporter: James Turton >Priority: Minor > Fix For: 2.0.0 > > > The existing LIMIT 0 query optimisations do not prevent a query run against > the top of a deep DFS directory tree from recursively listing FileStatuses > for everything within it using a pool of worker threads. This Issue proposes > a new optimisation whereby such queries will recurse into the directory tree > on a single thread that returns as soon as any single FileStatus has been > obtained. -- This message was sent by Atlassian Jira (v8.20.10#820010)
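The single-threaded, return-on-first-hit recursion that DRILL-8290 describes can be sketched with `java.nio.file`. This is an illustration of the short-circuit idea only, not Drill's FileStatus listing code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Sketch: recurse into a directory tree on the calling thread and
// return as soon as any single regular file has been found, instead
// of listing the whole tree with a pool of worker threads.
public class FirstFileSketch {
  public static Optional<Path> firstFile(Path dir) throws IOException {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        if (Files.isRegularFile(entry)) {
          return Optional.of(entry);  // short circuit: LIMIT 0 needs only one
        }
        if (Files.isDirectory(entry)) {
          Optional<Path> found = firstFile(entry);
          if (found.isPresent()) {
            return found;
          }
        }
      }
    }
    return Optional.empty();
  }
}
```

For a LIMIT 0 query only the schema matters, so one file is enough to plan against; the full recursive listing is wasted work on a deep tree.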
[jira] [Commented] (DRILL-8259) Support advanced HBase persistence storage options
[ https://issues.apache.org/jira/browse/DRILL-8259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598227#comment-17598227 ] ASF GitHub Bot commented on DRILL-8259: --- Z0ltrix commented on code in PR #2596: URL: https://github.com/apache/drill/pull/2596#discussion_r959257458 ## contrib/storage-hbase/src/main/java/org/apache/drill/exec/store/hbase/config/HBasePersistentStoreProvider.java: ## @@ -20,116 +20,249 @@ import java.io.IOException; import java.util.Map; +import org.apache.drill.common.AutoCloseables; import org.apache.drill.common.exceptions.DrillRuntimeException; import org.apache.drill.exec.exception.StoreException; import org.apache.drill.exec.store.hbase.DrillHBaseConstants; import org.apache.drill.exec.store.sys.PersistentStore; import org.apache.drill.exec.store.sys.PersistentStoreConfig; import org.apache.drill.exec.store.sys.PersistentStoreRegistry; import org.apache.drill.exec.store.sys.store.provider.BasePersistentStoreProvider; +import org.apache.drill.shaded.guava.com.google.common.annotations.VisibleForTesting; +import org.apache.drill.shaded.guava.com.google.common.collect.Maps; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.hbase.HBaseConfiguration; -import org.apache.hadoop.hbase.HColumnDescriptor; import org.apache.hadoop.hbase.HConstants; -import org.apache.hadoop.hbase.HTableDescriptor; import org.apache.hadoop.hbase.TableName; import org.apache.hadoop.hbase.client.Admin; +import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder; import org.apache.hadoop.hbase.client.Connection; import org.apache.hadoop.hbase.client.ConnectionFactory; +import org.apache.hadoop.hbase.client.Durability; import org.apache.hadoop.hbase.client.Table; +import org.apache.hadoop.hbase.client.TableDescriptor; +import org.apache.hadoop.hbase.client.TableDescriptorBuilder; +import org.apache.hadoop.hbase.io.compress.Compression.Algorithm; +import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding; import 
org.apache.hadoop.hbase.util.Bytes; -import org.apache.drill.shaded.guava.com.google.common.annotations.VisibleForTesting; - public class HBasePersistentStoreProvider extends BasePersistentStoreProvider { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HBasePersistentStoreProvider.class); - static final byte[] FAMILY = Bytes.toBytes("s"); + public static final byte[] DEFAULT_FAMILY_NAME = Bytes.toBytes("s"); - static final byte[] QUALIFIER = Bytes.toBytes("d"); + public static final byte[] QUALIFIER_NAME = Bytes.toBytes("d"); + + private static final String HBASE_CLIENT_ID = "drill-hbase-persistent-store-client"; private final TableName hbaseTableName; + private final byte[] family; + + private Table hbaseTable; + private Configuration hbaseConf; - private Connection connection; + private final Map tableConfig; - private Table hbaseTable; + private final Map columnConfig; + private Connection connection; + + @SuppressWarnings("unchecked") public HBasePersistentStoreProvider(PersistentStoreRegistry registry) { -@SuppressWarnings("unchecked") -final Map config = (Map) registry.getConfig().getAnyRef(DrillHBaseConstants.SYS_STORE_PROVIDER_HBASE_CONFIG); -this.hbaseConf = HBaseConfiguration.create(); -this.hbaseConf.set(HConstants.HBASE_CLIENT_INSTANCE_ID, "drill-hbase-persistent-store-client"); -if (config != null) { - for (Map.Entry entry : config.entrySet()) { -this.hbaseConf.set(entry.getKey(), String.valueOf(entry.getValue())); +final Map hbaseConfig = (Map) registry.getConfig().getAnyRef(DrillHBaseConstants.SYS_STORE_PROVIDER_HBASE_CONFIG); +if (registry.getConfig().hasPath(DrillHBaseConstants.SYS_STORE_PROVIDER_HBASE_TABLE_CONFIG)) { + tableConfig = (Map) registry.getConfig().getAnyRef(DrillHBaseConstants.SYS_STORE_PROVIDER_HBASE_TABLE_CONFIG); +} else { + tableConfig = Maps.newHashMap(); +} +if (registry.getConfig().hasPath(DrillHBaseConstants.SYS_STORE_PROVIDER_HBASE_COLUMN_CONFIG)) { + columnConfig = (Map) 
registry.getConfig().getAnyRef(DrillHBaseConstants.SYS_STORE_PROVIDER_HBASE_COLUMN_CONFIG); +} else { + columnConfig = Maps.newHashMap(); +} +hbaseConf = HBaseConfiguration.create(); Review Comment: > As you know, HBase is a nightmare for operational services due to the complexity of the settings. The actual value in the above example is not a recommended value, no unique value is appropriate for every case, but is simply the type of value that this parameter has to fill, is "true/false", not "0/1". Hi @luocooong, I'm still worried about the defaults, especially when Drill creates the table on its own... Am I correct that you don't set any defaults except SYS_STORE_PROVIDER_HBASE_TABLE, SYS_STORE_PROVIDER_HBASE_NAMESPACE and SYS_STORE_PROVIDER_HBASE_FAMILY?
[jira] [Commented] (DRILL-8259) Support advanced HBase persistence storage options
[ https://issues.apache.org/jira/browse/DRILL-8259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598224#comment-17598224 ] ASF GitHub Bot commented on DRILL-8259: --- Z0ltrix commented on PR #2596: URL: https://github.com/apache/drill/pull/2596#issuecomment-1232571619 > @Z0ltrix Would you mind doing a formal review on this PR? @luocooong asked me but I don't really have enough experience with HBase to comment intelligently on this. If you're already happy with this, all you have to do is leave a `+1`. sorry for the late response, i would love to do the review :) > Support advanced HBase persistence storage options > -- > > Key: DRILL-8259 > URL: https://issues.apache.org/jira/browse/DRILL-8259 > Project: Apache Drill > Issue Type: New Feature > Components: Storage - HBase >Reporter: Cong Luo >Assignee: Cong Luo >Priority: Major > Fix For: 2.0.0 > > > Its contents are as follows > {code:java} > sys.store.provider: { > class: "org.apache.drill.exec.store.hbase.config.HBasePStoreProvider", > hbase: { > table : "drill_store", > config: { > "hbase.zookeeper.quorum": "zk_host3,zk_host2,zk_host1", > "hbase.zookeeper.property.clientPort": "2181", > "zookeeper.znode.parent": "/hbase-test" > }, > table_config : { > "durability": "ASYNC_WAL", > "compaction_enabled": false, > "split_enabled": false, > "max_filesize": 10737418240, > "memstore_flushsize": 536870912 > }, > column_config : { > "versions": 1, > "ttl": 2626560, > "compression": "SNAPPY", > "blockcache": true, > "blocksize": 131072, > "data_block_encoding": "FAST_DIFF", > "in_memory": true, > "dfs_replication": 3 > } > } > }{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597394#comment-17597394 ] ASF GitHub Bot commented on DRILL-8289: --- pjfanning commented on code in PR #2634: URL: https://github.com/apache/drill/pull/2634#discussion_r957792833 ## contrib/udfs/src/main/java/org/apache/drill/exec/udfs/ThreatHuntingFunctions.java: ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.udfs; + +import io.netty.buffer.DrillBuf; +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.holders.Float8Holder; +import org.apache.drill.exec.expr.holders.VarCharHolder; + +import javax.inject.Inject; + +public class ThreatHuntingFunctions { + /** + * Punctuation pattern is useful for comparing log entries. 
It extracts the all the punctuation and returns Review Comment: `the all the` should probably be `all the`
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597393#comment-17597393 ] ASF GitHub Bot commented on DRILL-8289: --- pjfanning commented on code in PR #2634: URL: https://github.com/apache/drill/pull/2634#discussion_r957792365 ## contrib/udfs/src/main/java/org/apache/drill/exec/udfs/ThreatHuntingFunctions.java: ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.udfs; + +import io.netty.buffer.DrillBuf; +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.holders.Float8Holder; +import org.apache.drill.exec.expr.holders.VarCharHolder; + +import javax.inject.Inject; + +public class ThreatHuntingFunctions { + /** + * Punctuation pattern is useful for comparing log entries. It extracts the all the punctuation and returns + * that pattern. Spaces are replaced with an underscore. + * + * Usage: SELECT punctuation_pattern( string ) FROM... 
+ */ + @FunctionTemplate(names = {"punctuation_pattern", "punctuationPattern"}, +scope = FunctionTemplate.FunctionScope.SIMPLE, +nulls = FunctionTemplate.NullHandling.NULL_IF_NULL) + public static class PunctuationPatternFunction implements DrillSimpleFunc { + +@Param +VarCharHolder rawInput; + +@Output +VarCharHolder out; + +@Inject +DrillBuf buffer; + +@Override +public void setup() { +} + +@Override +public void eval() { + + String input = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(rawInput.start, rawInput.end, rawInput.buffer); + + String punctuationPattern = input.replaceAll("[a-zA-Z0-9]", ""); + punctuationPattern = punctuationPattern.replaceAll(" ", "_"); + + out.buffer = buffer; + out.start = 0; + out.end = punctuationPattern.getBytes().length; Review Comment: getBytes is safer if you specify a charset, otherwise you get the JVM default which differs from machine to machine (unless Drill startup shell scripts specify `-Dfile.encoding=...`) > Add Threat Hunting Functions > > > Key: DRILL-8289 > URL: https://issues.apache.org/jira/browse/DRILL-8289 > Project: Apache Drill > Issue Type: New Feature > Components: Functions - Drill >Affects Versions: 2.0.0 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 2.0.0 > > > # Threat Hunting Functions > These functions are useful for doing threat hunting with Apache Drill. These > were inspired by huntlib.[1] > The functions are: > * `punctuation_pattern()`: Extracts the pattern of punctuation in > text. > * `entropy()`: This function calculates the Shannon Entropy of a > given string of text. > * `entropyPerByte()`: This function calculates the Shannon Entropy of > a given string of text, normed for the string length. > [1]: https://github.com/target/huntlib -- This message was sent by Atlassian Jira (v8.20.10#820010)
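pjfanning's point about `getBytes` can be shown directly: without an explicit charset the platform default is used, so the byte length of the same string can differ from machine to machine; passing `StandardCharsets.UTF_8` pins the result. A small illustrative example:

```java
import java.nio.charset.StandardCharsets;

// The byte length of a string depends on the encoding chosen;
// specifying it explicitly avoids relying on the JVM default charset.
public class CharsetSketch {
  public static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    System.out.println(utf8Length("abc"));     // 3: ASCII is one byte per char in UTF-8
    System.out.println(utf8Length("\u00e9"));  // 2: 'é' takes two bytes in UTF-8
  }
}
```

In the UDF above, the value written to `out.end` must match the bytes actually copied into the buffer, so computing both with the same explicit charset keeps them consistent.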
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597327#comment-17597327 ] ASF GitHub Bot commented on DRILL-8287: --- cgivre merged PR #2633: URL: https://github.com/apache/drill/pull/2633 > Add Support for Keyset Based Pagination > --- > > Key: DRILL-8287 > URL: https://issues.apache.org/jira/browse/DRILL-8287 > Project: Apache Drill > Issue Type: New Feature > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 2.0.0 > > > Some APIs such as HubSpot use values in the result set to indicate whether > there are additional pages. This PR adds support for this kind of > pagination. Note that current implementation only works for JSON based APIs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597256#comment-17597256 ] ASF GitHub Bot commented on DRILL-8287: --- jnturton commented on code in PR #2633: URL: https://github.com/apache/drill/pull/2633#discussion_r957489232 ## exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/parser/SimpleMessageParser.java: ## @@ -129,6 +135,44 @@ private boolean parseInnerLevel(TokenIterator tokenizer, int level) throws Messa return parseToElement(tokenizer, level + 1); } + /** + * This function is called when a storage plugin needs to retrieve values which have been read. This logic + * enables use of the data path in these situations. Normally, when the datapath is defined, the JSON reader + * will "free-wheel" over unprojected columns or columns outside of the datapath. However, in this case, often + * the values which are being read, are outside the dataPath. This logic offers a way to capture these values + * without creating a ValueVector for them. + * + * @param tokenizer A {@link TokenIterator} of the parsed JSON data. + * @param fieldName A {@link String} of the pagination field name. Review Comment: ```suggestion * @param fieldName A {@link String} of the listener column name. ``` ## exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/TupleParser.java: ## @@ -127,10 +127,19 @@ public TupleParser(JsonLoaderImpl loader, TupleWriter tupleWriter, TupleMetadata @Override public ElementParser onField(String key, TokenIterator tokenizer) { -if (!tupleWriter.isProjected(key)) { +if (projectField(key)) { + return fieldParserFor(key, tokenizer); +} else { return fieldFactory().ignoredFieldParser(); +} + } + + private boolean projectField(String key) { +// This method makes sure that fields necessary for pagination are read. Review Comment: ```suggestion // This method makes sure that fields necessary for column listeners are read. 
``` ## exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/values/ScalarListener.java: ## @@ -76,4 +79,30 @@ protected void setArrayNull() { protected UserException typeConversionError(String jsonType) { return loader.typeConversionError(schema(), jsonType); } + + /** + * Adds a field's most recent value to the column listener map. + * This data is only stored if the listener column map is defined, and has keys. + * @param key The key of the pagination field Review Comment: ```suggestion * @param key The key of the listener field ``` ## exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/values/ScalarListener.java: ## @@ -76,4 +79,30 @@ protected void setArrayNull() { protected UserException typeConversionError(String jsonType) { return loader.typeConversionError(schema(), jsonType); } + + /** + * Adds a field's most recent value to the column listener map. + * This data is only stored if the listener column map is defined, and has keys. + * @param key The key of the pagination field + * @param value The value of to be retained + */ + protected void addValueToListenerMap(String key, String value) { +Map listenerColumnMap = loader.listenerColumnMap(); + +if (listenerColumnMap == null || listenerColumnMap.isEmpty()) { + return; +} else if (listenerColumnMap.containsKey(key) && StringUtils.isNotEmpty(value)) { + listenerColumnMap.put(key, value); +} + } + + protected void addValueToListenerMap(String key, Object value) { +Map paginationMap = loader.listenerColumnMap(); Review Comment: ```suggestion Map listenerMap = loader.listenerColumnMap(); ``` > Add Support for Keyset Based Pagination > --- > > Key: DRILL-8287 > URL: https://issues.apache.org/jira/browse/DRILL-8287 > Project: Apache Drill > Issue Type: New Feature > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 2.0.0 > > > Some APIs such as HubSpot use values in the result set to indicate whether > 
there are additional pages. This PR adds support for this kind of > pagination. Note that current implementation only works for JSON based APIs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597242#comment-17597242 ] ASF GitHub Bot commented on DRILL-8287: --- cgivre commented on code in PR #2633: URL: https://github.com/apache/drill/pull/2633#discussion_r957448436 ## exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/values/ScalarListener.java: ## @@ -76,4 +79,33 @@ protected void setArrayNull() { protected UserException typeConversionError(String jsonType) { return loader.typeConversionError(schema(), jsonType); } + + /** + * Adds a field's most recent value to the pagination map. This is necessary for the HTTP plugin + * for index or keyset pagination where the API transmits values in the results that are used to + * generate the next page. + * + * This data is only stored if the pagination map is defined, and has keys. Review Comment: Done! ## exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/parser/SimpleMessageParser.java: ## @@ -66,11 +68,13 @@ public class SimpleMessageParser implements MessageParser { private final String[] path; + private final Map paginationFields; - public SimpleMessageParser(String dataPath) { + public SimpleMessageParser(String dataPath, Map paginationFields) { Review Comment: Done!
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597234#comment-17597234 ] ASF GitHub Bot commented on DRILL-8287: --- jnturton commented on code in PR #2633: URL: https://github.com/apache/drill/pull/2633#discussion_r957433471 ## exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/values/ScalarListener.java: ## @@ -76,4 +79,33 @@ protected void setArrayNull() { protected UserException typeConversionError(String jsonType) { return loader.typeConversionError(schema(), jsonType); } + + /** + * Adds a field's most recent value to the pagination map. This is necessary for the HTTP plugin + * for index or keyset pagination where the API transmits values in the results that are used to + * generate the next page. + * + * This data is only stored if the pagination map is defined, and has keys. Review Comment: Can this be rewritten in terms of generic column listeners rather than pagination and the HTTP plugin? ## exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/parser/SimpleMessageParser.java: ## @@ -66,11 +68,13 @@ public class SimpleMessageParser implements MessageParser { private final String[] path; + private final Map paginationFields; - public SimpleMessageParser(String dataPath) { + public SimpleMessageParser(String dataPath, Map paginationFields) { Review Comment: Can we rename "pagination" here too? > Add Support for Keyset Based Pagination > --- > > Key: DRILL-8287 > URL: https://issues.apache.org/jira/browse/DRILL-8287 > Project: Apache Drill > Issue Type: New Feature > Components: Storage - HTTP >Affects Versions: 1.20.2 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Fix For: 2.0.0 > > > Some APIs such as HubSpot use values in the result set to indicate whether > there are additional pages. This PR adds support for this kind of > pagination. Note that current implementation only works for JSON based APIs. 
[jira] [Commented] (DRILL-8282) Upgrade to hadoop-common 3.2.4 due to CVE
[ https://issues.apache.org/jira/browse/DRILL-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597229#comment-17597229 ] ASF GitHub Bot commented on DRILL-8282: --- jnturton merged PR #2630: URL: https://github.com/apache/drill/pull/2630 > Upgrade to hadoop-common 3.2.4 due to CVE > -- > > Key: DRILL-8282 > URL: https://issues.apache.org/jira/browse/DRILL-8282 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > https://github.com/advisories/GHSA-8wm5-8h9c-47pc > * this change requires some reload4j dependency changes too - see broken > build - https://github.com/apache/drill/pull/2628 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8282) Upgrade to hadoop-common 3.2.4 due to CVE
[ https://issues.apache.org/jira/browse/DRILL-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597230#comment-17597230 ] ASF GitHub Bot commented on DRILL-8282: --- jnturton merged PR #2635: URL: https://github.com/apache/drill/pull/2635
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597219#comment-17597219 ] ASF GitHub Bot commented on DRILL-8287: --- cgivre commented on PR #2633: URL: https://github.com/apache/drill/pull/2633#issuecomment-1230390598 @jnturton Thanks for the quick review! I addressed your comments. I actually reinserted the commented-out block as that was intended to make sure that the user properly populates the pagination fields. Not sure why I commented that out in the first place.
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597213#comment-17597213 ]

ASF GitHub Bot commented on DRILL-8287:
---

cgivre commented on PR #2633: URL: https://github.com/apache/drill/pull/2633#issuecomment-1230366234

> I'm not sure that the concept of pagination from the HTTP plugin should spill into the JSON reader. Can you abstract it, e.g. by renaming paginationMap to, say, listenerColumnMap?

Fixed.
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597206#comment-17597206 ]

ASF GitHub Bot commented on DRILL-8287:
---

cgivre commented on code in PR #2633: URL: https://github.com/apache/drill/pull/2633#discussion_r957380889

## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpPaginatorConfig.java:
@@ -137,21 +162,28 @@ public String toString() {
       .field("pageSize", pageSize)
       .field("maxRecords", maxRecords)
       .field("method", method)
+      .field("indexParam", indexParam)
+      .field("hasMoreParam", hasMoreParam)
+      .field("nextPageParam", nextPageParam)
       .toString();
   }

   public enum PaginatorMethod {
     OFFSET,
-    PAGE
+    PAGE,
+    INDEX
   }

-  private HttpPaginatorConfig(HttpPaginatorConfig.HttpPaginatorBuilder builder) {
+  /*private HttpPaginatorConfig(HttpPaginatorConfig.HttpPaginatorConfigBuilder builder) {

Review Comment: Oops... Fixed.
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597178#comment-17597178 ]

ASF GitHub Bot commented on DRILL-8287:
---

jnturton commented on code in PR #2633: URL: https://github.com/apache/drill/pull/2633#discussion_r957255881

## contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpPaginatorConfig.java:
@@ -137,21 +162,28 @@ public String toString() {
       .field("pageSize", pageSize)
       .field("maxRecords", maxRecords)
       .field("method", method)
+      .field("indexParam", indexParam)
+      .field("hasMoreParam", hasMoreParam)
+      .field("nextPageParam", nextPageParam)
       .toString();
   }

   public enum PaginatorMethod {
     OFFSET,
-    PAGE
+    PAGE,
+    INDEX
   }

-  private HttpPaginatorConfig(HttpPaginatorConfig.HttpPaginatorBuilder builder) {
+  /*private HttpPaginatorConfig(HttpPaginatorConfig.HttpPaginatorConfigBuilder builder) {

Review Comment: Is this commented-out code meant to be included?
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597174#comment-17597174 ]

ASF GitHub Bot commented on DRILL-8287:
---

jnturton commented on PR #2633: URL: https://github.com/apache/drill/pull/2633#issuecomment-1230204078

I'm not sure that the concept of pagination from the HTTP plugin should spill into the JSON reader. Can you abstract it, e.g. by renaming paginationMap to, say, listenerColumnMap?
[jira] [Commented] (DRILL-8282) Upgrade to hadoop-common 3.2.4 due to CVE
[ https://issues.apache.org/jira/browse/DRILL-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597084#comment-17597084 ]

ASF GitHub Bot commented on DRILL-8282:
---

jnturton opened a new pull request, #2635: URL: https://github.com/apache/drill/pull/2635

# [DRILL-8282](https://issues.apache.org/jira/browse/DRILL-8282): Update hadoop.dll and winutils.exe to 3.2.4

## Description
Completes #2630 by updating hadoop.dll and winutils.exe to 3.2.4.

## Documentation
N/A

## Testing
Launch Drill on Windows.
[jira] [Commented] (DRILL-8282) Upgrade to hadoop-common 3.2.4 due to CVE
[ https://issues.apache.org/jira/browse/DRILL-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17596988#comment-17596988 ]

ASF GitHub Bot commented on DRILL-8282:
---

jnturton commented on PR #2630: URL: https://github.com/apache/drill/pull/2630#issuecomment-1229820379

We also need to update hadoop.dll and winutils.exe.
[jira] [Commented] (DRILL-8282) Upgrade to hadoop-common 3.2.4 due to CVE
[ https://issues.apache.org/jira/browse/DRILL-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17596963#comment-17596963 ]

ASF GitHub Bot commented on DRILL-8282:
---

cgivre commented on PR #2630: URL: https://github.com/apache/drill/pull/2630#issuecomment-1229759970

@jnturton Are we good to merge this?
[jira] [Commented] (DRILL-8289) Add Threat Hunting Functions
[ https://issues.apache.org/jira/browse/DRILL-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17596962#comment-17596962 ]

ASF GitHub Bot commented on DRILL-8289:
---

cgivre opened a new pull request, #2634: URL: https://github.com/apache/drill/pull/2634

# [DRILL-8289](https://issues.apache.org/jira/browse/DRILL-8289): Add Threat Hunting Functions

## Description
See below.

## Documentation
These functions are useful for threat hunting with Apache Drill. They were inspired by huntlib.[1]

The functions are:
* `punctuation_pattern()`: Extracts the pattern of punctuation in text.
* `entropy()`: Calculates the Shannon entropy of a given string of text.
* `entropyPerByte()`: Calculates the Shannon entropy of a given string of text, normed for the string length.

[1]: https://github.com/target/huntlib

## Testing
Added unit tests.

> Add Threat Hunting Functions
> ----------------------------
>
> Key: DRILL-8289
> URL: https://issues.apache.org/jira/browse/DRILL-8289
> Project: Apache Drill
> Issue Type: New Feature
> Components: Functions - Drill
> Affects Versions: 2.0.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 2.0.0
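The two ideas behind these functions can be sketched in plain Java. These are illustrative implementations only; the actual Drill UDF signatures and code in the PR may differ:

```java
import java.util.HashMap;
import java.util.Map;

// Sketches of the concepts behind the threat-hunting functions described
// above: Shannon entropy of a string and its punctuation pattern.
public class HuntSketch {

  // Shannon entropy in bits per character of the input string.
  public static double entropy(String s) {
    Map<Character, Integer> counts = new HashMap<>();
    for (char c : s.toCharArray()) {
      counts.merge(c, 1, Integer::sum);
    }
    double h = 0.0;
    for (int n : counts.values()) {
      double p = (double) n / s.length();
      h -= p * (Math.log(p) / Math.log(2)); // log base 2
    }
    return h;
  }

  // Entropy normalized by string length ("per byte" for single-byte text).
  public static double entropyPerByte(String s) {
    return entropy(s) / s.length();
  }

  // Keep only punctuation: drop letters, digits and whitespace.
  public static String punctuationPattern(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (!Character.isLetterOrDigit(c) && !Character.isWhitespace(c)) {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```

High entropy flags likely-encoded or random strings (DGA domains, encrypted payloads), while the punctuation pattern groups structurally similar log lines regardless of their variable content.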
[jira] [Commented] (DRILL-4232) Support for EXCEPT set operator
[ https://issues.apache.org/jira/browse/DRILL-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17586135#comment-17586135 ]

ASF GitHub Bot commented on DRILL-4232:
---

Leon-WTF commented on PR #2599: URL: https://github.com/apache/drill/pull/2599#issuecomment-1229365001

> @Leon-WTF Is this ready for review?

@cgivre Not yet, I'm handling the EXCEPT case: it needs to remove the duplicate records on the probe side, so I'm trying to add an Agg phase after the setop phase. The Agg phase needs a flag to indicate that it should group by all columns, since the columns cannot be known at planning time. Any suggestions on this?

> Support for EXCEPT set operator
> -------------------------------
>
> Key: DRILL-4232
> URL: https://issues.apache.org/jira/browse/DRILL-4232
> Project: Apache Drill
> Issue Type: New Feature
> Components: Query Planning & Optimization
> Reporter: Victoria Markman
> Assignee: Tengfei Wang
> Priority: Major
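The semantics being implemented in this discussion can be illustrated in miniature. This is not Drill's operator code, just a demonstration of why a duplicate-removing (group-by-all-columns) phase is needed: SQL EXCEPT returns the *distinct* left-side rows that do not appear on the right:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustration of SQL EXCEPT semantics: distinct rows of the left input
// that are absent from the right input. Rows are modeled as strings here;
// in an engine each "row" is a tuple compared on all columns.
public class ExceptSketch {
  public static List<String> except(List<String> left, List<String> right) {
    Set<String> rightSet = new HashSet<>(right);
    Set<String> out = new LinkedHashSet<>(); // dedups, keeps first-seen order
    for (String row : left) {
      if (!rightSet.contains(row)) {
        out.add(row);
      }
    }
    return new ArrayList<>(out);
  }
}
```

Note the anti-join alone would emit "a" twice for a left input of (a, a, b, c); the set (here a LinkedHashSet, in the plan an aggregation over all columns) is what removes the duplicate.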
[jira] [Commented] (DRILL-4232) Support for EXCEPT set operator
[ https://issues.apache.org/jira/browse/DRILL-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17586130#comment-17586130 ]

ASF GitHub Bot commented on DRILL-4232:
---

cgivre commented on PR #2599: URL: https://github.com/apache/drill/pull/2599#issuecomment-1229353429

@Leon-WTF Is this ready for review?
[jira] [Commented] (DRILL-8287) Add Support for Keyset Based Pagination
[ https://issues.apache.org/jira/browse/DRILL-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584917#comment-17584917 ]

ASF GitHub Bot commented on DRILL-8287:
---

cgivre opened a new pull request, #2633: URL: https://github.com/apache/drill/pull/2633

# [DRILL-8287](https://issues.apache.org/jira/browse/DRILL-8287): Add Support for Keyset Based Pagination

## Description
Some APIs such as HubSpot use values in the result set to indicate whether there are additional pages. This PR adds support for this kind of pagination. Note that the current implementation only works for JSON-based APIs.

This PR also addresses [DRILL-8286](https://issues.apache.org/jira/browse/DRILL-8286), which is a minor bugfix for the GoogleSheets config.

## Documentation
Updated Pagination.md.

## Testing
Added unit tests and manually tested against the HubSpot API.
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584825#comment-17584825 ]

ASF GitHub Bot commented on DRILL-8283:
---

jnturton commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r954914617

## exec/java-exec/src/main/java/org/apache/drill/exec/util/FileSystemUtil.java:
@@ -42,6 +47,12 @@ public class FileSystemUtil {
   private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemUtil.class);

+  private static int recursiveListingMaxSize;
+
+  static {
+    recursiveListingMaxSize = DrillConfig.create().getInt(ExecConstants.RECURSIVE_FILE_LISTING_MAX_SIZE);
+  }

Review Comment: That it might be a heavyweight duplication of work was bothering me enough that I went and timed it. It takes about 100 ms when I start embedded Drill locally. That's just enough to make me wonder if it's worth trying to redesign this stuff so that it loads from an existing instance of DrillConfig instead of constructing its own.

> Add a configurable recursive file listing size limit
> ----------------------------------------------------
>
> Key: DRILL-8283
> URL: https://issues.apache.org/jira/browse/DRILL-8283
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.20.2
> Reporter: James Turton
> Assignee: James Turton
> Priority: Minor
> Fix For: 1.20.3
>
> Currently a malicious or merely unwitting user can crash their Drill foreman by sending
> {code:java}
> select * from dfs.huge_workspace limit 10
> {code}
> causing the query planner to recurse over every file in huge_workspace and culminating in
> {code:java}
> 2022-08-09 15:13:22,251 [1d0da29f-e50c-fd51-43d9-8a5086d52c4e:foreman] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in Foreman.
> java.lang.OutOfMemoryError: null
> {code}
> if there are enough files in huge_workspace. A SHOW FILES command can produce the same effect. This issue proposes a new BOOT option named drill.exec.storage.file.recursive_listing_max_size with a default value of, say, 10 000. If a file listing task exceeds this limit then the initiating operation is terminated with a UserException, preventing runaway resource usage.
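The counting-and-abort logic discussed in this PR can be sketched as follows. This is an assumption-laden illustration, not the PR's code: it uses java.nio rather than Hadoop's FileSystem, a simple recursive walk rather than the fork-join pool, and a per-entry counter rather than the PR's per-listing addAndGet; the "limit <= 0 means no limit" convention matches the review comments below:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a bounded recursive file listing: a shared counter is bumped
// for each entry seen and the walk aborts once the configured maximum is
// exceeded, preventing runaway memory usage on huge workspaces.
public class BoundedLister {

  public static List<Path> listRecursive(Path root, int maxSize) throws IOException {
    List<Path> out = new ArrayList<>();
    AtomicInteger counter = new AtomicInteger(); // shared across the recursion
    walk(root, maxSize, counter, out);
    return out;
  }

  private static void walk(Path dir, int maxSize, AtomicInteger counter,
      List<Path> out) throws IOException {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        // A limit of 0 (or less) means no limit.
        if (maxSize > 0 && counter.incrementAndGet() > maxSize) {
          throw new IllegalStateException(
              "File listing exceeded the configured limit of " + maxSize);
        }
        out.add(entry);
        if (Files.isDirectory(entry)) {
          walk(entry, maxSize, counter, out);
        }
      }
    }
  }
}
```

The AtomicInteger matters in the real implementation because the listing tasks run concurrently in a fork-join pool, so the count must be shared safely across threads.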
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584753#comment-17584753 ]

ASF GitHub Bot commented on DRILL-8283:
---

jnturton commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r954781871

## exec/java-exec/src/main/java/org/apache/drill/exec/util/FileSystemUtil.java:
@@ -302,12 +332,32 @@ protected List compute() {
     List tasks = new ArrayList<>();
     try {
-      for (FileStatus status : fs.listStatus(path, filter)) {
+      FileStatus[] dirFs = fs.listStatus(path, filter);
+      if (recursiveListingMaxSize > 0 && fileCounter.addAndGet(dirFs.length) > recursiveListingMaxSize) {
+        throw UserException

Review Comment: @vvysotskyi I've added an attempt to do that now.
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584726#comment-17584726 ]

ASF GitHub Bot commented on DRILL-8283:
---

vvysotskyi commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r954723117

## exec/java-exec/src/main/java/org/apache/drill/exec/util/FileSystemUtil.java:
@@ -302,12 +332,32 @@ protected List compute() {
     List tasks = new ArrayList<>();
     try {
-      for (FileStatus status : fs.listStatus(path, filter)) {
+      FileStatus[] dirFs = fs.listStatus(path, filter);
+      if (recursiveListingMaxSize > 0 && fileCounter.addAndGet(dirFs.length) > recursiveListingMaxSize) {
+        throw UserException

Review Comment: This code is executed within a fork join pool. Can we somehow stop executing all tasks once the count is reached?
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584725#comment-17584725 ]

ASF GitHub Bot commented on DRILL-8283:
---

vvysotskyi commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r954723117

## exec/java-exec/src/main/java/org/apache/drill/exec/util/FileSystemUtil.java:
@@ -302,12 +332,32 @@ protected List compute() {
     List tasks = new ArrayList<>();
     try {
-      for (FileStatus status : fs.listStatus(path, filter)) {
+      FileStatus[] dirFs = fs.listStatus(path, filter);
+      if (recursiveListingMaxSize > 0 && fileCounter.addAndGet(dirFs.length) > recursiveListingMaxSize) {
+        throw UserException

Review Comment: This code is executed within a fork join pool, and if the error suppression flag is enabled, it will call task.fork(). Can we somehow stop executing all tasks once the count is reached?
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584701#comment-17584701 ]

ASF GitHub Bot commented on DRILL-8283:
---

jnturton commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r954644543

## exec/java-exec/src/main/resources/drill-module.conf:
@@ -115,7 +115,8 @@ drill.exec: {
   text: {
     buffer.size: 262144,
     batch.size: 4000
-  }
+  },
+  recursive_listing_max_size: 1

Review Comment: A limit of 0 (or less) now means no limit, and the default, for this PR, is now 0.
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584463#comment-17584463 ]

ASF GitHub Bot commented on DRILL-8283:
---

vvysotskyi commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r954278444

## exec/java-exec/src/main/resources/drill-module.conf:
@@ -115,7 +115,8 @@ drill.exec: {
   text: {
     buffer.size: 262144,
     batch.size: 4000
-  }
+  },
+  recursive_listing_max_size: 1

Review Comment: Yes, the default value should be adjusted. For the big data world, thousands of files is quite a small amount. For non-Parquet files a FileStatus is small, so it shouldn't cause large pressure on memory. For Parquet files, it would be good to provide the functionality to disable reading metadata for planning and use it only during execution, to avoid issues with a huge number of files.
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584449#comment-17584449 ]

ASF GitHub Bot commented on DRILL-8283:
---

pjfanning commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r954255629

## exec/java-exec/src/main/resources/drill-module.conf:
@@ -115,7 +115,8 @@ drill.exec: {
   text: {
     buffer.size: 262144,
     batch.size: 4000
-  }
+  },
+  recursive_listing_max_size: 1

Review Comment: My 2 cents is that limits ideally should be set by default to a sensible level. For Drill 2.0.0, enforcing that some sort of limit is set would be something that I'd support. For Drill 1.x, it would not be a good idea to enforce limits by default but supporting them optionally would be useful (to avoid introducing changes that might force users to tune configs in a minor release).
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584429#comment-17584429 ]

ASF GitHub Bot commented on DRILL-8283:
---

vvysotskyi commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r954203172

## exec/java-exec/src/main/resources/drill-module.conf:
@@ -115,7 +115,8 @@ drill.exec: {
   text: {
     buffer.size: 262144,
     batch.size: 4000
-  }
+  },
+  recursive_listing_max_size: 1

Review Comment: Could you please make this limit optional?
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584296#comment-17584296 ]

ASF GitHub Bot commented on DRILL-8283:
---

cgivre commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r953900054

## exec/java-exec/src/main/java/org/apache/drill/exec/util/FileSystemUtil.java:
@@ -42,6 +47,12 @@ public class FileSystemUtil {
   private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemUtil.class);

+  private static int recursiveListingMaxSize;
+
+  static {
+    recursiveListingMaxSize = DrillConfig.create().getInt(ExecConstants.RECURSIVE_FILE_LISTING_MAX_SIZE);
+  }

Review Comment: This looks wonky but correct.
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584048#comment-17584048 ] ASF GitHub Bot commented on DRILL-8283: --- jnturton commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r953428323 ## exec/java-exec/src/main/java/org/apache/drill/exec/util/FileSystemUtil.java: ## @@ -42,6 +47,12 @@ public class FileSystemUtil { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemUtil.class); + private static int recursiveListingMaxSize; + + static { +recursiveListingMaxSize = DrillConfig.create().getInt(ExecConstants.RECURSIVE_FILE_LISTING_MAX_SIZE); + } Review Comment: This route to the config option felt pretty weird, I don't know if there's a better way? > Add a configurable recursive file listing size limit > > > Key: DRILL-8283 > URL: https://issues.apache.org/jira/browse/DRILL-8283 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Other >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > > Currently a malicious or merely unwitting user can crash their Drill foreman > by sending > {code:java} > select * from dfs.huge_workspace limit 10 > {code} > causing the query planner to recurse over every file in huge_workspace and > culminating in > {code:java} > 2022-08-09 15:13:22,251 [1d0da29f-e50c-fd51-43d9-8a5086d52c4e:foreman] ERROR > o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, > exiting. Information message: Unable to handle out of memory condition in > Foreman.java.lang.OutOfMemoryError: null {code} > if there are enough files in huge_workspace. A SHOW FILES command can produce > the same effect. This issue proposes a new BOOT option named > drill.exec.storage.file.recursive_listing_max_size with a default value of, > say 10 000. 
If a file listing task exceeds this limit then the initiating > operation is terminated with a UserException preventing runaway resource > usage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584047#comment-17584047 ] ASF GitHub Bot commented on DRILL-8283: --- jnturton commented on code in PR #2632: URL: https://github.com/apache/drill/pull/2632#discussion_r953427756 ## exec/java-exec/src/main/java/org/apache/drill/exec/util/FileSystemUtil.java: ## @@ -42,6 +47,12 @@ public class FileSystemUtil { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemUtil.class); + private static int recursiveListingMaxSize; + + static { +recursiveListingMaxSize = DrillConfig.create().getInt(ExecConstants.RECURSIVE_FILE_LISTING_MAX_SIZE); Review Comment: This route to the config option felt pretty weird, I don't know if there's a better way? > Add a configurable recursive file listing size limit > > > Key: DRILL-8283 > URL: https://issues.apache.org/jira/browse/DRILL-8283 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Other >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > > Currently a malicious or merely unwitting user can crash their Drill foreman > by sending > {code:java} > select * from dfs.huge_workspace limit 10 > {code} > causing the query planner to recurse over every file in huge_workspace and > culminating in > {code:java} > 2022-08-09 15:13:22,251 [1d0da29f-e50c-fd51-43d9-8a5086d52c4e:foreman] ERROR > o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, > exiting. Information message: Unable to handle out of memory condition in > Foreman.java.lang.OutOfMemoryError: null {code} > if there are enough files in huge_workspace. A SHOW FILES command can produce > the same effect. This issue proposes a new BOOT option named > drill.exec.storage.file.recursive_listing_max_size with a default value of, > say 10 000. 
If a file listing task exceeds this limit then the initiating > operation is terminated with a UserException preventing runaway resource > usage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8283) Add a configurable recursive file listing size limit
[ https://issues.apache.org/jira/browse/DRILL-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584043#comment-17584043 ] ASF GitHub Bot commented on DRILL-8283: --- jnturton opened a new pull request, #2632: URL: https://github.com/apache/drill/pull/2632 # [DRILL-8283](https://issues.apache.org/jira/browse/DRILL-8283): Add a configurable recursive file listing size limit ## Description Currently a malicious or merely unwitting user can crash their Drill foreman by sending ``` select * from dfs.huge_workspace limit 10 ``` causing the query planner to recurse over every file in huge_workspace and culminating in ``` 2022-08-09 15:13:22,251 [1d0da29f-e50c-fd51-43d9-8a5086d52c4e:foreman] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in Foreman.java.lang.OutOfMemoryError: null ``` if there are enough files in huge_workspace. A SHOW FILES command can produce the same effect. This issue proposes a new BOOT option named drill.exec.storage.file.recursive_listing_max_size with a default value of, say 10 000. If a file listing task exceeds this limit then the initiating operation is terminated with a UserException preventing runaway resource usage. 
## Documentation New entry on https://drill.apache.org/docs/start-up-options/ ## Testing FileSystemUtilTest#testRecursiveListingMaxSize > Add a configurable recursive file listing size limit > > > Key: DRILL-8283 > URL: https://issues.apache.org/jira/browse/DRILL-8283 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Other >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > > Currently a malicious or merely unwitting user can crash their Drill foreman > by sending > {code:java} > select * from dfs.huge_workspace limit 10 > {code} > causing the query planner to recurse over every file in huge_workspace and > culminating in > {code:java} > 2022-08-09 15:13:22,251 [1d0da29f-e50c-fd51-43d9-8a5086d52c4e:foreman] ERROR > o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, > exiting. Information message: Unable to handle out of memory condition in > Foreman.java.lang.OutOfMemoryError: null {code} > if there are enough files in huge_workspace. A SHOW FILES command can produce > the same effect. This issue proposes a new BOOT option named > drill.exec.storage.file.recursive_listing_max_size with a default value of, > say 10 000. If a file listing task exceeds this limit then the initiating > operation is terminated with a UserException preventing runaway resource > usage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
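The cap described in this PR can be sketched in plain Java. The sketch below is a hypothetical stand-in, not Drill's actual FileSystemUtil: the class and method names are invented, java.io.File replaces the Hadoop FileSystem API, and IllegalStateException stands in for Drill's UserException.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the DRILL-8283 guard: abort a recursive file
// listing as soon as it grows past a configured maximum size.
public class BoundedListing {

  // Lists all files and directories under root, recursively, throwing
  // once more than maxSize entries have been collected.
  public static List<File> listRecursive(File root, int maxSize) {
    List<File> results = new ArrayList<>();
    walk(root, results, maxSize);
    return results;
  }

  private static void walk(File dir, List<File> results, int maxSize) {
    File[] children = dir.listFiles();
    if (children == null) {
      return; // not a directory, or it could not be read
    }
    for (File child : children) {
      if (results.size() >= maxSize) {
        // Drill would raise a UserException here to stop runaway resource usage.
        throw new IllegalStateException(
            "Recursive file listing limit of " + maxSize + " exceeded under " + root);
      }
      results.add(child);
      if (child.isDirectory()) {
        walk(child, results, maxSize);
      }
    }
  }
}
```

The important property is that the check runs per entry during the walk, so memory use is bounded before the listing completes, rather than being validated after the fact.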
[jira] [Commented] (DRILL-7856) Add lgtm badge to Drill and fix alerts
[ https://issues.apache.org/jira/browse/DRILL-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583648#comment-17583648 ] ASF GitHub Bot commented on DRILL-7856: --- cgivre closed pull request #2187: DRILL-7856 Add lgtm badge to Drill and fix alerts URL: https://github.com/apache/drill/pull/2187 > Add lgtm badge to Drill and fix alerts > -- > > Key: DRILL-7856 > URL: https://issues.apache.org/jira/browse/DRILL-7856 > Project: Apache Drill > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.18.0 >Reporter: Vitalii Diravka >Priority: Trivial > Labels: badge, github > > Consider adding new badges to Drill github, for instance _lgtm_ badges (code > quality and alerts number): > [https://lgtm.com/projects/g/apache/drill/context:java] > As an example please check: > [https://github.com/kaitoy/pcap4j] > As a separate ticket can be considered decreasing the number of alerts of > Drill project: > https://lgtm.com/projects/g/apache/drill/alerts/?mode=list -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-7856) Add lgtm badge to Drill and fix alerts
[ https://issues.apache.org/jira/browse/DRILL-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583647#comment-17583647 ] ASF GitHub Bot commented on DRILL-7856: --- cgivre commented on PR #2187: URL: https://github.com/apache/drill/pull/2187#issuecomment-1224126726 LGTM is closing in Dec, 2022. https://github.blog/2022-08-15-the-next-step-for-lgtm-com-github-code-scanning/ > Add lgtm badge to Drill and fix alerts > -- > > Key: DRILL-7856 > URL: https://issues.apache.org/jira/browse/DRILL-7856 > Project: Apache Drill > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.18.0 >Reporter: Vitalii Diravka >Priority: Trivial > Labels: badge, github > > Consider adding new badges to Drill github, for instance _lgtm_ badges (code > quality and alerts number): > [https://lgtm.com/projects/g/apache/drill/context:java] > As an example please check: > [https://github.com/kaitoy/pcap4j] > As a separate ticket can be considered decreasing the number of alerts of > Drill project: > https://lgtm.com/projects/g/apache/drill/alerts/?mode=list -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8282) upgrade to hadoop-common 3.2.4 due to cve
[ https://issues.apache.org/jira/browse/DRILL-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583216#comment-17583216 ] ASF GitHub Bot commented on DRILL-8282: --- pjfanning opened a new pull request, #2630: URL: https://github.com/apache/drill/pull/2630 ## Description There is a CVE fix in hadoop 3.2.4 ## Documentation (Please describe user-visible changes similar to what should appear in the Drill documentation.) ## Testing (Please describe how this PR has been tested.) > upgrade to hadoop-common 3.2.4 due to cve > -- > > Key: DRILL-8282 > URL: https://issues.apache.org/jira/browse/DRILL-8282 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > https://github.com/advisories/GHSA-8wm5-8h9c-47pc > * this change requires some reload4j dependency changes too - see broken > build - https://github.com/apache/drill/pull/2628 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8281) Info schema LIKE with ESCAPE push down bug
[ https://issues.apache.org/jira/browse/DRILL-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582725#comment-17582725 ] ASF GitHub Bot commented on DRILL-8281: --- jnturton merged PR #2627: URL: https://github.com/apache/drill/pull/2627 > Info schema LIKE with ESCAPE push down bug > -- > > Key: DRILL-8281 > URL: https://issues.apache.org/jira/browse/DRILL-8281 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Information Schema >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > > DRILL-8057 brought in a regression whereby info schema LIKE patterns > containing an escape character are not correctly processed. For example if a > storage plugin called dfs_foo (note the presence of the special '_') is > present then the following query wrongly returns no records. > {code:java} > apache drill> show databases where schema_name like 'dfs^_foo.%' escape '^'; > No rows selected (2.305 seconds){code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8281) Info schema LIKE with ESCAPE push down bug
[ https://issues.apache.org/jira/browse/DRILL-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581968#comment-17581968 ] ASF GitHub Bot commented on DRILL-8281: --- jnturton opened a new pull request, #2627: URL: https://github.com/apache/drill/pull/2627 # [DRILL-8281](https://issues.apache.org/jira/browse/DRILL-8281): Info schema LIKE with ESCAPE push down bug ## Description [DRILL-8057](https://issues.apache.org/jira/browse/DRILL-8057) brought in a regression whereby info schema LIKE patterns containing an escape character are not correctly processed. For example if a storage plugin called dfs_foo (note the presence of the special '_') is present then the following query wrongly returns no records. ``` apache drill> show databases where schema_name like 'dfs^_foo.%' escape '^'; No rows selected (2.305 seconds) ``` This PR makes schema path prefix comparison use RegexpUtil.SqlPatternInfo#getSimplePatternString when comparing prefixes so that escape characters are correctly processed. ## Documentation N/A ## Testing TestInfoSchema#likePatternWithEscapeChar > Info schema LIKE with ESCAPE push down bug > -- > > Key: DRILL-8281 > URL: https://issues.apache.org/jira/browse/DRILL-8281 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Information Schema >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > > DRILL-8057 brought in a regression whereby info schema LIKE patterns > containing an escape character are not correctly processed. For example if a > storage plugin called dfs_foo (note the presence of the special '_') is > present then the following query wrongly returns no records. > {code:java} > apache drill> show databases where schema_name like 'dfs^_foo.%' escape '^'; > No rows selected (2.305 seconds){code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
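The escape-aware processing that this fix restores can be illustrated with a small, self-contained converter. This is a hypothetical helper written for illustration, not Drill's RegexpUtil.SqlPatternInfo#getSimplePatternString: it expands a SQL LIKE pattern plus ESCAPE character into a Java regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not Drill's RegexpUtil): convert a SQL LIKE pattern
// with an ESCAPE character into an equivalent Java regular expression.
public class LikeToRegex {
  public static String convert(String likePattern, char escape) {
    StringBuilder regex = new StringBuilder();
    for (int i = 0; i < likePattern.length(); i++) {
      char c = likePattern.charAt(i);
      if (c == escape && i + 1 < likePattern.length()) {
        // Escaped wildcard: match the next character literally.
        regex.append(Pattern.quote(String.valueOf(likePattern.charAt(++i))));
      } else if (c == '%') {
        regex.append(".*"); // LIKE % matches any sequence of characters
      } else if (c == '_') {
        regex.append(".");  // LIKE _ matches any single character
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return regex.toString();
  }
}
```

With this treatment, 'dfs^_foo.%' with escape '^' matches schema names starting with the literal prefix dfs_foo but not dfsXfoo, which is the distinction the regression lost.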
[jira] [Commented] (DRILL-8280) Cannot ANALYZE files containing non-ASCII column names
[ https://issues.apache.org/jira/browse/DRILL-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580927#comment-17580927 ] ASF GitHub Bot commented on DRILL-8280: --- cgivre merged PR #2625: URL: https://github.com/apache/drill/pull/2625 > Cannot ANALYZE files containing non-ASCII column names > --- > > Key: DRILL-8280 > URL: https://issues.apache.org/jira/browse/DRILL-8280 > Project: Apache Drill > Issue Type: Bug > Components: Metadata >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > Attachments: 0_0_0.parquet > > > The attached Parquet file contains a single column named "Käse". If it is > saved under /tmp/utf8_col and then the Drill command > {code:java} > analyze table dfs.tmp.utf8_col columns none refresh metadata;{code} > is run then the following error is raised during the execution of the > merge_schema function. > {code:java} > com.fasterxml.jackson.databind.JsonMappingException: Unrecognized character > escape 'x' (code 120) > at [Source: > (String)"{"type":"tuple_schema","columns":[{"name":"K\xC3\xA4se","type":"VARCHAR","mode":"REQUIRED"}]}"; > line: 1, column: 47]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8280) Cannot ANALYZE files containing non-ASCII column names
[ https://issues.apache.org/jira/browse/DRILL-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580817#comment-17580817 ] ASF GitHub Bot commented on DRILL-8280: --- jnturton opened a new pull request, #2625: URL: https://github.com/apache/drill/pull/2625 # [DRILL-8280](https://issues.apache.org/jira/browse/DRILL-8280): Cannot ANALYZE files containing non-ASCII column names ## Description The merge_schema function in SchemaFunctions is modified to use UTF-8 string parsing so that a column with a name like "Käse" will no longer crash ANALYZE TABLE REFRESH METADATA. ## Documentation N/A ## Testing TestMetastoreCommands#testNonAsciiColumnName > Cannot ANALYZE files containing non-ASCII column names > --- > > Key: DRILL-8280 > URL: https://issues.apache.org/jira/browse/DRILL-8280 > Project: Apache Drill > Issue Type: Bug > Components: Metadata >Affects Versions: 1.20.2 >Reporter: James Turton >Assignee: James Turton >Priority: Minor > Fix For: 1.20.3 > > Attachments: 0_0_0.parquet > > > The attached Parquet file contains a single column named "Käse". If it is > saved under /tmp/utf8_col and then the Drill command > {code:java} > analyze table dfs.tmp.utf8_col columns none refresh metadata;{code} > is run then the following error is raised during the execution of the > merge_schema function. > {code:java} > com.fasterxml.jackson.databind.JsonMappingException: Unrecognized character > escape 'x' (code 120) > at [Source: > (String)"{"type":"tuple_schema","columns":[{"name":"K\xC3\xA4se","type":"VARCHAR","mode":"REQUIRED"}]}"; > line: 1, column: 47]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
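The error arises because the column name was serialized with C-style byte escapes (\xC3\xA4 are the UTF-8 bytes of 'ä'), which Jackson rejects: JSON allows only \uXXXX escapes or raw UTF-8 text. A small illustration of the two valid encodings, using hypothetical helper names rather than Drill's SchemaFunctions code:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the DRILL-8280 failure mode. A name like
// "Käse" must travel as raw UTF-8 or as JSON \uXXXX escapes; \xC3\xA4-style
// byte escapes are not valid JSON and trigger
// "Unrecognized character escape 'x'" in Jackson.
public class Utf8ColumnName {

  // Valid JSON escaping for non-ASCII characters: \u followed by four hex digits.
  public static String toJsonName(String name) {
    StringBuilder sb = new StringBuilder();
    for (char c : name.toCharArray()) {
      if (c < 0x80) {
        sb.append(c);
      } else {
        sb.append(String.format("\\u%04x", (int) c));
      }
    }
    return sb.toString();
  }

  // Round-tripping through UTF-8 bytes preserves the name intact.
  public static String utf8RoundTrip(String name) {
    return new String(name.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
  }
}
```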
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580353#comment-17580353 ] ASF GitHub Bot commented on DRILL-8279: --- vvysotskyi merged PR #2624: URL: https://github.com/apache/drill/pull/2624 > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-7916) Support new plugin installation on the running system
[ https://issues.apache.org/jira/browse/DRILL-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580338#comment-17580338 ] ASF GitHub Bot commented on DRILL-7916: --- luocooong closed pull request #2215: DRILL-7916: Support new plugin installation on the running system URL: https://github.com/apache/drill/pull/2215 > Support new plugin installation on the running system > - > > Key: DRILL-7916 > URL: https://issues.apache.org/jira/browse/DRILL-7916 > Project: Apache Drill > Issue Type: New Feature >Reporter: Cong Luo >Assignee: Cong Luo >Priority: Major > > Drill does not support the new plugin installation on the running system : > # Boot the Drill. > # Load plugins to the persistent storage : `pluginStore`. > ## Upgrade the plugin if the override file exist > (storage-plugins-override.conf). (Done) > ## Check and add new plugin with the new release. (To-do) > ## If 1 and 2 are not true, then initial all the plugins via loading > bootstrap configuration. (Done) > # End the Boot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580228#comment-17580228 ] ASF GitHub Bot commented on DRILL-8279: --- vvysotskyi commented on PR #2622: URL: https://github.com/apache/drill/pull/2622#issuecomment-1216459627 @luocooong, sorry, I didn't mean to sound overbearing; I just wanted to express my thoughts on why I didn't start a discussion on the mailing list. If you have ideas on how to avoid this classpath issue while still using phoenix-queryserver-client, feel free to suggest them or create a pull request with the changes. > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580225#comment-17580225 ] ASF GitHub Bot commented on DRILL-8279: --- vvysotskyi opened a new pull request, #2624: URL: https://github.com/apache/drill/pull/2624 # [DRILL-8279](https://issues.apache.org/jira/browse/DRILL-8279): Rename skip tests property to match maven-surefire property name ## Description Renamed the property for skipping tests since they were still running even with the -DskipTests flag. ## Documentation NA ## Testing Now the checkstyle job also skips the Phoenix tests. > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580208#comment-17580208 ] ASF GitHub Bot commented on DRILL-8279: --- luocooong commented on PR #2622: URL: https://github.com/apache/drill/pull/2622#issuecomment-1216419739 Sorry, I cannot accept the overbearing statement above. I don't want to argue about this pull request any more; since it was actually merged, the final say is yours. > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580202#comment-17580202 ] ASF GitHub Bot commented on DRILL-8279: --- vvysotskyi commented on PR #2622: URL: https://github.com/apache/drill/pull/2622#issuecomment-1216406330 @luocooong, at least I know several people who were already affected by this classpath issue, which was the reason for fixing it quickly. The Jira ticket was created `14 Aug 14:36 EEST`, the pull request was opened `14 Aug 17:53 EEST` and merged `16 Aug 10:37 EEST`, so there should have been enough time to participate in the discussion or request changes. I didn't see a reason to start a discussion on the mailing list for it, since the plugin wasn't deleted; it is still functioning and has even improved the developer experience by removing extra steps to run unit tests, enabling them in CI, and removing dependencies on custom repositories. > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580189#comment-17580189 ] ASF GitHub Bot commented on DRILL-8279: --- luocooong commented on PR #2622: URL: https://github.com/apache/drill/pull/2622#issuecomment-1216372092 Today was a real disappointment. As Java developers, we can solve any classpath conflict one way or another, but we chose one of the worst outcomes. Why are we in such a hurry? And I didn't see this pull request opened for discussion before it was submitted... What does this mean for contributors? > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580184#comment-17580184 ] ASF GitHub Bot commented on DRILL-8279: --- vvysotskyi commented on PR #2622: URL: https://github.com/apache/drill/pull/2622#issuecomment-1216365721 @luocooong, Drill plugins are still pluggable, so you can provide your own implementations if you need to. The official Phoenix connectors [1] for big data tools like Spark, Hive and Pig also use the thick client, so this decision should be production-suitable. By the way, I didn't find any official connectors that use the thin client in that repository. Ideally, if the Phoenix thin client shades some libraries, it should also relocate them to avoid such issues. I don't see any other correct way of resolving this classpath conflict. Creating a dedicated module and repackaging Phoenix there when building Drill is overhead, and doesn't guarantee that nothing would be broken. Hosting a custom repo that provides relocated classes is also not a good decision, since it would make supporting new Phoenix versions more complex. [1] https://github.com/apache/phoenix-connectors > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
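The relocation the comment above asks for (shading a dependency and moving it to a new package so it cannot collide with another copy on the classpath) is expressed in Maven roughly as follows. This is an illustrative maven-shade-plugin fragment, not taken from the Phoenix build:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Move the bundled Avatica classes out of their original
               package so they cannot shadow another Avatica on the
               classpath, e.g. the one Drill itself uses. -->
          <relocation>
            <pattern>org.apache.calcite.avatica</pattern>
            <shadedPattern>org.apache.phoenix.shaded.org.apache.calcite.avatica</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Shading without such a relocation leaves the bundled classes in their original package, which is exactly the conflict DRILL-8279 describes.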
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580178#comment-17580178 ] ASF GitHub Bot commented on DRILL-8279: --- jnturton commented on PR #2622: URL: https://github.com/apache/drill/pull/2622#issuecomment-1216343681 @luocooong let's see if we can offer storage plugins for both thick and thin drivers then? The classpath conflict bug in the thin driver is very serious, even if you have not yet been affected. It was important that some action was taken immediately. Nothing has been released yet so we still have time to come up with a path forward that works for everyone. Btw, I thought I'd requested your review on this PR as it was opened, but looking at the history I see I must have failed to use the GH mobile app correctly, so I do apologise if I only managed to bring this to your attention relatively late. > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580142#comment-17580142 ] ASF GitHub Bot commented on DRILL-8279: --- luocooong commented on PR #2622: URL: https://github.com/apache/drill/pull/2622#issuecomment-1216295980 **_Holy cow, that's unbelievable!_** Why did this get approved without production experience? Why do we need to force the use of the fat client? There are ways to resolve package conflicts, but this pull request effectively forces me to give up using Drill to query Phoenix! > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8279) Use thick Phoenix driver
[ https://issues.apache.org/jira/browse/DRILL-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580109#comment-17580109 ] ASF GitHub Bot commented on DRILL-8279: --- vvysotskyi merged PR #2622: URL: https://github.com/apache/drill/pull/2622 > Use thick Phoenix driver > > > Key: DRILL-8279 > URL: https://issues.apache.org/jira/browse/DRILL-8279 > Project: Apache Drill > Issue Type: Bug >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Blocker > > phoenix-queryserver-client shades Avatica classes, so it causes issues when > starting Drill and shaded class from phoenix jars is loaded before, so Drill > wouldn't be able to start correctly. > To avoid that, phoenix thick client can be used, it also will improve query > performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)