[jira] [Created] (DRILL-4898) wrong results : Query on directory containing CSV data
Khurram Faraaz created DRILL-4898:

Summary: wrong results : Query on directory containing CSV data
Key: DRILL-4898
URL: https://issues.apache.org/jira/browse/DRILL-4898
Project: Apache Drill
Issue Type: Bug
Components: Execution - Flow
Affects Versions: 1.9.0
Environment: 4 node cluster
Reporter: Khurram Faraaz

incorrect results : Query on directory containing CSV data
directory has 4534327 number of rows ~ 4.5M records (there are 6 CSV files)
Drill 1.9.0 commit ID: f3c26e34
Data is available here - /home/MAPRTECH/qa/drill/uber_trip_data

Note that data in columns[3] has the value "B02512\r" in query results.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
+--+
| columns |
+--+
| ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"] |
| ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"] |
| ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"] |
| ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"] |
| ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"] |
+--+
5 rows selected (0.184 seconds)
{noformat}

But when we do a select on columns[3] we see a different value in the query result.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
+--+
| EXPR$0 |
+--+
|02512
|02512
|02512
|02512
|02512
+--+
5 rows selected (0.159 seconds)
{noformat}

Searching for 'B02512' returns no rows (whereas it should have returned data).

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where columns[3]='B02512';
+--+
| columns |
+--+
+--+
No rows selected (1.707 seconds)
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
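The symptoms in this report are consistent with a carriage return left at the end of the last field of each record, for example when a file with \r\n line endings is split on \n only. That is an assumption, not a confirmed diagnosis of DRILL-4898. The sketch below uses plain Java, not Drill code; the class name, helper names, and the sample file content are hypothetical and only show why an equality test on 'B02512' would miss such values.

```java
public class TrailingCrDemo {
    // Hypothetical helper: split a record on commas (no quoting rules)
    // and return the last field.
    static String lastField(String record) {
        String[] fields = record.split(",", -1);
        return fields[fields.length - 1];
    }

    // Remove a single trailing carriage return, if present.
    static String stripTrailingCr(String value) {
        return value.endsWith("\r")
                ? value.substring(0, value.length() - 1)
                : value;
    }

    public static void main(String[] args) {
        // Assumed input: \r\n line endings, split on \n only,
        // so every record keeps its \r.
        String file = "2014-08-01 00:03:00,40.7366,-73.9906,B02512\r\n"
                    + "2014-08-01 00:09:00,40.726,-73.9918,B02512\r\n";
        for (String record : file.split("\n")) {
            String raw = lastField(record);
            // The raw field is "B02512\r", so the comparison fails.
            System.out.println(raw.equals("B02512"));
            // After stripping the \r the comparison succeeds.
            System.out.println(stripTrailingCr(raw).equals("B02512"));
        }
    }
}
```

A terminal that honors \r moves the cursor back to the start of the line, so the column padding printed after the value may overwrite its first characters. That would be one way to end up with `|02512` in the second query's output, though the report does not confirm it.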
[jira] [Created] (DRILL-4897) NumberFormatException in Drill SQL while casting to BIGINT when its actually a number
Srihari Karanth created DRILL-4897:

Summary: NumberFormatException in Drill SQL while casting to BIGINT when its actually a number
Key: DRILL-4897
URL: https://issues.apache.org/jira/browse/DRILL-4897
Project: Apache Drill
Issue Type: Bug
Components: Functions - Drill
Reporter: Srihari Karanth
Priority: Blocker

In the following SQL, Drill fails when trying to convert a number that is stored as a varchar:

select cast (case IsNumeric(Delta_Radio_Delay) when 0 then 0 else Delta_Radio_Delay end as BIGINT) from datasource.`./sometable` where Delta_Radio_Delay='4294967294';

BIGINT should be able to hold a number this large. I don't understand why it throws the error below:

0: jdbc:drill:> select cast (case IsNumeric(Delta_Radio_Delay) when 0 then 0 else Delta_Radio_Delay end as BIGINT) from datasource.`./sometable` where Delta_Radio_Delay='4294967294';
Error: SYSTEM ERROR: NumberFormatException: 4294967294

Fragment 1:29

[Error Id: a63bb113-271f-4d8b-8194-2c9728543200 on cluster-3:31010] (state=,code=0)

How can I modify the SQL to fix this?
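One possible explanation, offered as an assumption rather than a diagnosis: the CASE expression mixes the integer literal 0 with a varchar column, and if its result type is resolved to a 32-bit INT, the varchar value would be parsed as an int before the outer CAST to BIGINT is applied. The value 4294967294 exceeds the 32-bit signed maximum of 2147483647, so such a parse would fail even though the value fits comfortably in a 64-bit BIGINT. The plain-Java sketch below (class and method names are hypothetical) shows the same difference between 32-bit and 64-bit parsing.

```java
public class BigintParseDemo {
    // True when the text parses as a 32-bit signed integer.
    static boolean fitsInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // True when the text parses as a 64-bit signed integer.
    static boolean fitsLong(String text) {
        try {
            Long.parseLong(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "4294967294";
        // Larger than Integer.MAX_VALUE (2147483647): parseInt throws.
        System.out.println(fitsInt(value));
        // Well within the range of a long: parseLong succeeds.
        System.out.println(fitsLong(value));
    }
}
```

If that assumption holds, a rewrite worth trying is to give both CASE branches the BIGINT type explicitly, for example `when 0 then cast(0 as BIGINT) else cast(Delta_Radio_Delay as BIGINT) end`. This has not been verified against the reporter's data.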
[jira] [Updated] (DRILL-4866) Provide TABLE and PARTITION information in INFORMATION_SCHEMA for parquet tables created by Drill
[ https://issues.apache.org/jira/browse/DRILL-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arina Ielchiieva updated DRILL-4866:

Assignee: Serhii Harnyk (was: Zelaine Fong)

> Provide TABLE and PARTITION information in INFORMATION_SCHEMA for parquet
> tables created by Drill
>
> Key: DRILL-4866
> URL: https://issues.apache.org/jira/browse/DRILL-4866
> Project: Apache Drill
> Issue Type: Improvement
> Components: Metadata, Storage - Parquet
> Reporter: Andries Engelbrecht
> Assignee: Serhii Harnyk
>
> Provide the Table and Partition information on parquet tables created by
> Drill in INFORMATION_SCHEMA. This can be utilized by tools and users looking
> to optimize Drill queries by referencing the table and partition metadata
> from within Drill, as opposed to querying the parquet metadata underneath.
> Potentially extend INFORMATION_SCHEMA with an additional PARTITIONS table
> similar to MySQL to provide information on column(s) used for partitioning.
[jira] [Commented] (DRILL-4866) Provide TABLE and PARTITION information in INFORMATION_SCHEMA for parquet tables created by Drill
[ https://issues.apache.org/jira/browse/DRILL-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510388#comment-15510388 ]

Arina Ielchiieva commented on DRILL-4866:

Might be related to the idea of having a .drill file - https://issues.apache.org/jira/browse/DRILL-3572

> Provide TABLE and PARTITION information in INFORMATION_SCHEMA for parquet
> tables created by Drill
>
> Key: DRILL-4866
> URL: https://issues.apache.org/jira/browse/DRILL-4866
> Project: Apache Drill
> Issue Type: Improvement
> Components: Metadata, Storage - Parquet
> Reporter: Andries Engelbrecht
> Assignee: Serhii Harnyk
>
> Provide the Table and Partition information on parquet tables created by
> Drill in INFORMATION_SCHEMA. This can be utilized by tools and users looking
> to optimize Drill queries by referencing the table and partition metadata
> from within Drill, as opposed to querying the parquet metadata underneath.
> Potentially extend INFORMATION_SCHEMA with an additional PARTITIONS table
> similar to MySQL to provide information on column(s) used for partitioning.
[jira] [Commented] (DRILL-4842) SELECT * on JSON data results in NumberFormatException
[ https://issues.apache.org/jira/browse/DRILL-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510243#comment-15510243 ]

ASF GitHub Bot commented on DRILL-4842:

GitHub user Serhii-Harnyk opened a pull request:

    https://github.com/apache/drill/pull/594

    DRILL-4842: SELECT * on JSON data results in NumberFormatException

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Serhii-Harnyk/drill DRILL-4842

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/594.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #594

commit 190a69a65ad8c144b164c2acacf9718a0ecb3768
Author: Serhii-Harnyk
Date: 2016-09-08T18:11:37Z

    DRILL-4842: SELECT * on JSON data results in NumberFormatException

> SELECT * on JSON data results in NumberFormatException
>
> Key: DRILL-4842
> URL: https://issues.apache.org/jira/browse/DRILL-4842
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.2.0
> Reporter: Khurram Faraaz
> Assignee: Serhii Harnyk
> Attachments: tooManyNulls.json
>
> Note that doing SELECT c1 returns correct results, the failure is seen when
> we do SELECT star. json.all_text_mode was set to true.
> JSON file tooManyNulls.json has one key c1 with 4096 nulls as its value and
> the 4097th key c1 has the value "Hello World"
> git commit ID : aaf220ff
> MapR Drill 1.8.0 RPM
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> alter session set `store.json.all_text_mode`=true;
> +---++
> | ok | summary |
> +---++
> | true | store.json.all_text_mode updated. |
> +---++
> 1 row selected (0.27 seconds)
> 0: jdbc:drill:schema=dfs.tmp> SELECT c1 FROM `tooManyNulls.json` WHERE c1 IN ('Hello World');
> +--+
> | c1 |
> +--+
> | Hello World |
> +--+
> 1 row selected (0.243 seconds)
> 0: jdbc:drill:schema=dfs.tmp> select * FROM `tooManyNulls.json` WHERE c1 IN ('Hello World');
> Error: SYSTEM ERROR: NumberFormatException: Hello World
> Fragment 0:0
> [Error Id: 9cafb3f9-3d5c-478a-b55c-900602b8765e on centos-01.qa.lab:31010]
> (java.lang.NumberFormatException) Hello World
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varTypesToInt():120
> org.apache.drill.exec.test.generated.FiltererGen1169.doSetup():45
> org.apache.drill.exec.test.generated.FiltererGen1169.setup():54
> org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer():195
> org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema():107
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():78
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():415
> org.apache.hadoop.security.UserGroupInformation.doAs():1595
>
[jira] [Commented] (DRILL-4899) Hive Plugin goes to disabled status with restart of Drill and ZK
[ https://issues.apache.org/jira/browse/DRILL-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511572#comment-15511572 ]

Andries Engelbrecht commented on DRILL-4899:

In this case the Hive Plugin config details are retained, but the plugin itself is disabled on startup although it was enabled before shutdown.

> Hive Plugin goes to disabled status with restart of Drill and ZK
>
> Key: DRILL-4899
> URL: https://issues.apache.org/jira/browse/DRILL-4899
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Hive
> Affects Versions: 1.8.0
> Reporter: Andries Engelbrecht
>
> When restarting ZK and Drill the Hive storage plugin is disabled by default
> and requires manual steps to enable.
[jira] [Created] (DRILL-4900) Query across Sybase and Oracle plugins is dropping WHERE clause
Robert DeVito created DRILL-4900:

Summary: Query across Sybase and Oracle plugins is dropping WHERE clause
Key: DRILL-4900
URL: https://issues.apache.org/jira/browse/DRILL-4900
Project: Apache Drill
Issue Type: Bug
Components: Client - JDBC, Storage - JDBC
Affects Versions: 1.6.0
Environment: Windows client. Sybase and Oracle hosts on unix
Reporter: Robert DeVito

We have tried several approaches to joining simple queries across Oracle and Sybase. In all cases, each side has a WHERE clause selective enough to limit the data substantially. Each time, the Drill execution plan skips the WHERE clause on one side.

ex:
select a.f, b.b
from
  ( select * from pl1.`owner`.`dbo`.`VIEW1` d
    where d.fid = '300769'
      and d.PDate = ('2013-10-31') ) a,
  ( select * from pl2.owner.VIEW2 v
    where v.f = '300769'
      and v.d = 'M'
      and v.b IN ('UK221','UK222','UK223','UK224','UK225','UK227','08843','BU5552','BU5543','BU5544')
      and v.dk = '20131031' ) b
where a.f = b.f
  and a.S = b.S

Please ignore the obfuscated column names. The syntax is valid, but Drill keeps sending selects with no WHERE clause for one subquery or the other. We can't understand why, or how to control it.

This is make-or-break for us.

Thanks
[jira] [Commented] (DRILL-4899) Hive Plugin goes to disabled status with restart of Drill and ZK
[ https://issues.apache.org/jira/browse/DRILL-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511421#comment-15511421 ]

Zelaine Fong commented on DRILL-4899:

Related to DRILL-4879?

> Hive Plugin goes to disabled status with restart of Drill and ZK
>
> Key: DRILL-4899
> URL: https://issues.apache.org/jira/browse/DRILL-4899
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Hive
> Affects Versions: 1.8.0
> Reporter: Andries Engelbrecht
>
> When restarting ZK and Drill the Hive storage plugin is disabled by default
> and requires manual steps to enable.
[jira] [Updated] (DRILL-4898) wrong results : Query on directory containing CSV data
[ https://issues.apache.org/jira/browse/DRILL-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Khurram Faraaz updated DRILL-4898:

Description:
incorrect results : Query on directory containing CSV data
directory has 4534327 number of rows ~ 4.5M records (there are 6 CSV files)
Drill 1.9.0 commit ID: f3c26e34
I can share the data to reproduce the issue.

Note that data in columns[3] has the value "B02512\r" in query results.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
+--+
| columns |
+--+
| ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"] |
| ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"] |
| ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"] |
| ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"] |
| ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"] |
+--+
5 rows selected (0.184 seconds)
{noformat}

But when we do a select on columns[3] we see a different value in the query result.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
+--+
| EXPR$0 |
+--+
|02512
|02512
|02512
|02512
|02512
+--+
5 rows selected (0.159 seconds)
{noformat}

Searching for 'B02512' returns no rows (whereas it should have returned data).

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where columns[3]='B02512';
+--+
| columns |
+--+
+--+
No rows selected (1.707 seconds)
{noformat}

was:
incorrect results : Query on directory containing CSV data
directory has 4534327 number of rows ~ 4.5M records (there are 6 CSV files)
Drill 1.9.0 commit ID: f3c26e34
Data is available here - /home/MAPRTECH/qa/drill/uber_trip_data

Note that data in columns[3] has the value "B02512\r" in query results.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
+--+
| columns |
+--+
| ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"] |
| ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"] |
| ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"] |
| ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"] |
| ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"] |
+--+
5 rows selected (0.184 seconds)
{noformat}

But when we do a select on columns[3] we see a different value in the query result.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
+--+
| EXPR$0 |
+--+
|02512
|02512
|02512
|02512
|02512
+--+
5 rows selected (0.159 seconds)
{noformat}

Searching for 'B02512' returns no rows (whereas it should have returned data).

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where columns[3]='B02512';
+--+
| columns |
+--+
+--+
No rows selected (1.707 seconds)
{noformat}

> wrong results : Query on directory containing CSV data
>
> Key: DRILL-4898
> URL: https://issues.apache.org/jira/browse/DRILL-4898
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.9.0
> Environment: 4 node cluster
> Reporter: Khurram Faraaz
>
> incorrect results : Query on directory containing CSV data
> directory has 4534327 number of rows ~ 4.5M records (there are 6 CSV files)
> Drill 1.9.0 commit ID: f3c26e34
> I can share the data to reproduce the issue.
> Note that data in columns[3] has the value "B02512\r" in query results.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
> +--+
> | columns |
> +--+
> | ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"] |
> | ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"] |
> | ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"] |
> | ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"] |
> | ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"] |
> +--+
> 5 rows selected (0.184 seconds)
> {noformat}
> But when we do a select on columns[3] we see a different value in the query
> result.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
> +--+
> | EXPR$0 |
>
[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing
[ https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512183#comment-15512183 ]

ASF GitHub Bot commented on DRILL-4653:

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/518

Looks like you are right; the JsonParser is more than a simple tokenizer. We're not the first to try this: http://stackoverflow.com/questions/37511496/recover-from-malformed-json-with-jackson (no answer)

I tried an experiment and found that you are on the right track: the way you are using the JsonParser can be extended to ignore input until the start of the next object. A quick demonstration:

    private static void recover(JsonParser parser) throws IOException {
      for ( ; ; ) {
        JsonToken token;
        try {
          token = parser.nextToken();
        } catch( JsonParseException e ) {
          // Ignore the parse error and keep scanning.
          continue;
        }
        // End of input: nothing left to recover.
        if ( token == null ) return;
        // Skip everything until the end of the current (broken) object.
        if ( token != JsonToken.END_OBJECT ) {
          continue;
        }
        token = parser.nextToken();
        if ( token == null ) return;
        // Found the } { pair: the parser is now inside the next object.
        if ( token == JsonToken.START_OBJECT ) {
          return;
        }
      }
    }

Basically, we keep reading tokens, and ignoring errors, until we successfully find the } { pair.

As we discussed before, to use the above in Drill, we have to discard the partly-built record and start reading the next record assuming the parser is positioned **after** the START_OBJECT ("{") token, which we've already consumed. That should be simple.

Still, to do proper recovery, we have to discard the partly-built JSON record. I've not looked into how to do that. If we don't do that, we return the bogus partly-built record. Worse, if we recover by trying to build a new record, we create more partly-built records, but with a different schema, possibly triggering a schema change event when not really necessary. Any ideas for how to solve that problem?

> Malformed JSON should not stop the entire query from progressing
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - JSON
> Affects Versions: 1.6.0
> Reporter: subbu srinivasan
> Fix For: Future
>
> Currently a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something
> similar to a setting of (ignore.malformed.json) would help.
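The open question in the comment above is how to avoid returning a partly-built record after a parse error. One generic pattern, sketched below in plain Java, is to build each record in a staging structure and commit it to the output only once it has parsed completely. This is not Drill's JSON reader and not a proposed patch: the class, the exception type, and the toy "k=v;k=v" record format are all hypothetical, chosen so that the example needs no JSON library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StagedRecordDemo {
    // Hypothetical stand-in for a parser failure in the middle of a record.
    static class MalformedRecordException extends Exception {
        MalformedRecordException(String message) {
            super(message);
        }
    }

    // Toy parser for "k1=v1;k2=v2". Fields are added to 'staging' as they
    // are read, so a failure part-way through leaves a partial record.
    static void parseInto(String text, Map<String, String> staging)
            throws MalformedRecordException {
        for (String pair : text.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new MalformedRecordException("bad field: " + pair);
            }
            staging.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
    }

    // Commit a record only after it parsed completely; a record that
    // failed part-way is dropped together with its staging map.
    static List<Map<String, String>> readAll(List<String> inputs) {
        List<Map<String, String>> committed = new ArrayList<>();
        for (String input : inputs) {
            Map<String, String> staging = new LinkedHashMap<>();
            try {
                parseInto(input, staging);
                committed.add(staging);
            } catch (MalformedRecordException e) {
                // 'staging' may hold some fields here; it is never committed.
            }
        }
        return committed;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("a=1;b=2", "a=3;oops", "a=5;b=6");
        // The second record fails after "a=3" was staged and is discarded.
        System.out.println(readAll(inputs)); // [{a=1, b=2}, {a=5, b=6}]
    }
}
```

Because the discarded record never reaches the committed output, its fields cannot influence the schema seen downstream. Whether the same idea maps onto Drill's record-building code is left open here.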
[jira] [Created] (DRILL-4899) Hive Plugin goes to disabled status with restart of Drill and ZK
Andries Engelbrecht created DRILL-4899:

Summary: Hive Plugin goes to disabled status with restart of Drill and ZK
Key: DRILL-4899
URL: https://issues.apache.org/jira/browse/DRILL-4899
Project: Apache Drill
Issue Type: Bug
Components: Storage - Hive
Affects Versions: 1.8.0
Reporter: Andries Engelbrecht

When restarting ZK and Drill the Hive storage plugin is disabled by default and requires manual steps to enable.