[jira] [Created] (DRILL-4898) wrong results : Query on directory containing CSV data

2016-09-21 Thread Khurram Faraaz (JIRA)
Khurram Faraaz created DRILL-4898:
-

 Summary: wrong results : Query on directory containing CSV data
 Key: DRILL-4898
 URL: https://issues.apache.org/jira/browse/DRILL-4898
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Flow
Affects Versions: 1.9.0
 Environment: 4 node cluster
Reporter: Khurram Faraaz


Incorrect results: query on a directory containing CSV data.
The directory holds 4534327 rows (~4.5M records) across 6 CSV files.
Drill 1.9.0 commit ID: f3c26e34
Data is available here - /home/MAPRTECH/qa/drill/uber_trip_data

Note that data in columns[3] has the value "B02512\r" in query results.
{noformat} 
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
+--+
| columns  |
+--+
| ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"]  |
| ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"]   |
| ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"]  |
| ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"]  |
| ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"]  |
+--+
5 rows selected (0.184 seconds)
{noformat}

But when we do a select on columns[3] we see a different value in the query 
result.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
+--+
|  EXPR$0  |
+--+
  |02512
  |02512
  |02512
  |02512
  |02512
+--+
5 rows selected (0.159 seconds)  
{noformat}

Searching for 'B02512' returns no rows, whereas it should have returned data.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where 
columns[3]='B02512';
+--+
| columns  |
+--+
+--+
No rows selected (1.707 seconds)
{noformat}
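
One plausible reading of the output above, offered as an assumption rather than a confirmed diagnosis: the CSV files use CRLF line endings and only the LF is consumed as the record delimiter, so the last field keeps a trailing carriage return. That would account for both symptoms. A terminal renders "\r" by moving the cursor back to the start of the line, so the closing border overwrites the leading characters, and an equality test against 'B02512' fails because the stored value is one character longer. The Java sketch below reproduces the comparison outside Drill; the class and method names are hypothetical.

```java
final class TrailingCrDemo {
    // Removes a single trailing carriage return, if present.
    static String stripTrailingCr(String field) {
        if (field.endsWith("\r")) {
            return field.substring(0, field.length() - 1);
        }
        return field;
    }

    public static void main(String[] args) {
        String field = "B02512\r"; // value as shown in the SELECT * output
        System.out.println(field.equals("B02512"));                  // false
        System.out.println(field.length());                          // 7
        System.out.println(stripTrailingCr(field).equals("B02512")); // true
    }
}
```

If this reading holds, converting the files to LF line endings or adjusting the line delimiter of the text format would be the places to look; neither has been verified against this data set.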




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4897) NumberFormatException in Drill SQL while casting to BIGINT when its actually a number

2016-09-21 Thread Srihari Karanth (JIRA)
Srihari Karanth created DRILL-4897:
--

 Summary: NumberFormatException in Drill SQL while casting to 
BIGINT when its actually a number
 Key: DRILL-4897
 URL: https://issues.apache.org/jira/browse/DRILL-4897
 Project: Apache Drill
  Issue Type: Bug
  Components: Functions - Drill
Reporter: Srihari Karanth
Priority: Blocker


In the following SQL, Drill fails when trying to convert a number that is 
stored as a VARCHAR:

   select cast (case IsNumeric(Delta_Radio_Delay)  
when 0 then 0 else Delta_Radio_Delay end as BIGINT) 
from datasource.`./sometable` 
where Delta_Radio_Delay='4294967294';

BIGINT should be able to hold a very large number. I don't understand why it 
throws the error below:

0: jdbc:drill:> select cast (case IsNumeric(Delta_Radio_Delay)  
when 0 then 0 else Delta_Radio_Delay end as BIGINT) 
from datasource.`./sometable` 
where Delta_Radio_Delay='4294967294';

Error: SYSTEM ERROR: NumberFormatException: 4294967294
Fragment 1:29
[Error Id: a63bb113-271f-4d8b-8194-2c9728543200 on cluster-3:31010] 
(state=,code=0)


How can I modify the SQL to fix this?
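
The value in the error message, 4294967294, is larger than the maximum 32-bit signed integer (2147483647) but well within the 64-bit range. One hypothesis, not confirmed in this report, is that the CASE expression mixes the integer literal 0 with a VARCHAR column, so the VARCHAR branch is converted to a 32-bit integer before the outer cast to BIGINT is applied. The Java sketch below only illustrates the range difference; it does not reproduce Drill's code path, and the class name is hypothetical.

```java
final class IntRangeDemo {
    // Returns true when the text parses as a 32-bit signed integer.
    static boolean fitsInInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "4294967294";
        System.out.println(fitsInInt(value));      // false: exceeds Integer.MAX_VALUE
        System.out.println(Long.parseLong(value)); // 4294967294
    }
}
```

Under that hypothesis, giving both CASE branches the same type, for example by casting Delta_Radio_Delay to BIGINT inside the ELSE branch, would be worth trying.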





[jira] [Updated] (DRILL-4866) Provide TABLE and PARTITION information in INFORMATION_SCHEMA for parquet tables created by Drill

2016-09-21 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-4866:

Assignee: Serhii Harnyk  (was: Zelaine Fong)

> Provide TABLE and PARTITION information in INFORMATION_SCHEMA for parquet 
> tables created by Drill
> -
>
> Key: DRILL-4866
> URL: https://issues.apache.org/jira/browse/DRILL-4866
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Metadata, Storage - Parquet
>Reporter: Andries Engelbrecht
>Assignee: Serhii Harnyk
>
> Provide the Table and Partition information on parquet tables created by 
> Drill in INFORMATION_SCHEMA. This can be utilized by tools and users looking 
> to optimize Drill queries by referencing the table and partition metadata 
> from within Drill, as opposed to querying the parquet metadata underneath.
> Potentially extend INFORMATION_SCHEMA with an additional PARTITIONS table 
> similar to MySQL to provide information on column(s) used for partitioning.





[jira] [Commented] (DRILL-4866) Provide TABLE and PARTITION information in INFORMATION_SCHEMA for parquet tables created by Drill

2016-09-21 Thread Arina Ielchiieva (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510388#comment-15510388
 ] 

Arina Ielchiieva commented on DRILL-4866:
-

Might be related to the idea of having a .drill file: 
https://issues.apache.org/jira/browse/DRILL-3572






[jira] [Commented] (DRILL-4842) SELECT * on JSON data results in NumberFormatException

2016-09-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510243#comment-15510243
 ] 

ASF GitHub Bot commented on DRILL-4842:
---

GitHub user Serhii-Harnyk opened a pull request:

https://github.com/apache/drill/pull/594

DRILL-4842: SELECT * on JSON data results in NumberFormatException



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Serhii-Harnyk/drill DRILL-4842

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/594.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #594


commit 190a69a65ad8c144b164c2acacf9718a0ecb3768
Author: Serhii-Harnyk 
Date:   2016-09-08T18:11:37Z

DRILL-4842: SELECT * on JSON data results in NumberFormatException




> SELECT * on JSON data results in NumberFormatException
> --
>
> Key: DRILL-4842
> URL: https://issues.apache.org/jira/browse/DRILL-4842
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.2.0
>Reporter: Khurram Faraaz
>Assignee: Serhii Harnyk
> Attachments: tooManyNulls.json
>
>
> Note that doing SELECT c1 returns correct results, the failure is seen when 
> we do SELECT star. json.all_text_mode was set to true.
> JSON file tooManyNulls.json has one key c1 with 4096 nulls as its value and 
> the 4097th key c1 has the value "Hello World"
> git commit ID : aaf220ff
> MapR Drill 1.8.0 RPM
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> alter session set 
> `store.json.all_text_mode`=true;
> +---++
> |  ok   |  summary   |
> +---++
> | true  | store.json.all_text_mode updated.  |
> +---++
> 1 row selected (0.27 seconds)
> 0: jdbc:drill:schema=dfs.tmp> SELECT c1 FROM `tooManyNulls.json` WHERE c1 IN 
> ('Hello World');
> +--+
> |  c1  |
> +--+
> | Hello World  |
> +--+
> 1 row selected (0.243 seconds)
> 0: jdbc:drill:schema=dfs.tmp> select * FROM `tooManyNulls.json` WHERE c1 IN 
> ('Hello World');
> Error: SYSTEM ERROR: NumberFormatException: Hello World
> Fragment 0:0
> [Error Id: 9cafb3f9-3d5c-478a-b55c-900602b8765e on centos-01.qa.lab:31010]
>  (java.lang.NumberFormatException) Hello World
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95
> 
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varTypesToInt():120
> org.apache.drill.exec.test.generated.FiltererGen1169.doSetup():45
> org.apache.drill.exec.test.generated.FiltererGen1169.setup():54
> 
> org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer():195
> 
> org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema():107
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():78
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> 
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> 
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():415
> org.apache.hadoop.security.UserGroupInformation.doAs():1595
> 

[jira] [Commented] (DRILL-4899) Hive Plugin goes to disabled status with restart of Drill and ZK

2016-09-21 Thread Andries Engelbrecht (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511572#comment-15511572
 ] 

Andries Engelbrecht commented on DRILL-4899:


In this case the Hive Plugin config details are retained, but the plugin itself 
is disabled on startup although it was enabled before shutdown.

> Hive Plugin goes to disabled status with restart of Drill and ZK
> 
>
> Key: DRILL-4899
> URL: https://issues.apache.org/jira/browse/DRILL-4899
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Hive
>Affects Versions: 1.8.0
>Reporter: Andries Engelbrecht
>
> When restarting ZK and Drill the Hive storage plugin is disabled by default 
> and requires manual steps to enable. 





[jira] [Created] (DRILL-4900) Query across Sybase and Oracle plugins is dropping WHERE clause

2016-09-21 Thread Robert DeVito (JIRA)
Robert DeVito created DRILL-4900:


 Summary: Query across Sybase and Oracle plugins is dropping WHERE 
clause
 Key: DRILL-4900
 URL: https://issues.apache.org/jira/browse/DRILL-4900
 Project: Apache Drill
  Issue Type: Bug
  Components: Client - JDBC, Storage - JDBC
Affects Versions: 1.6.0
 Environment: Windows client. Sybase and Oracle hosts on unix
Reporter: Robert DeVito


We have tried several approaches to joining simple queries across Oracle and 
Sybase. In every case, each side has a WHERE clause selective enough to limit 
the data substantially, yet each time the Drill execution plan skips the WHERE 
clause on one side.
Example:
select a.f, b.b
from
(
  select * from pl1.`owner`.`dbo`.`VIEW1` d
  where d.fid = '300769'
  and d.PDate = ('2013-10-31')
) a,
(
  select * from pl2.owner.VIEW2 v
  where v.f = '300769'
  and v.d = 'M'
  and v.b IN ('UK221','UK222','UK223','UK224','UK225','UK227','08843','BU5552','BU5543','BU5544')
  and v.dk = '20131031'
) b
where a.f = b.f
and a.S = b.S

Please ignore the obfuscated column names. The syntax is valid, but Drill keeps 
sending SELECT statements with no WHERE clause for one subquery or the other. 
We can't understand why, or how to control it. This is make-or-break for us.

Thanks





[jira] [Commented] (DRILL-4899) Hive Plugin goes to disabled status with restart of Drill and ZK

2016-09-21 Thread Zelaine Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511421#comment-15511421
 ] 

Zelaine Fong commented on DRILL-4899:
-

Related to DRILL-4879?






[jira] [Updated] (DRILL-4898) wrong results : Query on directory containing CSV data

2016-09-21 Thread Khurram Faraaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khurram Faraaz updated DRILL-4898:
--
Description: 
Incorrect results: query on a directory containing CSV data.
The directory holds 4534327 rows (~4.5M records) across 6 CSV files.
Drill 1.9.0 commit ID: f3c26e34
I can share the data to reproduce the issue.

Note that data in columns[3] has the value "B02512\r" in query results.
{noformat} 
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
+--+
| columns  |
+--+
| ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"]  |
| ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"]   |
| ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"]  |
| ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"]  |
| ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"]  |
+--+
5 rows selected (0.184 seconds)
{noformat}

But when we do a select on columns[3] we see a different value in the query 
result.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
+--+
|  EXPR$0  |
+--+
  |02512
  |02512
  |02512
  |02512
  |02512
+--+
5 rows selected (0.159 seconds)  
{noformat}

Searching for 'B02512' returns no rows, whereas it should have returned data.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where 
columns[3]='B02512';
+--+
| columns  |
+--+
+--+
No rows selected (1.707 seconds)
{noformat}


  was:
Incorrect results: query on a directory containing CSV data.
The directory holds 4534327 rows (~4.5M records) across 6 CSV files.
Drill 1.9.0 commit ID: f3c26e34
Data is available here - /home/MAPRTECH/qa/drill/uber_trip_data

Note that data in columns[3] has the value "B02512\r" in query results.
{noformat} 
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
+--+
| columns  |
+--+
| ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"]  |
| ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"]   |
| ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"]  |
| ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"]  |
| ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"]  |
+--+
5 rows selected (0.184 seconds)
{noformat}

But when we do a select on columns[3] we see a different value in the query 
result.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
+--+
|  EXPR$0  |
+--+
  |02512
  |02512
  |02512
  |02512
  |02512
+--+
5 rows selected (0.159 seconds)  
{noformat}

Searching for 'B02512' returns no rows, whereas it should have returned data.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where 
columns[3]='B02512';
+--+
| columns  |
+--+
+--+
No rows selected (1.707 seconds)
{noformat}




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512183#comment-15512183
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
Looks like you are right; the JsonParser is more than a simple tokenizer.

We're not the first to try this: 
http://stackoverflow.com/questions/37511496/recover-from-malformed-json-with-jackson
 (no answer)

I tried an experiment and found that you are on the right track: the way 
you are using the JsonParser can be extended to ignore input until the start of 
the next object. A quick demonstration:

private static void recover(JsonParser parser) throws IOException {
  for ( ; ; ) {
    JsonToken token;
    try {
      token = parser.nextToken();
    } catch( JsonParseException e ) { continue; }  // ignore malformed input
    if ( token == null ) return;                   // end of input
    if ( token != JsonToken.END_OBJECT ) { continue; }
    // Found "}"; check whether the next token opens a new object.
    token = parser.nextToken();
    if ( token == null ) return;
    if ( token == JsonToken.START_OBJECT ) { return; }
  }
}

Basically, we keep reading tokens, and ignoring errors, until we 
successfully find the } { pair.
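
The scanning logic can be exercised without Jackson. The sketch below is a simplified stand-in, not the code from the pull request: a hypothetical Tok enum replaces JsonToken, and a BAD token plays the role of a read that would throw JsonParseException, so it is skipped like any other non-matching token.

```java
import java.util.Arrays;
import java.util.Iterator;

final class RecoverSketch {
    // Hypothetical token kinds; BAD models input the real parser would reject.
    enum Tok { START_OBJECT, END_OBJECT, FIELD, VALUE, BAD }

    // Skips tokens until an END_OBJECT is immediately followed by START_OBJECT.
    // Returns true when positioned just after that START_OBJECT,
    // false when the input ends first.
    static boolean recover(Iterator<Tok> tokens) {
        while (tokens.hasNext()) {
            Tok t = tokens.next();
            if (t != Tok.END_OBJECT) continue;
            if (!tokens.hasNext()) return false;
            if (tokens.next() == Tok.START_OBJECT) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Iterator<Tok> it = Arrays.asList(
            Tok.FIELD, Tok.BAD, Tok.VALUE, Tok.END_OBJECT,
            Tok.START_OBJECT, Tok.FIELD, Tok.VALUE, Tok.END_OBJECT).iterator();
        System.out.println(recover(it)); // true
        System.out.println(it.next());   // FIELD, the first token of the next record
    }
}
```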

As we discussed before, to use the above in Drill, we have to discard the 
partly-built record and start reading the next record, assuming the parser is 
positioned **after** the START_OBJECT ("{") token, which we've already 
consumed. That should be simple.

Still, to do proper recovery, we have to discard the partly-built JSON 
record. I've not looked into how to do that. If we don't do that, we return the 
bogus partly-built record. Worse, if we recover by trying to build a new 
record, we create more partly-built records, but with a different schema, 
possibly triggering a schema change event when not really necessary.

Any ideas for how to solve that problem?
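
One possible direction, sketched under assumptions and not taken from the Drill code base: stage each record's fields in a scratch buffer and publish them only when the closing END_OBJECT is reached, so that a parse error simply clears the scratch buffer. The class and method names below are hypothetical, and a real implementation would have to work against Drill's value vectors rather than plain maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StagedRecordWriter {
    private final List<Map<String, String>> committed = new ArrayList<>();
    private Map<String, String> staged = new LinkedHashMap<>();

    // Buffers a field of the record currently being read.
    void field(String name, String value) {
        staged.put(name, value);
    }

    // Called on END_OBJECT: the record is complete, so publish it.
    void commit() {
        committed.add(staged);
        staged = new LinkedHashMap<>();
    }

    // Called when the parser reports an error: drop the partial record.
    void discard() {
        staged = new LinkedHashMap<>();
    }

    List<Map<String, String>> records() {
        return committed;
    }

    public static void main(String[] args) {
        StagedRecordWriter w = new StagedRecordWriter();
        w.field("a", "1"); w.commit();
        w.field("a", "2"); w.field("zz", "oops"); w.discard(); // malformed record
        w.field("a", "3"); w.commit();
        System.out.println(w.records()); // [{a=1}, {a=3}]
    }
}
```

Because the discarded fields never reach the committed output, the stray column zz does not alter the output schema in this sketch.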



> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently a Drill query terminates upon first encountering an invalid JSON line.
> Drill should continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.





[jira] [Created] (DRILL-4899) Hive Plugin goes to disabled status with restart of Drill and ZK

2016-09-21 Thread Andries Engelbrecht (JIRA)
Andries Engelbrecht created DRILL-4899:
--

 Summary: Hive Plugin goes to disabled status with restart of Drill 
and ZK
 Key: DRILL-4899
 URL: https://issues.apache.org/jira/browse/DRILL-4899
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Hive
Affects Versions: 1.8.0
Reporter: Andries Engelbrecht


When restarting ZK and Drill the Hive storage plugin is disabled by default and 
requires manual steps to enable. 




