[jira] [Commented] (DRILL-4898) wrong results : Query on directory containing CSV data

Khurram Faraaz (JIRA) Thu, 22 Sep 2016 11:36:16 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514112#comment-15514112
 ]


Khurram Faraaz commented on DRILL-4898:
---------------------------------------

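The "\r" in the results suggests the CSV files use Windows-style (\r\n) line endings. Adding "lineDelimiter": "\r\n" to the CSV format definition in the storage plugin configuration resolves the issue.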

With this setting, the query returns correct results in the SQLLine session:
{noformat}
0: jdbc:drill:> select columns[3] from dfs.`/uber/uber_trip_data/` limit 2; 
+---------+
| EXPR$0  |
+---------+
| B02512  |
| B02512  |
+---------+
2 rows selected (0.201 seconds)
{noformat}
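
To illustrate why the delimiter setting matters, here is a minimal Python sketch. It is a toy reader, not Drill's text reader: the function name read_rows and its line_delimiter parameter are hypothetical and only mirror the idea behind the plugin option. The sample rows are taken from the issue description, assuming the files end each line with \r\n.

```python
# Illustrative only: a toy CSV splitter, not Drill's actual text reader.
# Sample rows from the issue description, assuming CRLF line endings.
SAMPLE = (
    "2014-08-01 00:03:00,40.7366,-73.9906,B02512\r\n"
    "2014-08-01 00:09:00,40.726,-73.9918,B02512\r\n"
)


def read_rows(data, line_delimiter="\n"):
    """Split data into records on line_delimiter, then into fields on commas."""
    records = [r for r in data.split(line_delimiter) if r]
    return [r.split(",") for r in records]


# Splitting on "\n" alone leaves the carriage return attached to the last field.
default_rows = read_rows(SAMPLE)

# Splitting on "\r\n" consumes the carriage return together with the newline.
crlf_rows = read_rows(SAMPLE, line_delimiter="\r\n")

print(repr(default_rows[0][3]))        # last field still carries "\r"
print(default_rows[0][3] == "B02512")  # equality filter does not match
print(repr(crlf_rows[0][3]))           # clean value
print(crlf_rows[0][3] == "B02512")     # equality filter matches
```

The trailing carriage return also explains the odd display when selecting columns[3] alone: the terminal moves the cursor back to the start of the line when it prints "\r", so the closing "|" of the table cell overwrites the leading "| B" and the row appears as "  |02512".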

> wrong results : Query on directory containing CSV data
> ------------------------------------------------------
>
>                 Key: DRILL-4898
>                 URL: https://issues.apache.org/jira/browse/DRILL-4898
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.9.0
>         Environment: 4 node cluster
>            Reporter: Khurram Faraaz
>
> Incorrect results when querying a directory containing CSV data.
> The directory contains 6 CSV files with 4534327 rows in total (~4.5M records).
> Drill 1.9.0 commit ID: f3c26e34
> I can share the data to reproduce the issue.
> Note that data in columns[3] has the value "B02512\r" in query results.
> {noformat} 
> 0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
> +----------------------------------------------------------+
> |                         columns                          |
> +----------------------------------------------------------+
> | ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"]  |
> | ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"]   |
> | ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"]  |
> | ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"]  |
> | ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"]  |
> +----------------------------------------------------------+
> 5 rows selected (0.184 seconds)
> {noformat}
> But when we do a select on columns[3] we see a different value in the query 
> result.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
> +----------+
> |  EXPR$0  |
> +----------+
>   |02512
>   |02512
>   |02512
>   |02512
>   |02512
> +----------+
> 5 rows selected (0.159 seconds)  
> {noformat}
> Searching for 'B02512' returns no rows, whereas it should have returned 
> data.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where 
> columns[3]='B02512';
> +----------+
> | columns  |
> +----------+
> +----------+
> No rows selected (1.707 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
