RE: CSV with windows carriage return causes issues

rahul challapalli Wed, 30 Sep 2015 08:48:41 -0700

Looks like a bug to me. Can you raise a jira for this if you haven't done
it already
On Sep 30, 2015 8:04 AM, <[email protected]> wrote:


> I've seen that issue too... ;)
>
> My personal opinion is that Drill (and sqlline) should treat Windows
> end-of-line characters the same as Unix end-of-line characters.  It doesn't
> seem reasonable to expect users (especially in an enterprise setting) to
> use dos2unix on every data file just so they can trust their query results.
>
>
> Phil
>
>
> Philip A Grim II
> Chief Engineer
> L-3 Data Tactics
> 7901 Jones Branch Dr.
> Suite 700
> McLean, VA  22102
>
>
> ________________________________________
> From: Christopher Matta [[email protected]]
> Sent: Wednesday, September 30, 2015 10:56 AM
> To: [email protected]
> Subject: CSV with windows carriage return causes issues
>
> I’ve created a very simple reproduction of an issue I’ve observed with
> files that have a carriage return (\r) instead of a line feed (\n) ending.
>
> My CSV file was created using notepad on Windows and looks like this when
> queried directly from drill:
>
> 0: jdbc:drill:zk=local> select * from
> dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> +-----------------------------------------------+
> |                    columns                    |
> +-----------------------------------------------+
> | ["11111","test1","test2","test3","test4\r"]   |
> | ["22222","test5","test6","test7","test8\r"]   |
> | ["33333","test9","test10","test11","test12"]  |
> +-----------------------------------------------+
>
> As you can see the first two rows have \r at the end, also note that
> column[0] has five digits.
>
> When casting into their own columns the a column gets a digit truncated:
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
> cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
> as c, cast(columns[3] as varchar(32)) as d, cast(columns[4] as
> varchar(32)) as e from
> dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> +--------+--------+---------+---------+---------+
> |   a    |   b    |    c    |    d    |    e    |
> +--------+--------+---------+---------+---------+
>   |1111  | test1  | test2   | test3   | test4
>   |2222  | test5  | test6   | test7   | test8
> | 33333  | test9  | test10  | test11  | test12  |
> +--------+--------+---------+---------+---------+
>
> I can get around this by using regexp_replace on the last column:
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
> cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
> as c, cast(columns[3] as varchar(32)) as d,
> cast(regexp_replace(columns[4], '\r', '') as varchar(32)) as e from
> dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> +--------+--------+---------+---------+---------+
> |   a    |   b    |    c    |    d    |    e    |
> +--------+--------+---------+---------+---------+
> | 11111  | test1  | test2   | test3   | test4   |
> | 22222  | test5  | test6   | test7   | test8   |
> | 33333  | test9  | test10  | test11  | test12  |
> +--------+--------+---------+---------+---------+
>
> Is this expected, or should Drill treat carriage returns as line feeds? Is
> this simply sqlline interpreting the \r character?
> Chris [email protected]
> 215-701-3146
> 
>

RE: CSV with windows carriage return causes issues

Reply via email to