RE: CSV with windows carriage return causes issues

Philip.Grim Wed, 30 Sep 2015 08:06:55 -0700

I've seen that issue too... ;)

My personal opinion is that Drill (and sqlline) should treat Windows 
end-of-line characters the same as Unix end-of-line characters.  It doesn't 
seem reasonable to expect users (especially in an enterprise setting) to use 
dos2unix on every data file just so they can trust their query results.



Phil


Philip A Grim II
Chief Engineer
L-3 Data Tactics
7901 Jones Branch Dr.
Suite 700
McLean, VA  22102


________________________________________
From: Christopher Matta [[email protected]]
Sent: Wednesday, September 30, 2015 10:56 AM
To: [email protected]
Subject: CSV with windows carriage return causes issues

I’ve created a very simple reproduction of an issue I’ve observed with
files that have a carriage return (\r) instead of a line feed (\n) ending.

My CSV file was created using notepad on Windows and looks like this when
queried directly from drill:

0: jdbc:drill:zk=local> select * from
dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
+-----------------------------------------------+
|                    columns                    |
+-----------------------------------------------+
| ["11111","test1","test2","test3","test4\r"]   |
| ["22222","test5","test6","test7","test8\r"]   |
| ["33333","test9","test10","test11","test12"]  |
+-----------------------------------------------+

As you can see the first two rows have \r at the end, also note that
column[0] has five digits.

When casting into their own columns the a column gets a digit truncated:

0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
as c, cast(columns[3] as varchar(32)) as d, cast(columns[4] as
varchar(32)) as e from
dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
+--------+--------+---------+---------+---------+
|   a    |   b    |    c    |    d    |    e    |
+--------+--------+---------+---------+---------+
  |1111  | test1  | test2   | test3   | test4
  |2222  | test5  | test6   | test7   | test8
| 33333  | test9  | test10  | test11  | test12  |
+--------+--------+---------+---------+---------+

I can get around this by using regexp_replace on the last column:

0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
as c, cast(columns[3] as varchar(32)) as d,
cast(regexp_replace(columns[4], '\r', '') as varchar(32)) as e from
dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
+--------+--------+---------+---------+---------+
|   a    |   b    |    c    |    d    |    e    |
+--------+--------+---------+---------+---------+
| 11111  | test1  | test2   | test3   | test4   |
| 22222  | test5  | test6   | test7   | test8   |
| 33333  | test9  | test10  | test11  | test12  |
+--------+--------+---------+---------+---------+

Is this expected, or should Drill treat carriage returns as line feeds? Is
this simply sqlline interpreting the \r character?
Chris [email protected]
215-701-3146

RE: CSV with windows carriage return causes issues

Reply via email to