I’ve created a very simple reproduction of an issue I’ve observed with
files that have a carriage return (\r) instead of a line feed (\n) ending.

My CSV file was created using notepad on Windows and looks like this when
queried directly from drill:

0: jdbc:drill:zk=local> select * from
dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
+-----------------------------------------------+
|                    columns                    |
+-----------------------------------------------+
| ["11111","test1","test2","test3","test4\r"]   |
| ["22222","test5","test6","test7","test8\r"]   |
| ["33333","test9","test10","test11","test12"]  |
+-----------------------------------------------+

As you can see the first two rows have \r at the end, also note that
column[0] has five digits.

When casting into their own columns the a column gets a digit truncated:

0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
as c, cast(columns[3] as varchar(32)) as d, cast(columns[4] as
varchar(32)) as e from
dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
+--------+--------+---------+---------+---------+
|   a    |   b    |    c    |    d    |    e    |
+--------+--------+---------+---------+---------+
  |1111  | test1  | test2   | test3   | test4
  |2222  | test5  | test6   | test7   | test8
| 33333  | test9  | test10  | test11  | test12  |
+--------+--------+---------+---------+---------+

I can get around this by using regexp_replace on the last column:

0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
as c, cast(columns[3] as varchar(32)) as d,
cast(regexp_replace(columns[4], '\r', '') as varchar(32)) as e from
dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
+--------+--------+---------+---------+---------+
|   a    |   b    |    c    |    d    |    e    |
+--------+--------+---------+---------+---------+
| 11111  | test1  | test2   | test3   | test4   |
| 22222  | test5  | test6   | test7   | test8   |
| 33333  | test9  | test10  | test11  | test12  |
+--------+--------+---------+---------+---------+

Is this expected, or should Drill treat carriage returns as line feeds? Is
this simply sqlline interpreting the \r character?
Chris [email protected]
215-701-3146
​

Reply via email to