I've seen that issue too... ;) My personal opinion is that Drill (and sqlline) should treat Windows end-of-line characters the same as Unix end-of-line characters. It doesn't seem reasonable to expect users (especially in an enterprise setting) to use dos2unix on every data file just so they can trust their query results.
Phil Philip A Grim II Chief Engineer L-3 Data Tactics 7901 Jones Branch Dr. Suite 700 McLean, VA 22102 ________________________________________ From: Christopher Matta [[email protected]] Sent: Wednesday, September 30, 2015 10:56 AM To: [email protected] Subject: CSV with windows carriage return causes issues I’ve created a very simple reproduction of an issue I’ve observed with files that have a carriage return (\r) instead of a line feed (\n) ending. My CSV file was created using notepad on Windows and looks like this when queried directly from drill: 0: jdbc:drill:zk=local> select * from dfs.`Users/cmatta/Downloads/windows_drill_test.csv`; +-----------------------------------------------+ | columns | +-----------------------------------------------+ | ["11111","test1","test2","test3","test4\r"] | | ["22222","test5","test6","test7","test8\r"] | | ["33333","test9","test10","test11","test12"] | +-----------------------------------------------+ As you can see the first two rows have \r at the end, also note that column[0] has five digits. When casting into their own columns the a column gets a digit truncated: 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a, cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32)) as c, cast(columns[3] as varchar(32)) as d, cast(columns[4] as varchar(32)) as e from dfs.`Users/cmatta/Downloads/windows_drill_test.csv`; +--------+--------+---------+---------+---------+ | a | b | c | d | e | +--------+--------+---------+---------+---------+ |1111 | test1 | test2 | test3 | test4 |2222 | test5 | test6 | test7 | test8 | 33333 | test9 | test10 | test11 | test12 | +--------+--------+---------+---------+---------+ I can get around this by using regexp_replace on the last column: 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a, cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32)) as c, cast(columns[3] as varchar(32)) as d, cast(regexp_replace(columns[4], '\r', '') as varchar(32)) as e from dfs.`Users/cmatta/Downloads/windows_drill_test.csv`; +--------+--------+---------+---------+---------+ | a | b | c | d | e | +--------+--------+---------+---------+---------+ | 11111 | test1 | test2 | test3 | test4 | | 22222 | test5 | test6 | test7 | test8 | | 33333 | test9 | test10 | test11 | test12 | +--------+--------+---------+---------+---------+ Is this expected, or should Drill treat carriage returns as line feeds? Is this simply sqlline interpreting the \r character? Chris [email protected] 215-701-3146
