Looks like a bug to me. Can you raise a jira for this if you haven't done it already On Sep 30, 2015 8:04 AM, <[email protected]> wrote:
> I've seen that issue too... ;) > > My personal opinion is that Drill (and sqlline) should treat Windows > end-of-line characters the same as Unix end-of-line characters. It doesn't > seem reasonable to expect users (especially in an enterprise setting) to > use dos2unix on every data file just so they can trust their query results. > > > Phil > > > Philip A Grim II > Chief Engineer > L-3 Data Tactics > 7901 Jones Branch Dr. > Suite 700 > McLean, VA 22102 > > > ________________________________________ > From: Christopher Matta [[email protected]] > Sent: Wednesday, September 30, 2015 10:56 AM > To: [email protected] > Subject: CSV with windows carriage return causes issues > > I’ve created a very simple reproduction of an issue I’ve observed with > files that have a carriage return (\r) instead of a line feed (\n) ending. > > My CSV file was created using notepad on Windows and looks like this when > queried directly from drill: > > 0: jdbc:drill:zk=local> select * from > dfs.`Users/cmatta/Downloads/windows_drill_test.csv`; > +-----------------------------------------------+ > | columns | > +-----------------------------------------------+ > | ["11111","test1","test2","test3","test4\r"] | > | ["22222","test5","test6","test7","test8\r"] | > | ["33333","test9","test10","test11","test12"] | > +-----------------------------------------------+ > > As you can see the first two rows have \r at the end, also note that > column[0] has five digits. > > When casting into their own columns the a column gets a digit truncated: > > 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a, > cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32)) > as c, cast(columns[3] as varchar(32)) as d, cast(columns[4] as > varchar(32)) as e from > dfs.`Users/cmatta/Downloads/windows_drill_test.csv`; > +--------+--------+---------+---------+---------+ > | a | b | c | d | e | > +--------+--------+---------+---------+---------+ > |1111 | test1 | test2 | test3 | test4 > |2222 | test5 | test6 | test7 | test8 > | 33333 | test9 | test10 | test11 | test12 | > +--------+--------+---------+---------+---------+ > > I can get around this by using regexp_replace on the last column: > > 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a, > cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32)) > as c, cast(columns[3] as varchar(32)) as d, > cast(regexp_replace(columns[4], '\r', '') as varchar(32)) as e from > dfs.`Users/cmatta/Downloads/windows_drill_test.csv`; > +--------+--------+---------+---------+---------+ > | a | b | c | d | e | > +--------+--------+---------+---------+---------+ > | 11111 | test1 | test2 | test3 | test4 | > | 22222 | test5 | test6 | test7 | test8 | > | 33333 | test9 | test10 | test11 | test12 | > +--------+--------+---------+---------+---------+ > > Is this expected, or should Drill treat carriage returns as line feeds? Is > this simply sqlline interpreting the \r character? > Chris [email protected] > 215-701-3146 > >
