Re: CSV with windows carriage return causes issues

Edmon Begoli Wed, 30 Sep 2015 10:03:25 -0700

I have this already filled in under:
https://issues.apache.org/jira/browse/DRILL-3726


On Wed, Sep 30, 2015 at 11:47 AM, rahul challapalli <
[email protected]> wrote:

> Looks like a bug to me. Can you raise a jira for this if you haven't done
> it already
> On Sep 30, 2015 8:04 AM, <[email protected]> wrote:
>
> > I've seen that issue too... ;)
> >
> > My personal opinion is that Drill (and sqlline) should treat Windows
> > end-of-line characters the same as Unix end-of-line characters.  It
> doesn't
> > seem reasonable to expect users (especially in an enterprise setting) to
> > use dos2unix on every data file just so they can trust their query
> results.
> >
> >
> > Phil
> >
> >
> > Philip A Grim II
> > Chief Engineer
> > L-3 Data Tactics
> > 7901 Jones Branch Dr.
> > Suite 700
> > McLean, VA  22102
> >
> >
> > ________________________________________
> > From: Christopher Matta [[email protected]]
> > Sent: Wednesday, September 30, 2015 10:56 AM
> > To: [email protected]
> > Subject: CSV with windows carriage return causes issues
> >
> > I’ve created a very simple reproduction of an issue I’ve observed with
> > files that have a carriage return (\r) instead of a line feed (\n)
> ending.
> >
> > My CSV file was created using notepad on Windows and looks like this when
> > queried directly from drill:
> >
> > 0: jdbc:drill:zk=local> select * from
> > dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> > +-----------------------------------------------+
> > |                    columns                    |
> > +-----------------------------------------------+
> > | ["11111","test1","test2","test3","test4\r"]   |
> > | ["22222","test5","test6","test7","test8\r"]   |
> > | ["33333","test9","test10","test11","test12"]  |
> > +-----------------------------------------------+
> >
> > As you can see the first two rows have \r at the end, also note that
> > column[0] has five digits.
> >
> > When casting into their own columns the a column gets a digit truncated:
> >
> > 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
> > cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
> > as c, cast(columns[3] as varchar(32)) as d, cast(columns[4] as
> > varchar(32)) as e from
> > dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> > +--------+--------+---------+---------+---------+
> > |   a    |   b    |    c    |    d    |    e    |
> > +--------+--------+---------+---------+---------+
> >   |1111  | test1  | test2   | test3   | test4
> >   |2222  | test5  | test6   | test7   | test8
> > | 33333  | test9  | test10  | test11  | test12  |
> > +--------+--------+---------+---------+---------+
> >
> > I can get around this by using regexp_replace on the last column:
> >
> > 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
> > cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
> > as c, cast(columns[3] as varchar(32)) as d,
> > cast(regexp_replace(columns[4], '\r', '') as varchar(32)) as e from
> > dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> > +--------+--------+---------+---------+---------+
> > |   a    |   b    |    c    |    d    |    e    |
> > +--------+--------+---------+---------+---------+
> > | 11111  | test1  | test2   | test3   | test4   |
> > | 22222  | test5  | test6   | test7   | test8   |
> > | 33333  | test9  | test10  | test11  | test12  |
> > +--------+--------+---------+---------+---------+
> >
> > Is this expected, or should Drill treat carriage returns as line feeds?
> Is
> > this simply sqlline interpreting the \r character?
> > Chris [email protected]
> > 215-701-3146
> > 
> >
>

Re: CSV with windows carriage return causes issues

Reply via email to