RE: Does drill recognize new line correctly?

Jacques Nadeau Wed, 06 Jan 2016 21:15:08 -0800

Got it. It looks like the problem is the LIKE function's wildcards not
matching across newlines.  I'm guessing we would see the same problem with
an embedded line break in a Parquet file or JDBC source.  Can you file a
bug?
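
A minimal sketch of the suspected failure mode, assuming LIKE is evaluated
by translating the pattern into a regular expression. That is an assumption
for illustration, not something confirmed in this thread. The helper names
like_to_regex and like are hypothetical and are not Drill APIs. In most
regex engines "." does not match a line break unless a "dot matches all"
flag is set, which would produce exactly the result reported below.

```python
import re


def like_to_regex(pattern, dotall):
    """Translate a SQL LIKE pattern into a compiled regex.

    Hypothetical helper: handles only the % and _ wildcards, no ESCAPE clause.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL if dotall else 0
    return re.compile("".join(parts), flags)


def like(value, pattern, dotall):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern, dotall).fullmatch(value) is not None


# The two values of column b in escaped_break.json.
rows = ["hello", "hello \ngoodbye"]
for row in rows:
    print(repr(row),
          like(row, "hello%", dotall=False),
          like(row, "hello%", dotall=True))
```

Without the flag, only the first row matches 'hello%'. With re.DOTALL, both
rows match.
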
On Jan 6, 2016 6:45 PM, <[email protected]> wrote:


> For CSV:
> I now understand why newlines are not supported.
>
>
> For JSON:
> I was referring to a WHERE clause on b, using your example.
>
> escaped_break.json is the same as yours.
>
>
> 0: jdbc:drill:zk=local> select * from `escaped_break.json` where b like
> 'hello%';
> +----+--------+
> | a  |   b    |
> +----+--------+
> | 4  | hello  |
> +----+--------+
> 1 row selected (0.228 seconds)
>
>
> It can't find the {a: 7, b: "hello \ngoodbye"} record.
>
> Thank you.
>
> --
> Miura, Masahide
>
> -----Original Message-----
> From: Jacques Nadeau [mailto:[email protected]]
> Sent: Thursday, January 07, 2016 1:29 AM
> To: user
> Subject: Re: Does drill recognize new line correctly?
>
> For CSV:
>
> Drill doesn't currently support newlines within a CSV record. The reason
> has to do with supporting parallel reading of a CSV file. It seems
> reasonable to add an option for support of this at the cost of
> parallelization capabilities. Can you open a JIRA requesting this feature
> and vote on it? We are more likely to focus on issues that have a number of
> votes.
>
> For JSON:
>
> I think this works. I'm guessing you are having a different problem. For
> example:
>
> $ cat /tmp/escaped_break.json
> {a: 4, b: "hello"}
> {a: 7, b: "hello \ngoodbye"}
>
> $  cat /tmp/break_in_object.json
> {a: 4, b: "hello"}
> {
>   a: 7,
>   b: "hello goodbye"
> }
>
>
> 0: jdbc:drill:zk=local> use dfs.tmp;
> +-------+--------------------------------------+
> |  ok   |               summary                |
> +-------+--------------------------------------+
> | true  | Default schema changed to [dfs.tmp]  |
> +-------+--------------------------------------+
> 1 row selected (0.101 seconds)
> 0: jdbc:drill:zk=local> select * from `escaped_break.json`;
> +----+-----------------+
> | a  |        b        |
> +----+-----------------+
> | 4  | hello           |
> | 7  | hello
> goodbye  |
> +----+-----------------+
> 2 rows selected (0.114 seconds)
> 0: jdbc:drill:zk=local> select * from `break_in_object.json`;
> +----+----------------+
> | a  |       b        |
> +----+----------------+
> | 4  | hello          |
> | 7  | hello goodbye  |
> +----+----------------+
> 2 rows selected (0.111 seconds)
> 0: jdbc:drill:zk=local>
>
> Note that we don't support an actual embedded line break within a string
> value (apparently JSON requires this to be escaped... I didn't even realize
> the spec requires that).
>
> $ cat /tmp/bad_break_in_string.json
> {a: 10, b: "hello
>   goodbye"
> }
>
> 0: jdbc:drill:zk=local> select * from `bad_break_in_string.json`;
> Error: DATA_READ ERROR: Error parsing JSON - Illegal unquoted character
> ((CTRL-CHAR, code 10)): has to be escaped using backslash to be included in
> string value
>
> File  /tmp/bad_break_in_string.json
> Record  1
> Column  19
> Fragment 0:0
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Jan 5, 2016 at 8:10 PM, <[email protected]> wrote:
>
> > Happy new year!
> >   We Japanese like New Year's greetings ;-)
> >
> > There are two issues in this message.
> >
> > First, a CSV file with a value that includes a newline.
> > Second, a JSON file with a value that includes a newline.
> >
> > 1) CSV
> >   Does Drill not recognize a CSV file in which some columns contain a
> > newline? For example, a CSV file exported from MS-Excel.
> >
> >   I tried some patterns. Quoting the column, escaping with \ (like \[LF]),
> > replacing \r or \n...
> >   But none of those worked for me.
> >
> >   By the way, newlines are allowed in CSV fields by the RFC:
> >         * https://tools.ietf.org/html/rfc4180
> >   So I would like to parse such CSV, though I know the RFC is only an
> > informational definition.
> >
> > 2) JSON
> >   Drill can't correctly query JSON whose records include a newline.
> >
> >   JSON:
> >     { "key": "test record with \n newline" }
> >
> >   Query:
> >     select * from dfs.`test.json` where key like 'test%'
> >
> >   Result:
> >     No result found
> >
> >   I think it doesn't compare the value correctly if it includes a newline.
> >
> > Do you know how to make newlines in values work as expected?
> >
> > Thank you.
> >
> > --
> > Miura, Masahide
> >
> >
>
