Re: Does drill recognize new line correctly?

Peder Jakobsen | gmail Wed, 06 Jan 2016 08:52:52 -0800

Hi, if you are working on Unix, use *iconv* to remove newlines and to handle
other things like BOMs, converting to UTF-8, etc.


Perhaps Google "iconv remove newlines from csv"..?

iconv is quick; you can process gigabytes of nested CSV files in minutes.

If you are using Windows, I'm not sure.
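
If a scripted, cross-platform route is more convenient, here is a rough Python
sketch of the same clean-up idea: read the CSV with a parser that understands
quoted fields, then write every record back on a single line so that Drill
never sees a line break inside a record. The function name
flatten_csv_newlines and the choice of a space as the replacement are my own
assumptions for illustration, not anything Drill or iconv provides.

```python
import csv
import io
import re

# Matches CRLF, bare CR, or bare LF inside a field value.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def flatten_csv_newlines(src, dst, replacement=" "):
    """Copy CSV records from src to dst, replacing line breaks inside fields.

    Hypothetical helper: src and dst are text file objects. The csv module
    parses quoted fields (RFC 4180 style), so an embedded line break arrives
    as part of the field value and can be replaced before writing.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow([_LINE_BREAK.sub(replacement, field) for field in row])


if __name__ == "__main__":
    # Small in-memory demo; for real files, open them with newline="" as
    # the csv module documentation recommends.
    sample = io.StringIO('id,note\n1,"line one\nline two"\n2,plain\n')
    cleaned = io.StringIO()
    flatten_csv_newlines(sample, cleaned)
    print(cleaned.getvalue(), end="")
```

This loses the original line breaks, so it only fits cases where the breaks
inside a value carry no meaning for the query.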

Peder




On Wed, Jan 6, 2016 at 11:29 AM, Jacques Nadeau <[email protected]> wrote:

> For CSV:
>
> Drill doesn't currently support newlines within a csv record. The reason
> has to do with supporting parallel reading of a csv file. It seems
> reasonable to add an option for support of this at the cost of
> parallelization capabilities. Can you open a JIRA requesting this feature
> and vote on it? We are more likely to focus on issues that have a number of
> votes.
>
> For JSON:
>
> I think this works. I'm guessing you are having a different problem. For
> example:
>
> $ cat /tmp/escaped_break.json
> {a: 4, b: "hello"}
> {a: 7, b: "hello \ngoodbye"}
>
> $ cat /tmp/break_in_object.json
> {a: 4, b: "hello"}
> {
>   a: 7,
>   b: "hello goodbye"
> }
>
>
> 0: jdbc:drill:zk=local> use dfs.tmp;
> +-------+--------------------------------------+
> |  ok   |               summary                |
> +-------+--------------------------------------+
> | true  | Default schema changed to [dfs.tmp]  |
> +-------+--------------------------------------+
> 1 row selected (0.101 seconds)
> 0: jdbc:drill:zk=local> select * from `escaped_break.json`;
> +----+-----------------+
> | a  |        b        |
> +----+-----------------+
> | 4  | hello           |
> | 7  | hello
> goodbye  |
> +----+-----------------+
> 2 rows selected (0.114 seconds)
> 0: jdbc:drill:zk=local> select * from `break_in_object.json`;
> +----+----------------+
> | a  |       b        |
> +----+----------------+
> | 4  | hello          |
> | 7  | hello goodbye  |
> +----+----------------+
> 2 rows selected (0.111 seconds)
> 0: jdbc:drill:zk=local>
>
> Note that we don't support an actual embedded line break within a string
> value (apparently json requires this to be escaped... I didn't even realize
> the spec requires that).
>
> $ cat /tmp/bad_break_in_string.json
> {a: 10, b: "hello
>   goodbye"
> }
>
> 0: jdbc:drill:zk=local> select * from `bad_break_in_string.json`;
> Error: DATA_READ ERROR: Error parsing JSON - Illegal unquoted character
> ((CTRL-CHAR, code 10)): has to be escaped using backslash to be included in
> string value
>
> File  /tmp/bad_break_in_string.json
> Record  1
> Column  19
> Fragment 0:0
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Jan 5, 2016 at 8:10 PM, <[email protected]> wrote:
>
> > Happy new year!
> >   We, Japanese like new year's greeting ;-)
> >
> > There are two issues in this message.
> >
> > First, a CSV file with values that include a new line.
> > Second, a JSON file with values that include a new line.
> >
> > 1) CSV
> >   Doesn't drill recognize CSV which has some columns including new line?
> >   For example, CSV file exported from MS-Excel.
> >
> >   I tried some patterns. Quoting column, escaping by \ (like \[LF]),
> > replacing \r or \n...
> >   But all of those are not good for me.
> >
> >   By the way, new lines are approved in CSV columns by RFC, you know.
> >         * https://tools.ietf.org/html/rfc4180
> >   Then I would like to parse such CSV though I know it is informational
> > definition.
> >
> > 2) JSON
> >   Drill can't correctly query JSON whose records include a new line.
> >
> >   JSON:
> >     { "key": "test record with \n newline" }
> >
> >   Query:
> >     select * from dfs.`test.json` where key like 'test%'
> >
> >   Result:
> >     No result found
> >
> >   It doesn't compare value correctly if it includes new line, I think.
> >
> > Do you know how to use new lines in values as expected?
> >
> > Thank you.
> >
> > --
> > Miura, Masahide
> >
> >
>
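
On the JSON point quoted above (a literal line break inside a string value has
to be escaped), a quick way to check a record before handing it to Drill is to
run it through a strict JSON parser. The sketch below uses Python's standard
json module; the helper name parses_as_json is hypothetical. Note that Python
requires quoted keys, so the sample records here quote them, unlike the
unquoted keys in the quoted examples.

```python
import json


def parses_as_json(text):
    """Return True if text is accepted by Python's strict JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A real line feed inside the string value: rejected by a strict parser.
raw_break = '{"a": 10, "b": "hello\n  goodbye"}'

# The two characters backslash and n inside the string value: accepted.
escaped_break = r'{"a": 7, "b": "hello \ngoodbye"}'

if __name__ == "__main__":
    print(parses_as_json(raw_break))
    print(parses_as_json(escaped_break))
    # json.dumps writes the escaped form, so the record stays on one line.
    print(json.dumps({"a": 7, "b": "hello \ngoodbye"}))
```

Writing records with a JSON serializer rather than by string concatenation
avoids the unescaped control character in the first place.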
