Re: Does drill recognize new line correctly?

Jacques Nadeau Wed, 06 Jan 2016 08:29:51 -0800

For CSV:

Drill doesn't currently support newlines within a CSV record. The reason
has to do with supporting parallel reading of a CSV file. It seems
reasonable to add an option that supports this at the cost of
parallelization. Can you open a JIRA requesting this feature and vote on
it? We are more likely to focus on issues that have a number of votes.
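
A minimal sketch of why embedded newlines and parallel reading conflict.
It assumes a common splitting scheme in which each worker is handed an
offset and skips ahead to the next newline; the thread does not spell out
Drill's actual reader, so next_record_start is a hypothetical helper and
the sample data is made up for illustration.

```python
import csv
import io

# Sample data: the second record has a line break inside a quoted field.
data = 'id,comment\n1,"line one\nline two"\n2,plain\n'


def next_record_start(text, offset):
    """Hypothetical split logic: resume just after the next newline."""
    return text.index("\n", offset) + 1


# A worker handed an offset in the middle of the quoted field...
offset = data.index("line one") + 2
start = next_record_start(data, offset)
# ...resumes at the tail of a record, not at the start of a new one.
fragment = data[start:].splitlines()[0]

# Treating every newline as a record boundary splits the quoted field.
naive_records = data.splitlines()

# A sequential, quote-aware parse keeps the quoted field in one record.
parsed = list(csv.reader(io.StringIO(data)))

print(repr(fragment))
print(len(naive_records), len(parsed))
```

Without scanning from the start of the file, the worker cannot tell
whether the newline it found sits inside a quoted field, which is why an
option like the one proposed would have to give up parallel reads.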


For JSON:

I think this works. I'm guessing you are having a different problem. For
example:

$ cat /tmp/escaped_break.json
{a: 4, b: "hello"}
{a: 7, b: "hello \ngoodbye"}

$ cat /tmp/break_in_object.json
{a: 4, b: "hello"}
{
  a: 7,
  b: "hello goodbye"
}


0: jdbc:drill:zk=local> use dfs.tmp;
+-------+--------------------------------------+
|  ok   |               summary                |
+-------+--------------------------------------+
| true  | Default schema changed to [dfs.tmp]  |
+-------+--------------------------------------+
1 row selected (0.101 seconds)
0: jdbc:drill:zk=local> select * from `escaped_break.json`;
+----+-----------------+
| a  |        b        |
+----+-----------------+
| 4  | hello           |
| 7  | hello
goodbye  |
+----+-----------------+
2 rows selected (0.114 seconds)
0: jdbc:drill:zk=local> select * from `break_in_object.json`;
+----+----------------+
| a  |       b        |
+----+----------------+
| 4  | hello          |
| 7  | hello goodbye  |
+----+----------------+
2 rows selected (0.111 seconds)
0: jdbc:drill:zk=local>

Note that we don't support an actual embedded line break within a string
value. Apparently JSON requires this to be escaped; I didn't even realize
the spec requires that.

$ cat /tmp/bad_break_in_string.json
{a: 10, b: "hello
  goodbye"
}

0: jdbc:drill:zk=local> select * from `bad_break_in_string.json`;
Error: DATA_READ ERROR: Error parsing JSON - Illegal unquoted character
((CTRL-CHAR, code 10)): has to be escaped using backslash to be included in
string value

File  /tmp/bad_break_in_string.json
Record  1
Column  19
Fragment 0:0
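
The same strictness can be reproduced outside Drill. The sketch below
uses Python's standard json module purely as an illustration; it is not
Drill's parser, and unlike the files above it needs quoted keys. The
try_parse helper is a hypothetical name introduced for this example.

```python
import json

# A raw line feed (code 10) inside the string value: not valid JSON.
raw_break = '{"a": 10, "b": "hello\n  goodbye"}'

# The same kind of value with the line break escaped as backslash-n.
escaped_break = '{"a": 7, "b": "hello \\ngoodbye"}'


def try_parse(text):
    """Return the parsed object, or None if the text is rejected."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


rejected = try_parse(raw_break)
accepted = try_parse(escaped_break)

# Serializing a string that contains a newline produces the escaped form.
serialized = json.dumps("hello\ngoodbye")

print(rejected, accepted, serialized)
```

Writing the file with a JSON serializer instead of by hand avoids the
problem, since the serializer emits the escaped form.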

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Jan 5, 2016 at 8:10 PM, <[email protected]> wrote:

> Happy new year!
>   We, Japanese like new year's greeting ;-)
>
> There are two issues in this message.
>
> First, a CSV file that has values containing a new line.
> Second, a JSON file that has values containing a new line.
>
> 1) CSV
>   Doesn't drill recognize CSV which has some columns including new line?
>   For example, CSV file exported from MS-Excel.
>
>   I tried some patterns. Quoting column, escaping by \ (like \[LF]),
> replacing \r or \n...
>   But all of those are not good for me.
>
>   By the way, new lines inside CSV fields are allowed by the RFC:
>         * https://tools.ietf.org/html/rfc4180
>   So I would like to parse such CSV files, though I know the RFC is only
> informational.
>
> 2) JSON
>   Drill can't query JSON correctly when a record's value includes a new line.
>
>   JSON:
>     { "key": "test record with \n newline" }
>
>   Query:
>     select * from dfs.`test.json` where key like 'test%'
>
>   Result:
>     No result found
>
>   I think it doesn't compare the value correctly if it includes a new line.
>
> Do you know how to use new lines in values as expected?
>
> Thank you.
>
> --
> Miura, Masahide
>
>
