Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-28 Thread scott
Paul, thank you for the workaround; it worked perfectly in my case!

Scott


RE: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-28 Thread Lee, David
This is a pretty ugly JSON file: 568 MB for 7,227 records.

=> ls -l test.jsonl
-rw-r--r-- 1 my_login users 568693075 Aug 28 15:15 test.jsonl

There is a difference of one (7226 vs 7227), but that comes from wc.

wc -l does NOT count the last line of the file if it does not end with an end-of-line character.
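
The off-by-one can be reproduced outside Drill. The following is only a sketch: the function names and the three-record sample are made up for illustration, and it assumes one record per non-empty line. It contrasts counting newline bytes (what wc -l reports) with counting the lines themselves:

```python
def count_newlines(data: bytes) -> int:
    """What `wc -l` reports: the number of newline bytes in the input."""
    return data.count(b"\n")


def count_records(data: bytes) -> int:
    """Number of non-empty lines, whether or not the last one ends in a newline."""
    return sum(1 for line in data.splitlines() if line.strip())


# Hypothetical three-record JSONL payload with no trailing newline.
sample = b'{"var1": "foo"}\n{"var1": "fo"}\n{"var1": "f2o"}'

print(count_newlines(sample))  # 2
print(count_records(sample))   # 3
```

Appending a final newline to the sample makes the two counts agree.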


RE: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-28 Thread Lee, David
select count(*) on a jsonl file comes back instantly

/u1/my_login=> wc -l test.jsonl
7226 test.jsonl

select count(*) from dfs.`/u1/my_login/test.jsonl`  

EXPR$0
7227

Overview
Operator ID | Type                | Avg Setup Time | Max Setup Time | Avg Process Time | Max Process Time | Min Wait Time | Avg Wait Time | Max Wait Time | % Fragment Time | % Query Time | Rows  | Avg Peak Memory | Max Peak Memory
00-xx-00    | JSON_SUB_SCAN       | 0.000s         | 0.000s         | 1.096s           | 3.287s           | 0.000s        | 0.181s        | 0.543s        | 99.58%          | 99.58%       | 7,228 | 24KB            | 32KB
00-xx-01    | PROJECT             | 0.001s         | 0.001s         | 0.000s           | 0.000s           | 0.000s        | 0.000s        | 0.000s        | 0.00%           | 0.00%        | 1     | 32KB            | 32KB
00-xx-02    | STREAMING_AGGREGATE | 0.022s         | 0.022s         | 0.001s           | 0.001s           | 0.000s        | 0.000s        | 0.000s        | 0.04%           | 0.04%        | 1     | 64KB            | 64KB
00-xx-03    | STREAMING_AGGREGATE | 0.040s         | 0.040s         | 0.011s           | 0.011s           | 0.000s        | 0.000s        | 0.000s        | 0.34%           | 0.34%        | 7,227 | 48KB            | 48KB
00-xx-04    | PROJECT             | 0.032s         | 0.032s         | 0.001s           | 0.001s           | 0.000s        | 0.000s        | 0.000s        | 0.04%           | 0.04%        | 7,227 | 16KB            | 16KB



Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-28 Thread Paul Rogers
Hi Scott,

Bingo. I just tried this very case with the sample file from the previous post
and got exactly the failure from the post you provided. I notice that a
"select *" query returns immediately, but a "count(*)" query hangs for 30+
seconds before it errors out. Mine is only a two-record file, so taking 30
seconds to fail is excessive.

Clearly, something is wrong. At the very least, a count(*) should simply read 
all records and discard the data, using exactly the same JSON parser as for a 
"SELECT *" query. That Drill is not doing so suggests that perhaps the code is 
trying to be clever to optimize for the "count(*)" case, and is doing so 
incorrectly.

Here is a clunky workaround: just add a WHERE clause that accepts all records:

SELECT COUNT(*) FROM `test.json` WHERE 1 = 1;
+---------+
| EXPR$0  |
+---------+
| 2       |
+---------+

As it turns out, I'm in the (very slow) process of issuing PRs for a revised
JSON record reader to handle other issues. A side effect of that change is that
the new implementation uses the same parse path for both "SELECT *" and
"SELECT count(*)" queries. So, even if no one can fix this bug in the short
term, a longer-term fix is coming.

Thanks,
- Paul

 


Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-28 Thread scott
Paul,
Thanks for prompting the right questions. I went back and took another look
at my queries. It turns out that some condition causes this error when
running functions like "count(*)" on the data, whereas a normal unqualified
select does not trigger it. I also ran across this article from MapR that
led me to conclude Drill just doesn't support it.

https://mapr.com/support/s/article/Apache-Drill-cannot-read-from-middle-of-a-record?language=en_US

I think if we can confirm exactly which conditions cause the problem, we
should open a high priority Jira. What do you think?




RE: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-28 Thread Lee, David
The other JSON format is officially called JSONL. Could the next version of
Drill include "jsonl" by default in the extensions of the JSON format in the
storage plugins?

http://jsonlines.org/

From:

"json": {
  "type": "json",
  "extensions": [
    "json"
  ]
},

To:

"json": {
  "type": "json",
  "extensions": [
    "json", "jsonl"
  ]
},

Having worked with both JSON and JSONL, I find JSONL much easier to work with
using other tools and programming languages.

A simple Linux grep command can be used to find data, but trying to grep a JSON
file with no line breaks just returns a wall of text.
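
The same line-at-a-time property that makes grep useful also makes JSONL easy to filter in code. This is a minimal sketch, not anything Drill does internally; grep_jsonl and the sample records are hypothetical names and data chosen for illustration:

```python
import json
from typing import Iterable, Iterator


def grep_jsonl(lines: Iterable[str], field: str, needle: str) -> Iterator[dict]:
    """Yield records whose `field` value contains `needle`, one line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # each line is a complete JSON object
        if needle in str(record.get(field, "")):
            yield record


# Hypothetical sample in the one-object-per-line layout.
sample_lines = [
    '{"var1": "foo", "var2": "bar"}',
    '{"var1": "fo", "var2": "baz"}',
    '{"var1": "f2o", "var2": "baz2"}',
]

for match in grep_jsonl(sample_lines, "var2", "baz"):
    print(match)
```

Because only one line is parsed at a time, the same function can be passed an open file object and memory use stays bounded by the longest record.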




Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-28 Thread Paul Rogers
Hi Scott,

I created a file, "test.json", using the data from your e-mail:

[ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]

The oldest build I have readily available is Drill 1.13. I ran that as a 
server, then connected with sqlline as a client. I ran a query:

select * from `test.json`;
+-------+-------+
| var1  | var2  |
+-------+-------+
| foo   | bar   |
| fo    | baz   |
+-------+-------+

I can try with Drill 1.12, once I find and download it. Or, you can try with
Drill 1.14 (the latest release).

I do wonder, however, if we are talking about the same thing. My test puts your
JSON in a file with a ".json" extension so that Drill chooses the JSON
parser. I'm using default JSON (session) options.

Is this what you are doing? Or, is your JSON coming from some other source? 
Kafka? A field from a CSV file, say?

Thanks,
- Paul

 


Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-27 Thread scott
Paul,
I'm using version 1.12. Can you tell me what version you think that was
fixed in? The ticket I referenced is still open, with no comments.

Scott



Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

2018-08-27 Thread Paul Rogers
Hi David,

JSON files are never splittable: there is no single-character way to find the 
start of a JSON record within a file.

Drill is supposed to support two JSON formats: the array format from the 
earlier post, and the non-JSON (but very common) list of objects format in this 
example.

Thanks,
- Paul
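
To make the two layouts concrete, here is a rough sketch of a reader that accepts either one. It is not Drill's parser; read_records is a hypothetical helper built on Python's standard json module. It also shows why a reader cannot jump into the middle of a file: each value has to be decoded sequentially from a known starting point.

```python
import json
from typing import Iterator


def read_records(text: str) -> Iterator[dict]:
    """Yield objects from a JSON array or from a sequence of top-level objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip whitespace (including newlines) between top-level values.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return
        # Decode exactly one value starting at pos; end is where it stopped.
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            yield from value  # array format: [ {...}, {...} ]
        else:
            yield value       # sequence-of-objects format: {...} {...}


array_form = '[ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]'
lines_form = '{"var1": "foo", "var2":"bar"}\n{"var1": "fo", "var2": "baz"}\n'

print(list(read_records(array_form)) == list(read_records(lines_form)))  # True
```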

 

On Monday, August 27, 2018, 5:38:32 PM PDT, Lee, David 
 wrote:  
 
 Get rid of the opening and closing brackets and see if you can turn the commas 
into newlines.. The file needs to be splittable I think to reduce memory 
overhead vs parsing a giant string...

{"var1": "foo", "var2":"bar"}
{"var1": "fo", "var2": "baz"}
{"var1": "f2o", "var2": "baz2"}
{"var1": "f3o", "var2": "baz3"}
{"var1": "f4o", "var2": "baz4"}
{"var1": "f5o", "var2": "baz5"}
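
One way to do that conversion is sketched below, under the assumption that the array fits in memory. array_to_jsonl is a hypothetical helper, not a Drill utility. It parses the array rather than replacing commas textually, since commas also appear inside each object:

```python
import json


def array_to_jsonl(array_text: str) -> str:
    """Convert a JSON array of objects into one JSON object per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps never emits raw newlines, so each record stays on one line.
    return "\n".join(json.dumps(record) for record in records) + "\n"


array_text = '[ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]'
print(array_to_jsonl(array_text), end="")
```

For a file in the hundreds of megabytes, loading the whole array at once may be too costly, and a streaming parser would be the safer choice.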

-Original Message-
From: scott [mailto:tcots8...@gmail.com] 
Sent: Monday, August 27, 2018 4:59 PM
To: user@drill.apache.org
Subject: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the 
middle of a record

[EXTERNAL EMAIL]


Hi All,
I'm getting an error querying some of my json files.
The error I'm getting is: Error: DATA_READ ERROR: Error parsing JSON - Cannot 
read from the middle of a record. Current token was START_ARRAY

The json files are in array format, like [ { "var1": "foo", "var2":
"bar"},{"var1": "fo", "var2": "baz"}]

I found a ticket that indicates this format is not supported by Drill yet,
DRILL-1755 <https://jira.apache.org/jira/browse/DRILL-1755>, but I find it hard
to believe there is no workaround or solution since this was reported 4 years
back. Does anyone have a solution or workaround to this problem?

Thanks,
Scott


This message may contain information that is confidential or privileged. If you 
are not the intended recipient, please advise the sender immediately and delete 
this message. See 
http://www.blackrock.com/corporate/en-us/compliance/email-disclaimers for 
further information.  Please refer to 
http://www.blackrock.com/corporate/en-us/compliance/privacy-policy for more 
information about BlackRock’s Privacy Policy.

For a list of BlackRock's office addresses worldwide, see 
http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.

© 2018 BlackRock, Inc. All rights reserved.