select count(*) on a jsonl file comes back instantly:

/u1/my_login=> wc -l test.jsonl
7226 test.jsonl
select count(*) from dfs.`/u1/my_login/test.jsonl`

EXPR$0
7227

Overview:

Operator ID  Type                 Avg Setup  Max Setup  Avg Process  Max Process  Min Wait  Avg Wait  Max Wait  % Fragment Time  % Query Time  Rows   Avg Peak Memory  Max Peak Memory
00-xx-00     JSON_SUB_SCAN        0.000s     0.000s     1.096s       3.287s       0.000s    0.181s    0.543s    99.58%           99.58%        7,228  24KB             32KB
00-xx-01     PROJECT              0.001s     0.001s     0.000s       0.000s       0.000s    0.000s    0.000s    0.00%            0.00%         1      32KB             32KB
00-xx-02     STREAMING_AGGREGATE  0.022s     0.022s     0.001s       0.001s       0.000s    0.000s    0.000s    0.04%            0.04%         1      64KB             64KB
00-xx-03     STREAMING_AGGREGATE  0.040s     0.040s     0.011s       0.011s       0.000s    0.000s    0.000s    0.34%            0.34%         7,227  48KB             48KB
00-xx-04     PROJECT              0.032s     0.032s     0.001s       0.001s       0.000s    0.000s    0.000s    0.04%            0.04%         7,227  16KB             16KB

-----Original Message-----
From: Paul Rogers [mailto:[email protected]]
Sent: Tuesday, August 28, 2018 11:23 AM
To: [email protected]
Subject: Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

[EXTERNAL EMAIL]

Hi Scott,

Bingo. Just tried this very case with the sample file from the previous post. Got exactly the failure in the post you provided.

I notice that a "select *" query returns immediately, but a "count(*)" query hangs for 30+ seconds before it errors out. Mine is only a two-record file, so taking 30 seconds to fail is excessive. Clearly, something is wrong.

At the very least, a count(*) should simply read all records and discard the data, using exactly the same JSON parser as for a "SELECT *" query. That Drill is not doing so suggests that perhaps the code is trying to be clever to optimize for the "count(*)" case, and is doing so incorrectly.

Here is a clunky workaround: just add a WHERE clause that accepts all records:

SELECT COUNT(*) FROM `test.json` WHERE 1 = 1;
+---------+
| EXPR$0  |
+---------+
| 2       |
+---------+

As it turns out, I'm in the (very slow) process of issuing PRs for a revised JSON record reader to handle other issues.
A side effect of that change is that the new implementation does use the same parse path for both the "SELECT *" and "SELECT count(*)" cases. So, even if someone cannot fix this bug short term, there is a longer-term fix coming.

Thanks,
- Paul

On Tuesday, August 28, 2018, 8:46:11 AM PDT, scott <[email protected]> wrote:

Paul,
Thanks for prompting the right questions. I went back and took another look at my queries. It turns out that some condition causes this error when running functions like "count(*)" on the data, where a normal unqualified select does not. I also ran across this article from MapR that led me to conclude Drill just doesn't support it:

https://mapr.com/support/s/article/Apache-Drill-cannot-read-from-middle-of-a-record?language=en_US

I think if we can confirm exactly which conditions cause the problem, we should open a high priority Jira. What do you think?

On Mon, Aug 27, 2018 at 11:58 PM Paul Rogers <[email protected]> wrote:

> Hi Scott,
>
> I created a file, "test.json", using the data from your e-mail:
>
> [ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]
>
> The oldest build I have readily available is Drill 1.13. I ran that as
> a server, then connected with sqlline as a client. I ran a query:
>
> select * from `test.json`;
> +-------+-------+
> | var1  | var2  |
> +-------+-------+
> | foo   | bar   |
> | fo    | baz   |
> +-------+-------+
>
> I can try with Drill 1.12, once I find and download it. Or, you can
> try with Drill 1.14 (the latest release).
>
> I do wonder, however, if we are talking about the same thing. My test
> puts your JSON in a JSON file with a ".json" extension so that Drill
> chooses the JSON parser. I'm using default JSON (session) options.
> Is this what you are doing? Or, is your JSON coming from some other
> source? Kafka? A field from a CSV file, say?
>
> Thanks,
> - Paul
>
> On Monday, August 27, 2018, 10:31:00 PM PDT, scott <[email protected]> wrote:
>
> Paul,
> I'm using version 1.12. Can you tell me what version you think that
> was fixed in? The ticket I referenced is still open, with no comments.
>
> Scott
>
> On Mon, Aug 27, 2018 at 5:47 PM Paul Rogers <[email protected]> wrote:
>
> > Hi David,
> >
> > JSON files are never splittable: there is no single-character way to
> > find the start of a JSON record within a file.
> >
> > Drill is supposed to support two JSON formats: the array format from
> > the earlier post, and the non-JSON (but very common) list-of-objects
> > format in this example.
> >
> > Thanks,
> > - Paul
> >
> > On Monday, August 27, 2018, 5:38:32 PM PDT, Lee, David <[email protected]> wrote:
> >
> > Get rid of the opening and closing brackets and see if you can turn
> > the commas into newlines. The file needs to be splittable, I think,
> > to reduce memory overhead vs. parsing a giant string...
> >
> > {"var1": "foo", "var2":"bar"}
> > {"var1": "fo", "var2": "baz"}
> > {"var1": "f2o", "var2": "baz2"}
> > {"var1": "f3o", "var2": "baz3"}
> > {"var1": "f4o", "var2": "baz4"}
> > {"var1": "f5o", "var2": "baz5"}
> >
> > -----Original Message-----
> > From: scott [mailto:[email protected]]
> > Sent: Monday, August 27, 2018 4:59 PM
> > To: [email protected]
> > Subject: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record
> >
> > [EXTERNAL EMAIL]
> >
> > Hi All,
> > I'm getting an error querying some of my json files. The error I'm getting is:
> >
> > Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record. Current token was START_ARRAY
> >
> > The json files are in array format, like:
> >
> > [ { "var1": "foo", "var2": "bar"},{"var1": "fo", "var2": "baz"}]
> >
> > I found a ticket that indicates this format is not supported by Drill yet,
> > DRILL-1755 <https://jira.apache.org/jira/browse/DRILL-1755>, but I find it
> > hard to believe there is no workaround or solution since this was reported
> > 4 years back. Does anyone have a solution or workaround to this problem?
> >
> > Thanks,
> > Scott
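[Editor's note] David Lee's suggestion in the thread above, dropping the enclosing brackets and turning the comma-separated array elements into one object per line, amounts to converting array-format JSON into newline-delimited JSON (JSONL), which is the format Drill reads record by record. A minimal sketch of that conversion (not a Drill-provided tool; the function name is made up for illustration):

```python
import json

def array_to_ndjson(text: str) -> str:
    """Convert an array-format JSON document ([{...},{...}]) to
    newline-delimited JSON: one object per line."""
    records = json.loads(text)  # parses the whole top-level array into memory
    return "".join(json.dumps(r) + "\n" for r in records)

# The two-record document from the thread:
doc = '[ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]'
print(array_to_ndjson(doc), end="")
# {"var1": "foo", "var2": "bar"}
# {"var1": "fo", "var2": "baz"}
```

Note that `json.loads` holds the entire array in memory at once, the very "parsing a giant string" overhead David mentions; converting a file too large to fit in memory would need a streaming parser instead.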

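[Editor's note] At the top of the thread, `wc -l` reports 7226 lines while Drill's count(*) returns 7227. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the final record lacks a trailing newline: `wc -l` counts newline characters, so an unterminated last line is not counted. A small sketch for counting JSON records independently of newline conventions (path and function name are hypothetical):

```python
import json

def count_jsonl_records(path: str) -> int:
    """Count non-blank JSON records in a JSONL file. Unlike `wc -l`,
    this counts a final record even if it has no trailing newline."""
    count = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:                 # skip blank lines
                json.loads(line)     # raises ValueError on a malformed record
                count += 1
    return count

# Usage (path hypothetical):
# print(count_jsonl_records("/u1/my_login/test.jsonl"))
```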