select count(*) on a jsonl file comes back instantly:

/u1/my_login=> wc -l test.jsonl
7226 test.jsonl
select count(*) from dfs.`/u1/my_login/test.jsonl`

EXPR$0
7227

Overview:

Operator ID  Type                 Avg Setup  Max Setup  Avg Process  Max Process  Min Wait  Avg Wait  Max Wait  % Fragment Time  % Query Time  Rows   Avg Peak Memory  Max Peak Memory
00-xx-00     JSON_SUB_SCAN        0.000s     0.000s     1.096s       3.287s       0.000s    0.181s    0.543s    99.58%           99.58%        7,228  24KB             32KB
00-xx-01     PROJECT              0.001s     0.001s     0.000s       0.000s       0.000s    0.000s    0.000s    0.00%            0.00%         1      32KB             32KB
00-xx-02     STREAMING_AGGREGATE  0.022s     0.022s     0.001s       0.001s       0.000s    0.000s    0.000s    0.04%            0.04%         1      64KB             64KB
00-xx-03     STREAMING_AGGREGATE  0.040s     0.040s     0.011s       0.011s       0.000s    0.000s    0.000s    0.34%            0.34%         7,227  48KB             48KB
00-xx-04     PROJECT              0.032s     0.032s     0.001s       0.001s       0.000s    0.000s    0.000s    0.04%            0.04%         7,227  16KB             16KB

-----Original Message-----
From: Paul Rogers [mailto:[email protected]]
Sent: Tuesday, August 28, 2018 11:23 AM
To: [email protected]
Subject: Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record

[EXTERNAL EMAIL]

Hi Scott,

Bingo. Just tried this very case with the sample file from the previous post. Got exactly the failure in the post you provided.

I notice that a "select *" query returns immediately, but a "count(*)" query hangs for 30+ seconds before it errors out. Mine is only a two-record file, so taking 30 seconds to fail is excessive. Clearly, something is wrong.

At the very least, a count(*) should simply read all records and discard the data, using exactly the same JSON parser as for a "SELECT *" query. That Drill is not doing so suggests that perhaps the code is trying to be clever to optimize for the "count(*)" case, and is doing so incorrectly.

Here is a clunky workaround: just add a WHERE clause that accepts all records:

SELECT COUNT(*) FROM `test.json` WHERE 1 = 1;
+---------+
| EXPR$0  |
+---------+
| 2       |
+---------+

As it turns out, I'm in the (very slow) process of issuing PRs for a revised JSON record reader to handle other issues.
A side effect of that change is that the new implementation does use the same parse path for both the "SELECT *" and "SELECT count(*)" cases. So, even if someone cannot fix this bug short term, there is a longer-term fix coming.

Thanks,
- Paul

On Tuesday, August 28, 2018, 8:46:11 AM PDT, scott <[email protected]> wrote:

Paul,
Thanks for prompting the right questions. I went back and took another look at my queries. It turns out that some condition causes this error when running functions like "count(*)" on the data, where a normal unqualified select does not. I also ran across this article from MapR that led me to conclude Drill just doesn't support it:

https://mapr.com/support/s/article/Apache-Drill-cannot-read-from-middle-of-a-record?language=en_US

I think if we can confirm exactly which conditions cause the problem, we should open a high priority Jira. What do you think?

On Mon, Aug 27, 2018 at 11:58 PM Paul Rogers <[email protected]> wrote:

> Hi Scott,
>
> I created a file, "test.json", using the data from your e-mail:
>
> [ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]
>
> The oldest build I have readily available is Drill 1.13. I ran that as
> a server, then connected with sqlline as a client. I ran a query:
>
> select * from `test.json`;
> +-------+-------+
> | var1  | var2  |
> +-------+-------+
> | foo   | bar   |
> | fo    | baz   |
> +-------+-------+
>
> I can try with Drill 1.12, once I find and download it. Or, you can
> try with Drill 1.14 (the latest release).
>
> I do wonder, however, if we are talking about the same thing. My test
> puts your JSON in a JSON file with a ".json" extension so that Drill
> chooses the JSON parser. I'm using default JSON (session) options.
> Is this what you are doing? Or, is your JSON coming from some other
> source? Kafka? A field from a CSV file, say?
>
> Thanks,
> - Paul
>
> On Monday, August 27, 2018, 10:31:00 PM PDT, scott <[email protected]> wrote:
>
> Paul,
> I'm using version 1.12. Can you tell me what version you think that
> was fixed in? The ticket I referenced is still open, with no comments.
>
> Scott
>
> On Mon, Aug 27, 2018 at 5:47 PM Paul Rogers <[email protected]> wrote:
>
> > Hi David,
> >
> > JSON files are never splittable: there is no single-character way to
> > find the start of a JSON record within a file.
> >
> > Drill is supposed to support two JSON formats: the array format from
> > the earlier post, and the non-JSON (but very common) list-of-objects
> > format in this example.
> >
> > Thanks,
> > - Paul
> >
> > On Monday, August 27, 2018, 5:38:32 PM PDT, Lee, David <[email protected]> wrote:
> >
> > Get rid of the opening and closing brackets and see if you can turn
> > the commas into newlines. The file needs to be splittable, I think,
> > to reduce memory overhead vs. parsing a giant string...
> >
> > {"var1": "foo", "var2":"bar"}
> > {"var1": "fo", "var2": "baz"}
> > {"var1": "f2o", "var2": "baz2"}
> > {"var1": "f3o", "var2": "baz3"}
> > {"var1": "f4o", "var2": "baz4"}
> > {"var1": "f5o", "var2": "baz5"}
> >
> > -----Original Message-----
> > From: scott [mailto:[email protected]]
> > Sent: Monday, August 27, 2018 4:59 PM
> > To: [email protected]
> > Subject: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record
> >
> > [EXTERNAL EMAIL]
> >
> > Hi All,
> > I'm getting an error querying some of my json files. The error I'm getting is:
> >
> > Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record. Current token was START_ARRAY
> >
> > The json files are in array format, like:
> >
> > [ { "var1": "foo", "var2": "bar"},{"var1": "fo", "var2": "baz"}]
> >
> > I found a ticket that indicates this format is not supported by Drill yet,
> > DRILL-1755 <https://jira.apache.org/jira/browse/DRILL-1755>, but I find it
> > hard to believe there is no workaround or solution since this was reported
> > 4 years back. Does anyone have a solution or workaround to this problem?
> >
> > Thanks,
> > Scott
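[Editor's note] David Lee's suggestion in the thread above, dropping the enclosing brackets and turning the comma-separated array elements into one object per line, amounts to converting array-format JSON into newline-delimited JSON (JSONL), which is the format Drill reads record by record. A minimal sketch of that conversion (not a Drill-provided tool; the function name is made up for illustration):

```python
import json

def array_to_ndjson(text: str) -> str:
    """Convert an array-format JSON document ([{...},{...}]) to
    newline-delimited JSON: one object per line."""
    records = json.loads(text)  # parses the whole top-level array into memory
    return "".join(json.dumps(r) + "\n" for r in records)

# The two-record document from the thread:
doc = '[ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]'
print(array_to_ndjson(doc), end="")
# {"var1": "foo", "var2": "bar"}
# {"var1": "fo", "var2": "baz"}
```

Note that `json.loads` holds the entire array in memory at once, the very "parsing a giant string" overhead David mentions; converting a file too large to fit in memory would need a streaming parser instead.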

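[Editor's note] At the top of the thread, `wc -l` reports 7226 lines while Drill's count(*) returns 7227. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the final record lacks a trailing newline: `wc -l` counts newline characters, so an unterminated last line is not counted. A small sketch for counting JSON records independently of newline conventions (path and function name are hypothetical):

```python
import json

def count_jsonl_records(path: str) -> int:
    """Count non-blank JSON records in a JSONL file. Unlike `wc -l`,
    this counts a final record even if it has no trailing newline."""
    count = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:                 # skip blank lines
                json.loads(line)     # raises ValueError on a malformed record
                count += 1
    return count

# Usage (path hypothetical):
# print(count_jsonl_records("/u1/my_login/test.jsonl"))
```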