Paul,
Thank you for the workaround; it worked perfectly in my case!

Scott
On Tue, Aug 28, 2018 at 12:20 PM Lee, David <[email protected]> wrote:

> This is a pretty ugly json file.. 568 megs for 7227 records..
>
> => ls -l test.jsonl
> -rw-r--r-- 1 my_login users 568693075 Aug 28 15:15 test.jsonl
>
> There is one difference, 7226 vs 7227, but that comes from wc:
> wc -l does NOT count the last line of the file if it does not end with a
> newline character.
>
> -----Original Message-----
> From: Lee, David
> Sent: Tuesday, August 28, 2018 12:11 PM
> To: [email protected]
> Subject: RE: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
> from the middle of a record
>
> select count(*) on a jsonl file comes back instantly.
>
> /u1/my_login=> wc -l test.jsonl
> 7226 test.jsonl
>
> select count(*) from dfs.`/u1/my_login/test.jsonl`
>
> EXPR$0
> 7227
>
> Overview
> Operator ID  Type                 Avg Setup  Max Setup  Avg Process  Max Process  Min Wait  Avg Wait  Max Wait  % Fragment  % Query  Rows   Avg Peak Mem  Max Peak Mem
> 00-xx-00     JSON_SUB_SCAN        0.000s     0.000s     1.096s       3.287s       0.000s    0.181s    0.543s    99.58%      99.58%   7,228  24KB          32KB
> 00-xx-01     PROJECT              0.001s     0.001s     0.000s       0.000s       0.000s    0.000s    0.000s    0.00%       0.00%    1      32KB          32KB
> 00-xx-02     STREAMING_AGGREGATE  0.022s     0.022s     0.001s       0.001s       0.000s    0.000s    0.000s    0.04%       0.04%    1      64KB          64KB
> 00-xx-03     STREAMING_AGGREGATE  0.040s     0.040s     0.011s       0.011s       0.000s    0.000s    0.000s    0.34%       0.34%    7,227  48KB          48KB
> 00-xx-04     PROJECT              0.032s     0.032s     0.001s       0.001s       0.000s    0.000s    0.000s    0.04%       0.04%    7,227  16KB          16KB
>
> -----Original Message-----
> From: Paul Rogers [mailto:[email protected]]
> Sent: Tuesday, August 28, 2018 11:23 AM
> To: [email protected]
> Subject: Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
> from the middle of a record
>
> [EXTERNAL EMAIL]
>
> Hi Scott,
>
> Bingo. Just tried this very case with the sample file from the previous
> post. Got exactly the failure in the post you provided.
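[Editor's note: David's point above is easy to verify: `wc -l` counts newline characters, so a final record that lacks a trailing newline is not counted, while a record-oriented reader still sees it. A minimal sketch; the two-record data is made up to mimic his 7226-vs-7227 case:]

```python
# wc-l-style counting vs. record counting for JSONL data whose last
# line has no trailing newline.
data = b'{"var1": "foo", "var2": "bar"}\n{"var1": "fo", "var2": "baz"}'

# wc -l reports the number of newline characters: 1
newline_count = data.count(b"\n")

# A record reader splits on newlines and keeps every non-empty chunk: 2
record_count = len([chunk for chunk in data.split(b"\n") if chunk.strip()])

print(newline_count, record_count)  # 1 2
```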
> I notice that a "select *" query returns immediately, but a "count(*)"
> query hangs for 30+ seconds before it errors out. Mine is only a
> two-record file, so taking 30 seconds to fail is excessive.
>
> Clearly, something is wrong. At the very least, a count(*) should simply
> read all records and discard the data, using exactly the same JSON parser
> as a "SELECT *" query. That Drill is not doing so suggests that perhaps
> the code is trying to be clever to optimize for the "count(*)" case, and
> is doing so incorrectly.
>
> Here is a clunky workaround: just add a WHERE clause that accepts all
> records:
>
> SELECT COUNT(*) FROM `test.json` WHERE 1 = 1;
>
> +---------+
> | EXPR$0  |
> +---------+
> | 2       |
> +---------+
>
> As it turns out, I'm in the (very slow) process of issuing PRs for a
> revised JSON record reader to handle other issues. A side effect of that
> change is that the new implementation uses the same parse path for both
> the "SELECT *" and "SELECT count(*)" cases. So, even if no one can fix
> this bug in the short term, there is a longer-term fix coming.
>
> Thanks,
> - Paul
>
> On Tuesday, August 28, 2018, 8:46:11 AM PDT, scott <
> [email protected]> wrote:
>
> Paul,
> Thanks for prompting the right questions. I went back and took another
> look at my queries. It turns out that some condition causes this error
> when running functions like "count(*)" on the data, where a normal
> unqualified select does not. I also ran across this article from MapR
> that led me to conclude Drill just doesn't support it.
>
> https://mapr.com/support/s/article/Apache-Drill-cannot-read-from-middle-of-a-record?language=en_US
>
> I think if we can confirm exactly which conditions cause the problem, we
> should open a high-priority Jira. What do you think?
>
> On Mon, Aug 27, 2018 at 11:58 PM Paul Rogers <[email protected]>
> wrote:
>
> > Hi Scott,
> >
> > I created a file, "test.json", using the data from your e-mail:
> >
> > [ { "var1": "foo", "var2": "bar"}, {"var1": "fo", "var2": "baz"}]
> >
> > The oldest build I have readily available is Drill 1.13. I ran that as
> > a server, then connected with sqlline as a client. I ran a query:
> >
> > select * from `test.json`;
> >
> > +-------+-------+
> > | var1  | var2  |
> > +-------+-------+
> > | foo   | bar   |
> > | fo    | baz   |
> > +-------+-------+
> >
> > I can try with Drill 1.12, once I find and download it. Or, you can
> > try with Drill 1.14 (the latest release).
> >
> > I do wonder, however, if we are talking about the same thing. My test
> > puts your JSON in a JSON file with a ".json" extension so that Drill
> > chooses the JSON parser. I'm using default JSON (session) options.
> >
> > Is this what you are doing? Or is your JSON coming from some other
> > source? Kafka? A field from a CSV file, say?
> >
> > Thanks,
> > - Paul
> >
> > On Monday, August 27, 2018, 10:31:00 PM PDT, scott <
> > [email protected]> wrote:
> >
> > Paul,
> > I'm using version 1.12. Can you tell me what version you think this
> > was fixed in? The ticket I referenced is still open, with no comments.
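[Editor's note: Paul's two-record sample is valid JSON on its own, which is easy to confirm outside Drill; a quick sketch using Python's standard json module:]

```python
import json

# The exact sample from the thread: a top-level JSON array of objects.
sample = '[ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]'

records = json.loads(sample)

# The array parses cleanly, so a correct count(*) over a file with this
# content should report 2 records.
print(len(records), records[0]["var1"])  # 2 foo
```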
> > Scott
> >
> > On Mon, Aug 27, 2018 at 5:47 PM Paul Rogers <[email protected]>
> > wrote:
> >
> > > Hi David,
> > >
> > > JSON files are never splittable: there is no single-character way to
> > > find the start of a JSON record within a file.
> > >
> > > Drill is supposed to support two JSON formats: the array format from
> > > the earlier post, and the non-JSON (but very common) list-of-objects
> > > format in this example.
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Monday, August 27, 2018, 5:38:32 PM PDT, Lee, David <
> > > [email protected]> wrote:
> > >
> > > Get rid of the opening and closing brackets and see if you can turn
> > > the commas into newlines.. The file needs to be splittable, I think,
> > > to reduce memory overhead vs. parsing a giant string...
> > >
> > > {"var1": "foo", "var2": "bar"}
> > > {"var1": "fo", "var2": "baz"}
> > > {"var1": "f2o", "var2": "baz2"}
> > > {"var1": "f3o", "var2": "baz3"}
> > > {"var1": "f4o", "var2": "baz4"}
> > > {"var1": "f5o", "var2": "baz5"}
> > >
> > > -----Original Message-----
> > > From: scott [mailto:[email protected]]
> > > Sent: Monday, August 27, 2018 4:59 PM
> > > To: [email protected]
> > > Subject: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
> > > from the middle of a record
> > >
> > > Hi All,
> > > I'm getting an error querying some of my json files.
> > > The error I'm getting is: Error: DATA_READ ERROR: Error parsing JSON
> > > - Cannot read from the middle of a record. Current token was
> > > START_ARRAY
> > >
> > > The json files are in array format, like:
> > >
> > > [ { "var1": "foo", "var2": "bar"}, {"var1": "fo", "var2": "baz"}]
> > >
> > > I found a ticket, DRILL-1755
> > > <https://jira.apache.org/jira/browse/DRILL-1755>, that indicates
> > > this format is not yet supported by Drill, but I find it hard to
> > > believe there is no workaround or solution since this was reported
> > > 4 years back. Does anyone have a solution or workaround to this
> > > problem?
> > >
> > > Thanks,
> > > Scott
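[Editor's note: David's suggested conversion (drop the brackets, one object per line) is safer done by parsing the array and re-serializing than by string surgery on commas, since commas also appear inside records. A sketch, assuming the input array fits in memory; for a 568 MB file a streaming parser would be preferable. The function name is made up for illustration:]

```python
import json

def array_to_jsonl(src_path: str, dst_path: str) -> int:
    """Convert a top-level JSON array file to newline-delimited JSON.

    Writes a trailing newline after the last record so that
    line-counting tools such as `wc -l` agree with the record count.
    Returns the number of records written.
    """
    with open(src_path) as src:
        records = json.load(src)
    with open(dst_path, "w") as dst:
        for rec in records:
            dst.write(json.dumps(rec))
            dst.write("\n")
    return len(records)
```

The resulting file matches the line-per-record layout shown in David's example, and each line is independently parseable.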
