RE: Drill Query Engine for nested json or parquet

Lee, David Wed, 05 Jul 2017 17:50:13 -0700

Unfortunately, the files (A) contain licensed data and (B) are 403,208,254 
bytes of gzipped JSON per file, each containing roughly 6,000 JSON records 
with a mix of ~4,000 nested arrays and objects (up to 7 levels deep).


Running Drill SQL on a sample of individual JSON keys works, so it's probably 
some particular JSON key value out of the ~4,000 that is causing the IOBE. The 
files that are failing are from Africa, Asia and Australia, so I have a feeling 
that some regional UTF-8 / Unicode text values are the problem.
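
One way to test that hunch outside Drill would be to scan the records for 
string values containing non-ASCII characters and group the hits by key path. 
The following is only a minimal Python sketch: the function names are made up 
for illustration, and load_gzipped_json_array assumes each file holds a single 
top-level JSON array, which may not match the real layout.

```python
import gzip
import json
from collections import Counter


def non_ascii_paths(value, path=""):
    """Yield the key path of every string under `value` that has non-ASCII characters."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from non_ascii_paths(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for child in value:
            # Array positions are collapsed to [] so hits group by key, not by index.
            yield from non_ascii_paths(child, path + "[]")
    elif isinstance(value, str) and not value.isascii():
        yield path


def count_non_ascii_keys(records):
    """Count, per key path, how many string values contain non-ASCII characters."""
    counts = Counter()
    for record in records:
        counts.update(non_ascii_paths(record))
    return counts


def load_gzipped_json_array(filename):
    """Assumption: the file holds one top-level JSON array of records."""
    with gzip.open(filename, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Comparing the counts from a failing file against those from a file that 
converts cleanly would show which key paths carry non-ASCII text only in the 
failing files. That narrows the candidates; it does not prove that any of them 
triggers the IOBE.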

-----Original Message-----
From: Jinfeng Ni [mailto:[email protected]] 
Sent: Monday, July 03, 2017 3:39 PM
To: user <[email protected]>
Subject: Re: Drill Query Engine for nested json or parquet

This IOBE clearly indicates a code bug. From the error message, it seems to 
be similar to DRILL-1743, which was marked as fixed. Could you please open a 
JIRA, and post the query / sample data you used (if it does not contain 
sensitive info)? Having a reproduction for this would make it easier for 
people to take a look. Thanks.


1. https://issues.apache.org/jira/browse/DRILL-1743

On Mon, Jul 3, 2017 at 10:07 AM, Lee, David <[email protected]> wrote:

> OK, thanks, but now I have a new odd problem which I haven't been able 
> to debug.
>
> SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4
> (expected: range(0, 16384))
>
> After unwrapping the JSON records from a nested dictionary, 90% of my 
> JSON files convert successfully to parquet, but 10% are getting the error 
> above.
>
> I've tried to narrow down which rows in my JSON file are problematic, 
> but it is both consistent and random. Removing 1000s of records here and 
> there fixes the file, but it must be a combination of multiple records 
> causing the error. This error never showed up when I converted the 
> wrapped JSON to parquet.
>
> Any idea how to get to the root of this one?
>
> -----Original Message-----
> From: Jinfeng Ni [mailto:[email protected]]
> Sent: Wednesday, June 28, 2017 11:42 AM
> To: user <[email protected]>
> Subject: Re: Drill Query Engine for nested json or parquet
>
> You are right that Q2 first has to flatten the field "details"
> before applying the filter on top of the flattened rows.  I do not think 
> Drill currently supports pushing the filter down.
>
> The difference between Q1 and Q2:  Q1 is querying a 4-row table, while 
> Q2 is querying a 1-row table; the flatten operator produces 4 rows on 
> top of which the filter is applied.
>
>
> On Wed, Jun 28, 2017 at 9:51 AM, Lee, David <[email protected]>
> wrote:
>
> >
> > Is there a fundamental difference between the following queries? I 
> > can't get the second example working with parquet files which 
> > contain 400,000+ nested records.
> >
> > It seems like the system wants to flatten every possible record 
> > before applying the SQL WHERE clause to the flattened data structure.
> >
> > Example 1:
> >
> > select b.* from dfs.`test1.json` b where b.item  = 3
> >
> > [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> > ]
> >
> > Example 2:
> >
> > select b.* from
> > (select flatten(a.details) as details from dfs.`test2.json` a) b 
> > where b.details.item  = 3
> >
> > {
> >   "header": "my_header_info",
> >   "details": [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> >   ]
> > }
> >
>
