RE: Drill Query Engine for nested json or parquet

Lee, David Mon, 03 Jul 2017 10:07:51 -0700

Ok thanks, but now I got a new odd problem which I haven't been able to debug..

SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4 (expected: 
range(0, 16384))

After unwrapping the JSON records from a nested dictionary, 90% of my JSON 
files convert successfully to parquet, but 10% are getting the error above..

I've tried to narrow down which rows in my JSON file are problematic, but it is 
both consistent and random.. Removing 1000s records here and there fixes the 
file, but it must be a combination of multiple records causing the error. This 
error never showed up when I converted the wrapped JSON to parquet..

Any idea how to get to the root of this one?

-----Original Message-----
From: Jinfeng Ni [mailto:[email protected]] 
Sent: Wednesday, June 28, 2017 11:42 AM
To: user <[email protected]>
Subject: Re: Drill Query Engine for nested json or parquet

You are right that Q2 has to first apply flatten the field "details" before 
applying filter on top of the flattened rows.  I do not think Drill currently 
supports to push filter down.

The difference between Q1 and Q2:  Q1 is querying a 4-row table, while Q2 is 
querying a 1-row table; the flatten operator produces 4 rows on top of which 
the filter is applied.

On Wed, Jun 28, 2017 at 9:51 AM, Lee, David <[email protected]> wrote:

>
> Is there a fundamental difference between the following queries? I 
> can't get the second example working with parquet files which contain 
> 400,000+ nested records..
>
> It seems like the system wants to flatten every possible record before 
> applying the SQL Where clause to the flattened data structure..
>
> Example 1:
>
> select b.* from dfs.`test1.json` b where b.item  = 3
>
> [
>     {
>       "item": 1,
>       "item_name": "name_for_1"
>     },
>     {
>       "item": 2,
>       "item_name": "name_for_2"
>     },
>     {
>       "item": 3,
>       "item_name": "name_for_3"
>     },
>     {
>       "item": 4,
>       "item_name": "name_for_4"
>     }
> ]
>
> Example 2:
>
> select b.* from
> (select flatten(a.details) as details
> from dfs.`test2.json` a) b
> where b.details.item  = 3
>
> {
>   "header": "my_header_info",
>   "details": [
>     {
>       "item": 1,
>       "item_name": "name_for_1"
>     },
>     {
>       "item": 2,
>       "item_name": "name_for_2"
>     },
>     {
>       "item": 3,
>       "item_name": "name_for_3"
>     },
>     {
>       "item": 4,
>       "item_name": "name_for_4"
>     }
>   ]
> }
>
>
>
>
> This message may contain information that is confidential or privileged.
> If you are not the intended recipient, please advise the sender 
> immediately and delete this message. See http://www.blackrock.com/ 
> corporate/en-us/compliance/email-disclaimers for further information.
> Please refer to http://www.blackrock.com/corporate/en-us/compliance/
> privacy-policy for more information about BlackRock’s Privacy Policy.
>
> For a list of BlackRock's office addresses worldwide, see 
> http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
>
> © 2017 BlackRock, Inc. All rights reserved.
>

RE: Drill Query Engine for nested json or parquet

Reply via email to