Re: Filter out empty arrays in JSON

Hao Zhu Wed, 21 Jan 2015 17:06:05 -0800

I figured out the differences after getting the json file from Andries.

My json file is like:
{“entities”:{xxx},“entities”:{yyy}... }


Andries' json file is like:
{“entities”:{xxx}}
{“entities”:{yyy}}
...

So basically Andries' json file is contains multiple json files.

Fortunately, we can use the same SQL to get the same results:
0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
+------------+
|   EXPR$0   |
+------------+
| []         |
| [{"text":"GoPatriots"}] |
| [{"text":"aaa"}] |
| []         |
| [{"text":"bbb"}] |
+------------+
5 rows selected (0.136 seconds)
0: jdbc:drill:> with tmp as
. . . . . . . > (
. . . . . . . > select flatten(t.entities.hashtags) as c from
dfs.tmp.`z2.json` t
. . . . . . . > )
. . . . . . . > select tmp.c.text from tmp;
+------------+
|   EXPR$0   |
+------------+
| GoPatriots |
| aaa        |
| bbb        |
+------------+
3 rows selected (0.115 seconds)

Thanks,
Hao






On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
[email protected]> wrote:

> In my case it returns the empty records when flatten is not used.
>
> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
> from `twitter.json` t limit 10;
> +------------+
> |  hashtags  |
> +------------+
> | []         |
> | [{"text":"SportsNews","indices":[0,11]}] |
> | []         |
> | [{"text":"SportsNews","indices":[0,11]}] |
> | []         |
> | []         |
> | []         |
> | []         |
> | []         |
> | [{"text":"CARvsSEA","indices":[36,45]}] |
> +------------+
>
> On Jan 21, 2015, at 4:26 PM, Hao Zhu <[email protected]> wrote:
>
> > Actually not due to flatten, if you directly query the file, it will only
> > show the non-null values.
> >
> > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> > +------------+
> > 1 row selected (0.123 seconds)
> >
> > Thanks,
> > Hao
> >
> > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> > [email protected]> wrote:
> >
> >> Very interesting, flatten seems to bypass empty records. Not sure if
> that
> >> is an ideal result for all use cases, but certainly usable in this case.
> >>
> >> Thanks
> >> —Andries
> >>
> >>
> >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <[email protected]> wrote:
> >>
> >>> I can also fetch non-null values for attached json file which contains
> 6
> >> "entities", 3 of them are null, 3 of them have "text" value.
> >>>
> >>> Could you share your complete "twitter.json"?
> >>>
> >>> 0: jdbc:drill:> with tmp as
> >>> . . . . . . . > (
> >>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >> dfs.tmp.`z.json` t
> >>> . . . . . . . > )
> >>> . . . . . . . > select tmp.c.text from tmp;
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | GoPatriots |
> >>> | aaa        |
> >>> | bbb        |
> >>> +------------+
> >>> 3 rows selected (0.122 seconds)
> >>>
> >>> Thanks,
> >>> Hao
> >>>
> >>>
> >>>
> >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> >> [email protected]> wrote:
> >>> The sample data I posted only has 1 element, but some records have
> >> multiple elements in them.
> >>>
> >>> Interestingly enough though
> >>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
> >>>
> >>> Produces
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | null       |
> >>> | SportsNews |
> >>> | null       |
> >>> | SportsNews |
> >>> | null       |
> >>> | null       |
> >>> | null       |
> >>> | null       |
> >>> | null       |
> >>> | CARvsSEA   |
> >>> +——————+
> >>>
> >>>
> >>> And
> >>>
> >>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> >>>
> >>> +------------+
> >>> |   EXPR$0   |
> >>> +------------+
> >>> | {"indices":[]} |
> >>> | {"text":"SportsNews","indices":[90,99]} |
> >>> | {"indices":[]} |
> >>> | {"text":"SportsNews","indices":[90,99]} |
> >>> | {"indices":[]} |
> >>> | {"indices":[]} |
> >>> | {"indices":[]} |
> >>> | {"indices":[]} |
> >>> | {"indices":[]} |
> >>> | {"text":"CARvsSEA","indices":[90,99]} |
> >>> +——————+
> >>>
> >>> Strange part is that there is no indices map in the hashtags array, so
> >> no idea why it shows up when pointing to the first lament in an empty
> array.
> >>>
> >>>
> >>> —Andries
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <[email protected]> wrote:
> >>>
> >>>> I just noticed that the result of "hashtags" is just an array with
> >> only 1
> >>>> element.
> >>>> So take your example:
> >>>> [root@maprdemo tmp]# cat d.json
> >>>> {
> >>>> "entities": {
> >>>>  "trends": [],
> >>>>  "symbols": [],
> >>>>  "urls": [],
> >>>>  "hashtags": [],
> >>>>  "user_mentions": []
> >>>> },
> >>>> "entities": {
> >>>>  "trends": [1,2,3],
> >>>>  "symbols": [4,5,6],
> >>>>  "urls": [7,8,9],
> >>>>  "hashtags": [
> >>>>    {
> >>>>      "text": "GoPatriots",
> >>>>      "indices": []
> >>>>    }
> >>>>  ],
> >>>>  "user_mentions": []
> >>>> }
> >>>> }
> >>>>
> >>>> Now we can do this to achieve the results:
> >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> >>>> +------------+
> >>>> |   EXPR$0   |
> >>>> +------------+
> >>>> | [{"text":"GoPatriots"}] |
> >>>> +------------+
> >>>> 1 row selected (0.09 seconds)
> >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> >> dfs.tmp.`d.json` t ;
> >>>> +------------+
> >>>> |   EXPR$0   |
> >>>> +------------+
> >>>> | GoPatriots |
> >>>> +------------+
> >>>> 1 row selected (0.108 seconds)
> >>>>
> >>>> Thanks,
> >>>> Hao
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> When I run the query on a larger dataset it actually show the empty
> >>>>> records.
> >>>>>
> >>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> >>>>>
> >>>>> +------------+
> >>>>> |   EXPR$0   |
> >>>>> +------------+
> >>>>> | []         |
> >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>> | []         |
> >>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>> | []         |
> >>>>> | []         |
> >>>>> | []         |
> >>>>> | []         |
> >>>>> | []         |
> >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>>>> +------------+
> >>>>> 10 rows selected (2.899 seconds)
> >>>>>
> >>>>>
> >>>>>
> >>>>> However having the output as maps is not very useful, unless i can
> >> filter
> >>>>> out the records with empty arrays and then drill deeper into the ones
> >> with
> >>>>> data in the arrays.
> >>>>>
> >>>>> BTW: Hao I would have expiated your query to return both rows, one
> >> with an
> >>>>> empty array as above and the other with the array data.
> >>>>>
> >>>>>
> >>>>> —Andries
> >>>>>
> >>>>>
> >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <[email protected]> wrote:
> >>>>>
> >>>>>> I am not sure if below is expected behavior.
> >>>>>> If we only select "hashtags", and it will return only 1 row ignoring
> >> the
> >>>>>> null value.
> >>>>>> However then if we try to get "hashtags.text", it fails...which
> >> means it
> >>>>> is
> >>>>>> still trying to read the NULL value.
> >>>>>> I am thinking it may confuse the SQL developers.
> >>>>>>
> >>>>>>
> >>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | [{"text":"GoPatriots"}] |
> >>>>>> +------------+
> >>>>>> 1 row selected (0.109 seconds)
> >>>>>>
> >>>>>>
> >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> >> dfs.tmp.`d.json` t ;
> >>>>>>
> >>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> >> cast to
> >>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>
> >>>>>>
> >>>>>> Error: exception while executing query: Failure while executing
> >> query.
> >>>>>> (state=,code=0)
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hao
> >>>>>>
> >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>> Now try on hashtags with the following:
> >>>>>>>
> >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> >> `/twitter.json` t
> >>>>>>> where t.entities.hashtags is not null limit 10;
> >>>>>>>
> >>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> >> cast to
> >>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>
> >>>>>>>
> >>>>>>> Error: exception while executing query: Failure while executing
> >> query.
> >>>>>>> (state=,code=0)
> >>>>>>>
> >>>>>>>
> >>>>>>> {
> >>>>>>> "entities": {
> >>>>>>> "trends": [],
> >>>>>>> "symbols": [],
> >>>>>>> "urls": [],
> >>>>>>> "hashtags": [],
> >>>>>>> "user_mentions": []
> >>>>>>> },
> >>>>>>> "entities": {
> >>>>>>> "trends": [1,2,3],
> >>>>>>> "symbols": [4,5,6],
> >>>>>>> "urls": [7,8,9],
> >>>>>>> "hashtags": [
> >>>>>>>   {
> >>>>>>>     "text": "GoPatriots",
> >>>>>>>     "indices": []
> >>>>>>>   }
> >>>>>>> ],
> >>>>>>> "user_mentions": []
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> The issue seems to be that if some records have arrays with maps in
> >> them
> >>>>>>> and others are empty.
> >>>>>>>
> >>>>>>> —Andries
> >>>>>>>
> >>>>>>>
> >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Seems it works for below json file:
> >>>>>>>> {
> >>>>>>>> "entities": {
> >>>>>>>> "trends": [],
> >>>>>>>> "symbols": [],
> >>>>>>>> "urls": [],
> >>>>>>>> "hashtags": [
> >>>>>>>>   {
> >>>>>>>>     "text": "GoPatriots",
> >>>>>>>>     "indices": [
> >>>>>>>>       83,
> >>>>>>>>       94
> >>>>>>>>     ]
> >>>>>>>>   }
> >>>>>>>> ],
> >>>>>>>> "user_mentions": []
> >>>>>>>> },
> >>>>>>>> "entities": {
> >>>>>>>> "trends": [1,2,3],
> >>>>>>>> "symbols": [4,5,6],
> >>>>>>>> "urls": [7,8,9],
> >>>>>>>> "hashtags": [
> >>>>>>>>   {
> >>>>>>>>     "text": "GoPatriots",
> >>>>>>>>     "indices": []
> >>>>>>>>   }
> >>>>>>>> ],
> >>>>>>>> "user_mentions": []
> >>>>>>>> }
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
> >> where
> >>>>>>>> t.entities.urls is not null;
> >>>>>>>> +------------+
> >>>>>>>> |   EXPR$0   |
> >>>>>>>> +------------+
> >>>>>>>> | [7,8,9]    |
> >>>>>>>> +------------+
> >>>>>>>> 1 row selected (0.139 seconds)
> >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
> >> where
> >>>>>>>> t.entities.urls is null;
> >>>>>>>> +------------+
> >>>>>>>> |   EXPR$0   |
> >>>>>>>> +------------+
> >>>>>>>> +------------+
> >>>>>>>> No rows selected (0.158 seconds)
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Hao
> >>>>>>>>
> >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <[email protected]>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I believe that this works if the array contains homogeneous
> >> primitive
> >>>>>>>>> types. In your example, it appears from the error, the array
> field
> >>>>>>> 'member'
> >>>>>>>>> contained maps for at least one record.
> >>>>>>>>>
> >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> >> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>>>>>>
> >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>>>>>>> Query failed: Query stopped., Failure while trying to
> materialize
> >>>>>>>>> incoming schema.  Errors:
> >>>>>>>>>>
> >>>>>>>>>> Error in expression at index -1.  Error: Missing function
> >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
> >>>>>>> --UNKNOWN
> >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >>>>>>> bullseye-3:31010 ]
> >>>>>>>>>>
> >>>>>>>>>> Error: exception while executing query: Failure while executing
> >>>>> query.
> >>>>>>>>> (state=,code=0)
> >>>>>>>>>>
> >>>>>>>>>> 
> >>>>>>>>>>
> >>>>>>>>>> Chris Matta
> >>>>>>>>>> [email protected]
> >>>>>>>>>> 215-701-3146
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> [email protected]
> >>>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> repeated_count('entities.urls') > 0
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> How do you filter out records with an empty array in drill?
> >>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
> >> with
> >>>>> data
> >>>>>>>>> in
> >>>>>>>>>>>> it. When trying to read records with data in the array drill
> >> fails
> >>>>>>> due
> >>>>>>>>>>> to
> >>>>>>>>>>>> records missing any data in the array. Trying a filter with/*
> >> where
> >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is
> >> not
> >>>>>>>>> null.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as
> an
> >>>>>>>>> example
> >>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
> and
> >>>>>>> others
> >>>>>>>>>>> will
> >>>>>>>>>>>> look similar to what is listed below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> "entities": {
> >>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>   {
> >>>>>>>>>>>>     "text": "GoPatriots",
> >>>>>>>>>>>>     "indices": [
> >>>>>>>>>>>>       83,
> >>>>>>>>>>>>       94
> >>>>>>>>>>>>     ]
> >>>>>>>>>>>>   }
> >>>>>>>>>>>> ],
> >>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>> },
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks
> >>>>>>>>>>>> —Andries
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>> <z.json>
> >>
> >>
>
>

Re: Filter out empty arrays in JSON

Reply via email to