Re: Filter out empty arrays in JSON

Andries Engelbrecht Mon, 26 Jan 2015 15:28:49 -0800

Unfortunately it seems that with larger data sets the use of flatten seems to 
produce an error.


Any other options to filter out JSON records (objects) where an array is empty?

Thanks
—Andries


On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <[email protected]> 
wrote:

> It was interesting to see flatten bypass the records with an empty array.
> 
> Unfortunately some arrays are much more complex than the example here, and it 
> is still useful to have the ability to filter out ones that are empty.
> 
> —Andries
> 
> On Jan 21, 2015, at 5:04 PM, Hao Zhu <[email protected]> wrote:
> 
>> I figured out the differences after getting the json file from Andries.
>> 
>> My json file is like:
>> {“entities”:{xxx},“entities”:{yyy}... }
>> 
>> Andries' json file is like:
>> {“entities”:{xxx}}
>> {“entities”:{yyy}}
>> ...
>> 
>> So basically Andries' json file is contains multiple json files.
>> 
>> Fortunately, we can use the same SQL to get the same results:
>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | []         |
>> | [{"text":"GoPatriots"}] |
>> | [{"text":"aaa"}] |
>> | []         |
>> | [{"text":"bbb"}] |
>> +------------+
>> 5 rows selected (0.136 seconds)
>> 0: jdbc:drill:> with tmp as
>> . . . . . . . > (
>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>> dfs.tmp.`z2.json` t
>> . . . . . . . > )
>> . . . . . . . > select tmp.c.text from tmp;
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | GoPatriots |
>> | aaa        |
>> | bbb        |
>> +------------+
>> 3 rows selected (0.115 seconds)
>> 
>> Thanks,
>> Hao
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
>> [email protected]> wrote:
>> 
>>> In my case it returns the empty records when flatten is not used.
>>> 
>>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
>>> from `twitter.json` t limit 10;
>>> +------------+
>>> |  hashtags  |
>>> +------------+
>>> | []         |
>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>> | []         |
>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>> | []         |
>>> | []         |
>>> | []         |
>>> | []         |
>>> | []         |
>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>> +------------+
>>> 
>>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <[email protected]> wrote:
>>> 
>>>> Actually not due to flatten, if you directly query the file, it will only
>>>> show the non-null values.
>>>> 
>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
>>>> +------------+
>>>> |   EXPR$0   |
>>>> +------------+
>>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
>>>> +------------+
>>>> 1 row selected (0.123 seconds)
>>>> 
>>>> Thanks,
>>>> Hao
>>>> 
>>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
>>>> [email protected]> wrote:
>>>> 
>>>>> Very interesting, flatten seems to bypass empty records. Not sure if
>>> that
>>>>> is an ideal result for all use cases, but certainly usable in this case.
>>>>> 
>>>>> Thanks
>>>>> —Andries
>>>>> 
>>>>> 
>>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <[email protected]> wrote:
>>>>> 
>>>>>> I can also fetch non-null values for attached json file which contains
>>> 6
>>>>> "entities", 3 of them are null, 3 of them have "text" value.
>>>>>> 
>>>>>> Could you share your complete "twitter.json"?
>>>>>> 
>>>>>> 0: jdbc:drill:> with tmp as
>>>>>> . . . . . . . > (
>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>> dfs.tmp.`z.json` t
>>>>>> . . . . . . . > )
>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | GoPatriots |
>>>>>> | aaa        |
>>>>>> | bbb        |
>>>>>> +------------+
>>>>>> 3 rows selected (0.122 seconds)
>>>>>> 
>>>>>> Thanks,
>>>>>> Hao
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>>>>> [email protected]> wrote:
>>>>>> The sample data I posted only has 1 element, but some records have
>>>>> multiple elements in them.
>>>>>> 
>>>>>> Interestingly enough though
>>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
>>>>>> 
>>>>>> Produces
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | null       |
>>>>>> | SportsNews |
>>>>>> | null       |
>>>>>> | SportsNews |
>>>>>> | null       |
>>>>>> | null       |
>>>>>> | null       |
>>>>>> | null       |
>>>>>> | null       |
>>>>>> | CARvsSEA   |
>>>>>> +——————+
>>>>>> 
>>>>>> 
>>>>>> And
>>>>>> 
>>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>>>>> 
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | {"indices":[]} |
>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"indices":[]} |
>>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>>>>> +——————+
>>>>>> 
>>>>>> Strange part is that there is no indices map in the hashtags array, so
>>>>> no idea why it shows up when pointing to the first lament in an empty
>>> array.
>>>>>> 
>>>>>> 
>>>>>> —Andries
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <[email protected]> wrote:
>>>>>> 
>>>>>>> I just noticed that the result of "hashtags" is just an array with
>>>>> only 1
>>>>>>> element.
>>>>>>> So take your example:
>>>>>>> [root@maprdemo tmp]# cat d.json
>>>>>>> {
>>>>>>> "entities": {
>>>>>>> "trends": [],
>>>>>>> "symbols": [],
>>>>>>> "urls": [],
>>>>>>> "hashtags": [],
>>>>>>> "user_mentions": []
>>>>>>> },
>>>>>>> "entities": {
>>>>>>> "trends": [1,2,3],
>>>>>>> "symbols": [4,5,6],
>>>>>>> "urls": [7,8,9],
>>>>>>> "hashtags": [
>>>>>>>  {
>>>>>>>    "text": "GoPatriots",
>>>>>>>    "indices": []
>>>>>>>  }
>>>>>>> ],
>>>>>>> "user_mentions": []
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> Now we can do this to achieve the results:
>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.09 seconds)
>>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>>>>> dfs.tmp.`d.json` t ;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | GoPatriots |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.108 seconds)
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Hao
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>>> When I run the query on a larger dataset it actually show the empty
>>>>>>>> records.
>>>>>>>> 
>>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>>>>> 
>>>>>>>> +------------+
>>>>>>>> |   EXPR$0   |
>>>>>>>> +------------+
>>>>>>>> | []         |
>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>> | []         |
>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>>> +------------+
>>>>>>>> 10 rows selected (2.899 seconds)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> However having the output as maps is not very useful, unless i can
>>>>> filter
>>>>>>>> out the records with empty arrays and then drill deeper into the ones
>>>>> with
>>>>>>>> data in the arrays.
>>>>>>>> 
>>>>>>>> BTW: Hao I would have expiated your query to return both rows, one
>>>>> with an
>>>>>>>> empty array as above and the other with the array data.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> —Andries
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> I am not sure if below is expected behavior.
>>>>>>>>> If we only select "hashtags", and it will return only 1 row ignoring
>>>>> the
>>>>>>>>> null value.
>>>>>>>>> However then if we try to get "hashtags.text", it fails...which
>>>>> means it
>>>>>>>> is
>>>>>>>>> still trying to read the NULL value.
>>>>>>>>> I am thinking it may confuse the SQL developers.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>>>>>>> +------------+
>>>>>>>>> |   EXPR$0   |
>>>>>>>>> +------------+
>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>> +------------+
>>>>>>>>> 1 row selected (0.109 seconds)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>> 
>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>>>>> cast to
>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>> query.
>>>>>>>>> (state=,code=0)
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Hao
>>>>>>>>> 
>>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Now try on hashtags with the following:
>>>>>>>>>> 
>>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>>>>> `/twitter.json` t
>>>>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>>>>> 
>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>>>>> cast to
>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>> query.
>>>>>>>>>> (state=,code=0)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> {
>>>>>>>>>> "entities": {
>>>>>>>>>> "trends": [],
>>>>>>>>>> "symbols": [],
>>>>>>>>>> "urls": [],
>>>>>>>>>> "hashtags": [],
>>>>>>>>>> "user_mentions": []
>>>>>>>>>> },
>>>>>>>>>> "entities": {
>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>> "hashtags": [
>>>>>>>>>> {
>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>   "indices": []
>>>>>>>>>> }
>>>>>>>>>> ],
>>>>>>>>>> "user_mentions": []
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> The issue seems to be that if some records have arrays with maps in
>>>>> them
>>>>>>>>>> and others are empty.
>>>>>>>>>> 
>>>>>>>>>> —Andries
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Seems it works for below json file:
>>>>>>>>>>> {
>>>>>>>>>>> "entities": {
>>>>>>>>>>> "trends": [],
>>>>>>>>>>> "symbols": [],
>>>>>>>>>>> "urls": [],
>>>>>>>>>>> "hashtags": [
>>>>>>>>>>> {
>>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>>   "indices": [
>>>>>>>>>>>     83,
>>>>>>>>>>>     94
>>>>>>>>>>>   ]
>>>>>>>>>>> }
>>>>>>>>>>> ],
>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>> },
>>>>>>>>>>> "entities": {
>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>> "hashtags": [
>>>>>>>>>>> {
>>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>>   "indices": []
>>>>>>>>>>> }
>>>>>>>>>>> ],
>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>>>>> where
>>>>>>>>>>> t.entities.urls is not null;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | [7,8,9]    |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>>>>> where
>>>>>>>>>>> t.entities.urls is null;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> +------------+
>>>>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Hao
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <[email protected]>
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I believe that this works if the array contains homogeneous
>>>>> primitive
>>>>>>>>>>>> types. In your example, it appears from the error, the array
>>> field
>>>>>>>>>> 'member'
>>>>>>>>>>>> contained maps for at least one record.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>>>>> [email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
>>> materialize
>>>>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
>>>>>>>>>> --UNKNOWN
>>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>>>>> bullseye-3:31010 ]
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>>>>> query.
>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Chris Matta
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> 215-701-3146
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
>>> [email protected]
>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> How do you filter out records with an empty array in drill?
>>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
>>>>> with
>>>>>>>> data
>>>>>>>>>>>> in
>>>>>>>>>>>>>>> it. When trying to read records with data in the array drill
>>>>> fails
>>>>>>>>>> due
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> records missing any data in the array. Trying a filter with/*
>>>>> where
>>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is
>>>>> not
>>>>>>>>>>>> null.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as
>>> an
>>>>>>>>>>>> example
>>>>>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
>>> and
>>>>>>>>>> others
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>   "text": "GoPatriots",
>>>>>>>>>>>>>>>   "indices": [
>>>>>>>>>>>>>>>     83,
>>>>>>>>>>>>>>>     94
>>>>>>>>>>>>>>>   ]
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> <z.json>
>>>>> 
>>>>> 
>>> 
>>> 
>

Re: Filter out empty arrays in JSON

Reply via email to