As Aditya commented before this will work if the lists only contain scalars

repeated_count('entities.urls') > 0

If the lists contain maps unfortunately this is not available today. There
is an enhancement request open for this feature. I have marked it for a fix
in 0.9 as it is more of a feature request than a bug and we are working on
closing a large number of bugs for 0.8 before we get to issues like this.

https://issues.apache.org/jira/browse/DRILL-1650

-Jason

On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
[email protected]> wrote:

> Unfortunately it seems that with larger data sets the use of flatten seems
> to produce an error.
>
> Any other options to filter out JSON records (objects) where an array is
> empty?
>
> Thanks
> —Andries
>
>
> On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
> [email protected]> wrote:
>
> > It was interesting to see flatten bypass the records with an empty array.
> >
> > Unfortunately some arrays are much more complex than the example here,
> and it is still useful to have the ability to filter out ones that are
> empty.
> >
> > —Andries
> >
> > On Jan 21, 2015, at 5:04 PM, Hao Zhu <[email protected]> wrote:
> >
> >> I figured out the differences after getting the json file from Andries.
> >>
> >> My json file is like:
> >> {“entities”:{xxx},“entities”:{yyy}... }
> >>
> >> Andries' json file is like:
> >> {“entities”:{xxx}}
> >> {“entities”:{yyy}}
> >> ...
> >>
> >> So basically Andries' json file is contains multiple json files.
> >>
> >> Fortunately, we can use the same SQL to get the same results:
> >> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
> >> +------------+
> >> |   EXPR$0   |
> >> +------------+
> >> | []         |
> >> | [{"text":"GoPatriots"}] |
> >> | [{"text":"aaa"}] |
> >> | []         |
> >> | [{"text":"bbb"}] |
> >> +------------+
> >> 5 rows selected (0.136 seconds)
> >> 0: jdbc:drill:> with tmp as
> >> . . . . . . . > (
> >> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >> dfs.tmp.`z2.json` t
> >> . . . . . . . > )
> >> . . . . . . . > select tmp.c.text from tmp;
> >> +------------+
> >> |   EXPR$0   |
> >> +------------+
> >> | GoPatriots |
> >> | aaa        |
> >> | bbb        |
> >> +------------+
> >> 3 rows selected (0.115 seconds)
> >>
> >> Thanks,
> >> Hao
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
> >> [email protected]> wrote:
> >>
> >>> In my case it returns the empty records when flatten is not used.
> >>>
> >>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags
> >>> from `twitter.json` t limit 10;
> >>> +------------+
> >>> |  hashtags  |
> >>> +------------+
> >>> | []         |
> >>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>> | []         |
> >>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>> | []         |
> >>> | []         |
> >>> | []         |
> >>> | []         |
> >>> | []         |
> >>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>> +------------+
> >>>
> >>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <[email protected]> wrote:
> >>>
> >>>> Actually not due to flatten, if you directly query the file, it will
> only
> >>>> show the non-null values.
> >>>>
> >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> >>>> +------------+
> >>>> |   EXPR$0   |
> >>>> +------------+
> >>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> >>>> +------------+
> >>>> 1 row selected (0.123 seconds)
> >>>>
> >>>> Thanks,
> >>>> Hao
> >>>>
> >>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> Very interesting, flatten seems to bypass empty records. Not sure if
> >>> that
> >>>>> is an ideal result for all use cases, but certainly usable in this
> case.
> >>>>>
> >>>>> Thanks
> >>>>> —Andries
> >>>>>
> >>>>>
> >>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <[email protected]> wrote:
> >>>>>
> >>>>>> I can also fetch non-null values for attached json file which
> contains
> >>> 6
> >>>>> "entities", 3 of them are null, 3 of them have "text" value.
> >>>>>>
> >>>>>> Could you share your complete "twitter.json"?
> >>>>>>
> >>>>>> 0: jdbc:drill:> with tmp as
> >>>>>> . . . . . . . > (
> >>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
> >>>>> dfs.tmp.`z.json` t
> >>>>>> . . . . . . . > )
> >>>>>> . . . . . . . > select tmp.c.text from tmp;
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | GoPatriots |
> >>>>>> | aaa        |
> >>>>>> | bbb        |
> >>>>>> +------------+
> >>>>>> 3 rows selected (0.122 seconds)
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hao
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
> >>>>> [email protected]> wrote:
> >>>>>> The sample data I posted only has 1 element, but some records have
> >>>>> multiple elements in them.
> >>>>>>
> >>>>>> Interestingly enough though
> >>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
> >>>>>>
> >>>>>> Produces
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | null       |
> >>>>>> | SportsNews |
> >>>>>> | null       |
> >>>>>> | SportsNews |
> >>>>>> | null       |
> >>>>>> | null       |
> >>>>>> | null       |
> >>>>>> | null       |
> >>>>>> | null       |
> >>>>>> | CARvsSEA   |
> >>>>>> +——————+
> >>>>>>
> >>>>>>
> >>>>>> And
> >>>>>>
> >>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
> >>>>>>
> >>>>>> +------------+
> >>>>>> |   EXPR$0   |
> >>>>>> +------------+
> >>>>>> | {"indices":[]} |
> >>>>>> | {"text":"SportsNews","indices":[90,99]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"text":"SportsNews","indices":[90,99]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"indices":[]} |
> >>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
> >>>>>> +——————+
> >>>>>>
> >>>>>> Strange part is that there is no indices map in the hashtags array,
> so
> >>>>> no idea why it shows up when pointing to the first lament in an empty
> >>> array.
> >>>>>>
> >>>>>>
> >>>>>> —Andries
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <[email protected]> wrote:
> >>>>>>
> >>>>>>> I just noticed that the result of "hashtags" is just an array with
> >>>>> only 1
> >>>>>>> element.
> >>>>>>> So take your example:
> >>>>>>> [root@maprdemo tmp]# cat d.json
> >>>>>>> {
> >>>>>>> "entities": {
> >>>>>>> "trends": [],
> >>>>>>> "symbols": [],
> >>>>>>> "urls": [],
> >>>>>>> "hashtags": [],
> >>>>>>> "user_mentions": []
> >>>>>>> },
> >>>>>>> "entities": {
> >>>>>>> "trends": [1,2,3],
> >>>>>>> "symbols": [4,5,6],
> >>>>>>> "urls": [7,8,9],
> >>>>>>> "hashtags": [
> >>>>>>>  {
> >>>>>>>    "text": "GoPatriots",
> >>>>>>>    "indices": []
> >>>>>>>  }
> >>>>>>> ],
> >>>>>>> "user_mentions": []
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> Now we can do this to achieve the results:
> >>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t
> ;
> >>>>>>> +------------+
> >>>>>>> |   EXPR$0   |
> >>>>>>> +------------+
> >>>>>>> | [{"text":"GoPatriots"}] |
> >>>>>>> +------------+
> >>>>>>> 1 row selected (0.09 seconds)
> >>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
> >>>>> dfs.tmp.`d.json` t ;
> >>>>>>> +------------+
> >>>>>>> |   EXPR$0   |
> >>>>>>> +------------+
> >>>>>>> | GoPatriots |
> >>>>>>> +------------+
> >>>>>>> 1 row selected (0.108 seconds)
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Hao
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> When I run the query on a larger dataset it actually show the
> empty
> >>>>>>>> records.
> >>>>>>>>
> >>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
> >>>>>>>>
> >>>>>>>> +------------+
> >>>>>>>> |   EXPR$0   |
> >>>>>>>> +------------+
> >>>>>>>> | []         |
> >>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>>> | []         |
> >>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
> >>>>>>>> | []         |
> >>>>>>>> | []         |
> >>>>>>>> | []         |
> >>>>>>>> | []         |
> >>>>>>>> | []         |
> >>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
> >>>>>>>> +------------+
> >>>>>>>> 10 rows selected (2.899 seconds)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> However having the output as maps is not very useful, unless i can
> >>>>> filter
> >>>>>>>> out the records with empty arrays and then drill deeper into the
> ones
> >>>>> with
> >>>>>>>> data in the arrays.
> >>>>>>>>
> >>>>>>>> BTW: Hao I would have expiated your query to return both rows, one
> >>>>> with an
> >>>>>>>> empty array as above and the other with the array data.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> —Andries
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> I am not sure if below is expected behavior.
> >>>>>>>>> If we only select "hashtags", and it will return only 1 row
> ignoring
> >>>>> the
> >>>>>>>>> null value.
> >>>>>>>>> However then if we try to get "hashtags.text", it fails...which
> >>>>> means it
> >>>>>>>> is
> >>>>>>>>> still trying to read the NULL value.
> >>>>>>>>> I am thinking it may confuse the SQL developers.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
> t ;
> >>>>>>>>> +------------+
> >>>>>>>>> |   EXPR$0   |
> >>>>>>>>> +------------+
> >>>>>>>>> | [{"text":"GoPatriots"}] |
> >>>>>>>>> +------------+
> >>>>>>>>> 1 row selected (0.109 seconds)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
> >>>>> dfs.tmp.`d.json` t ;
> >>>>>>>>>
> >>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> >>>>> cast to
> >>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Error: exception while executing query: Failure while executing
> >>>>> query.
> >>>>>>>>> (state=,code=0)
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Hao
> >>>>>>>>>
> >>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
> >>>>>>>>> [email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Now try on hashtags with the following:
> >>>>>>>>>>
> >>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
> >>>>> `/twitter.json` t
> >>>>>>>>>> where t.entities.hashtags is not null limit 10;
> >>>>>>>>>>
> >>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
> >>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
> >>>>> cast to
> >>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
> >>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Error: exception while executing query: Failure while executing
> >>>>> query.
> >>>>>>>>>> (state=,code=0)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> {
> >>>>>>>>>> "entities": {
> >>>>>>>>>> "trends": [],
> >>>>>>>>>> "symbols": [],
> >>>>>>>>>> "urls": [],
> >>>>>>>>>> "hashtags": [],
> >>>>>>>>>> "user_mentions": []
> >>>>>>>>>> },
> >>>>>>>>>> "entities": {
> >>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>> "hashtags": [
> >>>>>>>>>> {
> >>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>   "indices": []
> >>>>>>>>>> }
> >>>>>>>>>> ],
> >>>>>>>>>> "user_mentions": []
> >>>>>>>>>> }
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> The issue seems to be that if some records have arrays with
> maps in
> >>>>> them
> >>>>>>>>>> and others are empty.
> >>>>>>>>>>
> >>>>>>>>>> —Andries
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Seems it works for below json file:
> >>>>>>>>>>> {
> >>>>>>>>>>> "entities": {
> >>>>>>>>>>> "trends": [],
> >>>>>>>>>>> "symbols": [],
> >>>>>>>>>>> "urls": [],
> >>>>>>>>>>> "hashtags": [
> >>>>>>>>>>> {
> >>>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>>   "indices": [
> >>>>>>>>>>>     83,
> >>>>>>>>>>>     94
> >>>>>>>>>>>   ]
> >>>>>>>>>>> }
> >>>>>>>>>>> ],
> >>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>> },
> >>>>>>>>>>> "entities": {
> >>>>>>>>>>> "trends": [1,2,3],
> >>>>>>>>>>> "symbols": [4,5,6],
> >>>>>>>>>>> "urls": [7,8,9],
> >>>>>>>>>>> "hashtags": [
> >>>>>>>>>>> {
> >>>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>>   "indices": []
> >>>>>>>>>>> }
> >>>>>>>>>>> ],
> >>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>> }
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> as t
> >>>>> where
> >>>>>>>>>>> t.entities.urls is not null;
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> | [7,8,9]    |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> 1 row selected (0.139 seconds)
> >>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
> as t
> >>>>> where
> >>>>>>>>>>> t.entities.urls is null;
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> |   EXPR$0   |
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> +------------+
> >>>>>>>>>>> No rows selected (0.158 seconds)
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Hao
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I believe that this works if the array contains homogeneous
> >>>>> primitive
> >>>>>>>>>>>> types. In your example, it appears from the error, the array
> >>> field
> >>>>>>>>>> 'member'
> >>>>>>>>>>>> contained maps for at least one record.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
> >>>>> [email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
> >>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
> >>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
> >>> materialize
> >>>>>>>>>>>> incoming schema.  Errors:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
> >>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
> expression:
> >>>>>>>>>> --UNKNOWN
> >>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
> >>>>>>>>>> bullseye-3:31010 ]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Error: exception while executing query: Failure while
> executing
> >>>>>>>> query.
> >>>>>>>>>>>> (state=,code=0)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ​
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Chris Matta
> >>>>>>>>>>>>> [email protected]
> >>>>>>>>>>>>> 215-701-3146
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
> >>> [email protected]
> >>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> repeated_count('entities.urls') > 0
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
> >>>>>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> How do you filter out records with an empty array in drill?
> >>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
> >>>>> with
> >>>>>>>> data
> >>>>>>>>>>>> in
> >>>>>>>>>>>>>>> it. When trying to read records with data in the array
> drill
> >>>>> fails
> >>>>>>>>>> due
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
> with/*
> >>>>> where
> >>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url
> is
> >>>>> not
> >>>>>>>>>>>> null.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data
> as
> >>> an
> >>>>>>>>>>>> example
> >>>>>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]
> >>> and
> >>>>>>>>>> others
> >>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>> look similar to what is listed below.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> "entities": {
> >>>>>>>>>>>>>>> "trends": [],
> >>>>>>>>>>>>>>> "symbols": [],
> >>>>>>>>>>>>>>> "urls": [],
> >>>>>>>>>>>>>>> "hashtags": [
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>   "text": "GoPatriots",
> >>>>>>>>>>>>>>>   "indices": [
> >>>>>>>>>>>>>>>     83,
> >>>>>>>>>>>>>>>     94
> >>>>>>>>>>>>>>>   ]
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>> ],
> >>>>>>>>>>>>>>> "user_mentions": []
> >>>>>>>>>>>>>>> },
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>> —Andries
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> <z.json>
> >>>>>
> >>>>>
> >>>
> >>>
> >
>
>

Reply via email to