Hao, It looks to me like your json file is actually invalid, and it seems like Drill should be throwing an exception.
On Wed, Jan 21, 2015 at 5:04 PM, Hao Zhu <[email protected]> wrote: > I figured out the differences after getting the json file from Andries. > > My json file is like: > {“entities”:{xxx},“entities”:{yyy}... } > > Andries' json file is like: > {“entities”:{xxx}} > {“entities”:{yyy}} > ... > > So basically Andries' json file is contains multiple json files. > > Fortunately, we can use the same SQL to get the same results: > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t; > +------------+ > | EXPR$0 | > +------------+ > | [] | > | [{"text":"GoPatriots"}] | > | [{"text":"aaa"}] | > | [] | > | [{"text":"bbb"}] | > +------------+ > 5 rows selected (0.136 seconds) > 0: jdbc:drill:> with tmp as > . . . . . . . > ( > . . . . . . . > select flatten(t.entities.hashtags) as c from > dfs.tmp.`z2.json` t > . . . . . . . > ) > . . . . . . . > select tmp.c.text from tmp; > +------------+ > | EXPR$0 | > +------------+ > | GoPatriots | > | aaa | > | bbb | > +------------+ > 3 rows selected (0.115 seconds) > > Thanks, > Hao > > > > > > > On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht < > [email protected]> wrote: > > > In my case it returns the empty records when flatten is not used. > > > > 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags > > from `twitter.json` t limit 10; > > +------------+ > > | hashtags | > > +------------+ > > | [] | > > | [{"text":"SportsNews","indices":[0,11]}] | > > | [] | > > | [{"text":"SportsNews","indices":[0,11]}] | > > | [] | > > | [] | > > | [] | > > | [] | > > | [] | > > | [{"text":"CARvsSEA","indices":[36,45]}] | > > +------------+ > > > > On Jan 21, 2015, at 4:26 PM, Hao Zhu <[email protected]> wrote: > > > > > Actually not due to flatten, if you directly query the file, it will > only > > > show the non-null values. > > > > > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t; > > > +------------+ > > > | EXPR$0 | > > > +------------+ > > > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] | > > > +------------+ > > > 1 row selected (0.123 seconds) > > > > > > Thanks, > > > Hao > > > > > > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht < > > > [email protected]> wrote: > > > > > >> Very interesting, flatten seems to bypass empty records. Not sure if > > that > > >> is an ideal result for all use cases, but certainly usable in this > case. > > >> > > >> Thanks > > >> —Andries > > >> > > >> > > >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <[email protected]> wrote: > > >> > > >>> I can also fetch non-null values for attached json file which > contains > > 6 > > >> "entities", 3 of them are null, 3 of them have "text" value. > > >>> > > >>> Could you share your complete "twitter.json"? > > >>> > > >>> 0: jdbc:drill:> with tmp as > > >>> . . . . . . . > ( > > >>> . . . . . . . > select flatten(t.entities.hashtags) as c from > > >> dfs.tmp.`z.json` t > > >>> . . . . . . . > ) > > >>> . . . . . . . > select tmp.c.text from tmp; > > >>> +------------+ > > >>> | EXPR$0 | > > >>> +------------+ > > >>> | GoPatriots | > > >>> | aaa | > > >>> | bbb | > > >>> +------------+ > > >>> 3 rows selected (0.122 seconds) > > >>> > > >>> Thanks, > > >>> Hao > > >>> > > >>> > > >>> > > >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht < > > >> [email protected]> wrote: > > >>> The sample data I posted only has 1 element, but some records have > > >> multiple elements in them. > > >>> > > >>> Interestingly enough though > > >>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10; > > >>> > > >>> Produces > > >>> +------------+ > > >>> | EXPR$0 | > > >>> +------------+ > > >>> | null | > > >>> | SportsNews | > > >>> | null | > > >>> | SportsNews | > > >>> | null | > > >>> | null | > > >>> | null | > > >>> | null | > > >>> | null | > > >>> | CARvsSEA | > > >>> +——————+ > > >>> > > >>> > > >>> And > > >>> > > >>> select t.entities.hashtags[0] from `twitter.json` t limit 10; > > >>> > > >>> +------------+ > > >>> | EXPR$0 | > > >>> +------------+ > > >>> | {"indices":[]} | > > >>> | {"text":"SportsNews","indices":[90,99]} | > > >>> | {"indices":[]} | > > >>> | {"text":"SportsNews","indices":[90,99]} | > > >>> | {"indices":[]} | > > >>> | {"indices":[]} | > > >>> | {"indices":[]} | > > >>> | {"indices":[]} | > > >>> | {"indices":[]} | > > >>> | {"text":"CARvsSEA","indices":[90,99]} | > > >>> +——————+ > > >>> > > >>> Strange part is that there is no indices map in the hashtags array, > so > > >> no idea why it shows up when pointing to the first lament in an empty > > array. > > >>> > > >>> > > >>> —Andries > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <[email protected]> wrote: > > >>> > > >>>> I just noticed that the result of "hashtags" is just an array with > > >> only 1 > > >>>> element. > > >>>> So take your example: > > >>>> [root@maprdemo tmp]# cat d.json > > >>>> { > > >>>> "entities": { > > >>>> "trends": [], > > >>>> "symbols": [], > > >>>> "urls": [], > > >>>> "hashtags": [], > > >>>> "user_mentions": [] > > >>>> }, > > >>>> "entities": { > > >>>> "trends": [1,2,3], > > >>>> "symbols": [4,5,6], > > >>>> "urls": [7,8,9], > > >>>> "hashtags": [ > > >>>> { > > >>>> "text": "GoPatriots", > > >>>> "indices": [] > > >>>> } > > >>>> ], > > >>>> "user_mentions": [] > > >>>> } > > >>>> } > > >>>> > > >>>> Now we can do this to achieve the results: > > >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ; > > >>>> +------------+ > > >>>> | EXPR$0 | > > >>>> +------------+ > > >>>> | [{"text":"GoPatriots"}] | > > >>>> +------------+ > > >>>> 1 row selected (0.09 seconds) > > >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from > > >> dfs.tmp.`d.json` t ; > > >>>> +------------+ > > >>>> | EXPR$0 | > > >>>> +------------+ > > >>>> | GoPatriots | > > >>>> +------------+ > > >>>> 1 row selected (0.108 seconds) > > >>>> > > >>>> Thanks, > > >>>> Hao > > >>>> > > >>>> > > >>>> > > >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht < > > >>>> [email protected]> wrote: > > >>>> > > >>>>> When I run the query on a larger dataset it actually show the empty > > >>>>> records. > > >>>>> > > >>>>> select t.entities.hashtags from `twitter.json` t limit 10; > > >>>>> > > >>>>> +------------+ > > >>>>> | EXPR$0 | > > >>>>> +------------+ > > >>>>> | [] | > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] | > > >>>>> | [] | > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] | > > >>>>> | [] | > > >>>>> | [] | > > >>>>> | [] | > > >>>>> | [] | > > >>>>> | [] | > > >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] | > > >>>>> +------------+ > > >>>>> 10 rows selected (2.899 seconds) > > >>>>> > > >>>>> > > >>>>> > > >>>>> However having the output as maps is not very useful, unless i can > > >> filter > > >>>>> out the records with empty arrays and then drill deeper into the > ones > > >> with > > >>>>> data in the arrays. > > >>>>> > > >>>>> BTW: Hao I would have expiated your query to return both rows, one > > >> with an > > >>>>> empty array as above and the other with the array data. > > >>>>> > > >>>>> > > >>>>> —Andries > > >>>>> > > >>>>> > > >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <[email protected]> wrote: > > >>>>> > > >>>>>> I am not sure if below is expected behavior. > > >>>>>> If we only select "hashtags", and it will return only 1 row > ignoring > > >> the > > >>>>>> null value. > > >>>>>> However then if we try to get "hashtags.text", it fails...which > > >> means it > > >>>>> is > > >>>>>> still trying to read the NULL value. > > >>>>>> I am thinking it may confuse the SQL developers. > > >>>>>> > > >>>>>> > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` > t ; > > >>>>>> +------------+ > > >>>>>> | EXPR$0 | > > >>>>>> +------------+ > > >>>>>> | [{"text":"GoPatriots"}] | > > >>>>>> +------------+ > > >>>>>> 1 row selected (0.109 seconds) > > >>>>>> > > >>>>>> > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from > > >> dfs.tmp.`d.json` t ; > > >>>>>> > > >>>>>> Query failed: Query failed: Failure while running fragment., > > >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be > > >> cast to > > >>>>>> org.apache.drill.exec.vector.complex.MapVector [ > > >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ] > > >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ] > > >>>>>> > > >>>>>> > > >>>>>> Error: exception while executing query: Failure while executing > > >> query. > > >>>>>> (state=,code=0) > > >>>>>> > > >>>>>> Thanks, > > >>>>>> Hao > > >>>>>> > > >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht < > > >>>>>> [email protected]> wrote: > > >>>>>> > > >>>>>>> Now try on hashtags with the following: > > >>>>>>> > > >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from > > >> `/twitter.json` t > > >>>>>>> where t.entities.hashtags is not null limit 10; > > >>>>>>> > > >>>>>>> Query failed: Query failed: Failure while running fragment., > > >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be > > >> cast to > > >>>>>>> org.apache.drill.exec.vector.complex.MapVector [ > > >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ] > > >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ] > > >>>>>>> > > >>>>>>> > > >>>>>>> Error: exception while executing query: Failure while executing > > >> query. > > >>>>>>> (state=,code=0) > > >>>>>>> > > >>>>>>> > > >>>>>>> { > > >>>>>>> "entities": { > > >>>>>>> "trends": [], > > >>>>>>> "symbols": [], > > >>>>>>> "urls": [], > > >>>>>>> "hashtags": [], > > >>>>>>> "user_mentions": [] > > >>>>>>> }, > > >>>>>>> "entities": { > > >>>>>>> "trends": [1,2,3], > > >>>>>>> "symbols": [4,5,6], > > >>>>>>> "urls": [7,8,9], > > >>>>>>> "hashtags": [ > > >>>>>>> { > > >>>>>>> "text": "GoPatriots", > > >>>>>>> "indices": [] > > >>>>>>> } > > >>>>>>> ], > > >>>>>>> "user_mentions": [] > > >>>>>>> } > > >>>>>>> } > > >>>>>>> > > >>>>>>> The issue seems to be that if some records have arrays with maps > in > > >> them > > >>>>>>> and others are empty. > > >>>>>>> > > >>>>>>> —Andries > > >>>>>>> > > >>>>>>> > > >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <[email protected]> wrote: > > >>>>>>> > > >>>>>>>> Seems it works for below json file: > > >>>>>>>> { > > >>>>>>>> "entities": { > > >>>>>>>> "trends": [], > > >>>>>>>> "symbols": [], > > >>>>>>>> "urls": [], > > >>>>>>>> "hashtags": [ > > >>>>>>>> { > > >>>>>>>> "text": "GoPatriots", > > >>>>>>>> "indices": [ > > >>>>>>>> 83, > > >>>>>>>> 94 > > >>>>>>>> ] > > >>>>>>>> } > > >>>>>>>> ], > > >>>>>>>> "user_mentions": [] > > >>>>>>>> }, > > >>>>>>>> "entities": { > > >>>>>>>> "trends": [1,2,3], > > >>>>>>>> "symbols": [4,5,6], > > >>>>>>>> "urls": [7,8,9], > > >>>>>>>> "hashtags": [ > > >>>>>>>> { > > >>>>>>>> "text": "GoPatriots", > > >>>>>>>> "indices": [] > > >>>>>>>> } > > >>>>>>>> ], > > >>>>>>>> "user_mentions": [] > > >>>>>>>> } > > >>>>>>>> } > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as > t > > >> where > > >>>>>>>> t.entities.urls is not null; > > >>>>>>>> +------------+ > > >>>>>>>> | EXPR$0 | > > >>>>>>>> +------------+ > > >>>>>>>> | [7,8,9] | > > >>>>>>>> +------------+ > > >>>>>>>> 1 row selected (0.139 seconds) > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as > t > > >> where > > >>>>>>>> t.entities.urls is null; > > >>>>>>>> +------------+ > > >>>>>>>> | EXPR$0 | > > >>>>>>>> +------------+ > > >>>>>>>> +------------+ > > >>>>>>>> No rows selected (0.158 seconds) > > >>>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> Hao > > >>>>>>>> > > >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya < > [email protected]> > > >>>>> wrote: > > >>>>>>>> > > >>>>>>>>> I believe that this works if the array contains homogeneous > > >> primitive > > >>>>>>>>> types. In your example, it appears from the error, the array > > field > > >>>>>>> 'member' > > >>>>>>>>> contained maps for at least one record. > > >>>>>>>>> > > >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta < > > >> [email protected]> > > >>>>>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0): > > >>>>>>>>>> > > >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from > > >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0; > > >>>>>>>>>> Query failed: Query stopped., Failure while trying to > > materialize > > >>>>>>>>> incoming schema. Errors: > > >>>>>>>>>> > > >>>>>>>>>> Error in expression at index -1. Error: Missing function > > >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)]. Full > expression: > > >>>>>>> --UNKNOWN > > >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on > > >>>>>>> bullseye-3:31010 ] > > >>>>>>>>>> > > >>>>>>>>>> Error: exception while executing query: Failure while > executing > > >>>>> query. > > >>>>>>>>> (state=,code=0) > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> Chris Matta > > >>>>>>>>>> [email protected] > > >>>>>>>>>> 215-701-3146 > > >>>>>>>>>> > > >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya < > > [email protected] > > >>> > > >>>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>>> repeated_count('entities.urls') > 0 > > >>>>>>>>>>> > > >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht < > > >>>>>>>>>>> [email protected]> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>>> How do you filter out records with an empty array in drill? > > >>>>>>>>>>>> i.e some records have "url":[] and some will have an array > > >> with > > >>>>> data > > >>>>>>>>> in > > >>>>>>>>>>>> it. When trying to read records with data in the array drill > > >> fails > > >>>>>>> due > > >>>>>>>>>>> to > > >>>>>>>>>>>> records missing any data in the array. Trying a filter > with/* > > >> where > > >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url > is > > >> not > > >>>>>>>>> null. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as > > an > > >>>>>>>>> example > > >>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[] > > and > > >>>>>>> others > > >>>>>>>>>>> will > > >>>>>>>>>>>> look similar to what is listed below. > > >>>>>>>>>>>> > > >>>>>>>>>>>> "entities": { > > >>>>>>>>>>>> "trends": [], > > >>>>>>>>>>>> "symbols": [], > > >>>>>>>>>>>> "urls": [], > > >>>>>>>>>>>> "hashtags": [ > > >>>>>>>>>>>> { > > >>>>>>>>>>>> "text": "GoPatriots", > > >>>>>>>>>>>> "indices": [ > > >>>>>>>>>>>> 83, > > >>>>>>>>>>>> 94 > > >>>>>>>>>>>> ] > > >>>>>>>>>>>> } > > >>>>>>>>>>>> ], > > >>>>>>>>>>>> "user_mentions": [] > > >>>>>>>>>>>> }, > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thanks > > >>>>>>>>>>>> —Andries > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>> > > >>>>> > > >>> > > >>> > > >>> <z.json> > > >> > > >> > > > > > -- Steven Phillips Software Engineer mapr.com
