DRILL-2055 is filed. Thanks, Hao
On Wed, Jan 21, 2015 at 5:37 PM, Jacques Nadeau <[email protected]> wrote: > Steven, are you referring to the fact that he has two map keys with the > same name? > > If so, is that technically invalid json? > > I generally agree with this post [1]. > > > http://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object#answer-23195243 > > If others agree, I think it would be appropriate for Hao to file a JIRA to > make sure we follow and check this convention. > > J > > On Wed, Jan 21, 2015 at 5:07 PM, Steven Phillips <[email protected]> > wrote: > > > Hao, > > > > It looks to me like your json file is actually invalid, and it seems like > > Drill should be throwing an exception. > > > > On Wed, Jan 21, 2015 at 5:04 PM, Hao Zhu <[email protected]> wrote: > > > > > I figured out the differences after getting the json file from Andries. > > > > > > My json file is like: > > > {“entities”:{xxx},“entities”:{yyy}... } > > > > > > Andries' json file is like: > > > {“entities”:{xxx}} > > > {“entities”:{yyy}} > > > ... > > > > > > So basically Andries' json file is contains multiple json files. > > > > > > Fortunately, we can use the same SQL to get the same results: > > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t; > > > +------------+ > > > | EXPR$0 | > > > +------------+ > > > | [] | > > > | [{"text":"GoPatriots"}] | > > > | [{"text":"aaa"}] | > > > | [] | > > > | [{"text":"bbb"}] | > > > +------------+ > > > 5 rows selected (0.136 seconds) > > > 0: jdbc:drill:> with tmp as > > > . . . . . . . > ( > > > . . . . . . . > select flatten(t.entities.hashtags) as c from > > > dfs.tmp.`z2.json` t > > > . . . . . . . > ) > > > . . . . . . . > select tmp.c.text from tmp; > > > +------------+ > > > | EXPR$0 | > > > +------------+ > > > | GoPatriots | > > > | aaa | > > > | bbb | > > > +------------+ > > > 3 rows selected (0.115 seconds) > > > > > > Thanks, > > > Hao > > > > > > > > > > > > > > > > > > > > > On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht < > > > [email protected]> wrote: > > > > > > > In my case it returns the empty records when flatten is not used. > > > > > > > > 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as > hashtags > > > > from `twitter.json` t limit 10; > > > > +------------+ > > > > | hashtags | > > > > +------------+ > > > > | [] | > > > > | [{"text":"SportsNews","indices":[0,11]}] | > > > > | [] | > > > > | [{"text":"SportsNews","indices":[0,11]}] | > > > > | [] | > > > > | [] | > > > > | [] | > > > > | [] | > > > > | [] | > > > > | [{"text":"CARvsSEA","indices":[36,45]}] | > > > > +------------+ > > > > > > > > On Jan 21, 2015, at 4:26 PM, Hao Zhu <[email protected]> wrote: > > > > > > > > > Actually not due to flatten, if you directly query the file, it > will > > > only > > > > > show the non-null values. > > > > > > > > > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t; > > > > > +------------+ > > > > > | EXPR$0 | > > > > > +------------+ > > > > > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] | > > > > > +------------+ > > > > > 1 row selected (0.123 seconds) > > > > > > > > > > Thanks, > > > > > Hao > > > > > > > > > > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht < > > > > > [email protected]> wrote: > > > > > > > > > >> Very interesting, flatten seems to bypass empty records. Not sure > if > > > > that > > > > >> is an ideal result for all use cases, but certainly usable in this > > > case. > > > > >> > > > > >> Thanks > > > > >> —Andries > > > > >> > > > > >> > > > > >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <[email protected]> wrote: > > > > >> > > > > >>> I can also fetch non-null values for attached json file which > > > contains > > > > 6 > > > > >> "entities", 3 of them are null, 3 of them have "text" value. > > > > >>> > > > > >>> Could you share your complete "twitter.json"? > > > > >>> > > > > >>> 0: jdbc:drill:> with tmp as > > > > >>> . . . . . . . > ( > > > > >>> . . . . . . . > select flatten(t.entities.hashtags) as c from > > > > >> dfs.tmp.`z.json` t > > > > >>> . . . . . . . > ) > > > > >>> . . . . . . . > select tmp.c.text from tmp; > > > > >>> +------------+ > > > > >>> | EXPR$0 | > > > > >>> +------------+ > > > > >>> | GoPatriots | > > > > >>> | aaa | > > > > >>> | bbb | > > > > >>> +------------+ > > > > >>> 3 rows selected (0.122 seconds) > > > > >>> > > > > >>> Thanks, > > > > >>> Hao > > > > >>> > > > > >>> > > > > >>> > > > > >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht < > > > > >> [email protected]> wrote: > > > > >>> The sample data I posted only has 1 element, but some records > have > > > > >> multiple elements in them. > > > > >>> > > > > >>> Interestingly enough though > > > > >>> select t.entities.hashtags[0].`text` from `twitter.json` t limit > > 10; > > > > >>> > > > > >>> Produces > > > > >>> +------------+ > > > > >>> | EXPR$0 | > > > > >>> +------------+ > > > > >>> | null | > > > > >>> | SportsNews | > > > > >>> | null | > > > > >>> | SportsNews | > > > > >>> | null | > > > > >>> | null | > > > > >>> | null | > > > > >>> | null | > > > > >>> | null | > > > > >>> | CARvsSEA | > > > > >>> +——————+ > > > > >>> > > > > >>> > > > > >>> And > > > > >>> > > > > >>> select t.entities.hashtags[0] from `twitter.json` t limit 10; > > > > >>> > > > > >>> +------------+ > > > > >>> | EXPR$0 | > > > > >>> +------------+ > > > > >>> | {"indices":[]} | > > > > >>> | {"text":"SportsNews","indices":[90,99]} | > > > > >>> | {"indices":[]} | > > > > >>> | {"text":"SportsNews","indices":[90,99]} | > > > > >>> | {"indices":[]} | > > > > >>> | {"indices":[]} | > > > > >>> | {"indices":[]} | > > > > >>> | {"indices":[]} | > > > > >>> | {"indices":[]} | > > > > >>> | {"text":"CARvsSEA","indices":[90,99]} | > > > > >>> +——————+ > > > > >>> > > > > >>> Strange part is that there is no indices map in the hashtags > array, > > > so > > > > >> no idea why it shows up when pointing to the first lament in an > > empty > > > > array. > > > > >>> > > > > >>> > > > > >>> —Andries > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <[email protected]> wrote: > > > > >>> > > > > >>>> I just noticed that the result of "hashtags" is just an array > with > > > > >> only 1 > > > > >>>> element. > > > > >>>> So take your example: > > > > >>>> [root@maprdemo tmp]# cat d.json > > > > >>>> { > > > > >>>> "entities": { > > > > >>>> "trends": [], > > > > >>>> "symbols": [], > > > > >>>> "urls": [], > > > > >>>> "hashtags": [], > > > > >>>> "user_mentions": [] > > > > >>>> }, > > > > >>>> "entities": { > > > > >>>> "trends": [1,2,3], > > > > >>>> "symbols": [4,5,6], > > > > >>>> "urls": [7,8,9], > > > > >>>> "hashtags": [ > > > > >>>> { > > > > >>>> "text": "GoPatriots", > > > > >>>> "indices": [] > > > > >>>> } > > > > >>>> ], > > > > >>>> "user_mentions": [] > > > > >>>> } > > > > >>>> } > > > > >>>> > > > > >>>> Now we can do this to achieve the results: > > > > >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` > > t ; > > > > >>>> +------------+ > > > > >>>> | EXPR$0 | > > > > >>>> +------------+ > > > > >>>> | [{"text":"GoPatriots"}] | > > > > >>>> +------------+ > > > > >>>> 1 row selected (0.09 seconds) > > > > >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from > > > > >> dfs.tmp.`d.json` t ; > > > > >>>> +------------+ > > > > >>>> | EXPR$0 | > > > > >>>> +------------+ > > > > >>>> | GoPatriots | > > > > >>>> +------------+ > > > > >>>> 1 row selected (0.108 seconds) > > > > >>>> > > > > >>>> Thanks, > > > > >>>> Hao > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht < > > > > >>>> [email protected]> wrote: > > > > >>>> > > > > >>>>> When I run the query on a larger dataset it actually show the > > empty > > > > >>>>> records. > > > > >>>>> > > > > >>>>> select t.entities.hashtags from `twitter.json` t limit 10; > > > > >>>>> > > > > >>>>> +------------+ > > > > >>>>> | EXPR$0 | > > > > >>>>> +------------+ > > > > >>>>> | [] | > > > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] | > > > > >>>>> | [] | > > > > >>>>> | [{"text":"SportsNews","indices":[0,11]}] | > > > > >>>>> | [] | > > > > >>>>> | [] | > > > > >>>>> | [] | > > > > >>>>> | [] | > > > > >>>>> | [] | > > > > >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] | > > > > >>>>> +------------+ > > > > >>>>> 10 rows selected (2.899 seconds) > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> However having the output as maps is not very useful, unless i > > can > > > > >> filter > > > > >>>>> out the records with empty arrays and then drill deeper into > the > > > ones > > > > >> with > > > > >>>>> data in the arrays. > > > > >>>>> > > > > >>>>> BTW: Hao I would have expiated your query to return both rows, > > one > > > > >> with an > > > > >>>>> empty array as above and the other with the array data. > > > > >>>>> > > > > >>>>> > > > > >>>>> —Andries > > > > >>>>> > > > > >>>>> > > > > >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <[email protected]> > wrote: > > > > >>>>> > > > > >>>>>> I am not sure if below is expected behavior. > > > > >>>>>> If we only select "hashtags", and it will return only 1 row > > > ignoring > > > > >> the > > > > >>>>>> null value. > > > > >>>>>> However then if we try to get "hashtags.text", it > fails...which > > > > >> means it > > > > >>>>> is > > > > >>>>>> still trying to read the NULL value. > > > > >>>>>> I am thinking it may confuse the SQL developers. > > > > >>>>>> > > > > >>>>>> > > > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags from > dfs.tmp.`d.json` > > > t ; > > > > >>>>>> +------------+ > > > > >>>>>> | EXPR$0 | > > > > >>>>>> +------------+ > > > > >>>>>> | [{"text":"GoPatriots"}] | > > > > >>>>>> +------------+ > > > > >>>>>> 1 row selected (0.109 seconds) > > > > >>>>>> > > > > >>>>>> > > > > >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from > > > > >> dfs.tmp.`d.json` t ; > > > > >>>>>> > > > > >>>>>> Query failed: Query failed: Failure while running fragment., > > > > >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot > be > > > > >> cast to > > > > >>>>>> org.apache.drill.exec.vector.complex.MapVector [ > > > > >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ] > > > > >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ] > > > > >>>>>> > > > > >>>>>> > > > > >>>>>> Error: exception while executing query: Failure while > executing > > > > >> query. > > > > >>>>>> (state=,code=0) > > > > >>>>>> > > > > >>>>>> Thanks, > > > > >>>>>> Hao > > > > >>>>>> > > > > >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht < > > > > >>>>>> [email protected]> wrote: > > > > >>>>>> > > > > >>>>>>> Now try on hashtags with the following: > > > > >>>>>>> > > > > >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from > > > > >> `/twitter.json` t > > > > >>>>>>> where t.entities.hashtags is not null limit 10; > > > > >>>>>>> > > > > >>>>>>> Query failed: Query failed: Failure while running fragment., > > > > >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot > > be > > > > >> cast to > > > > >>>>>>> org.apache.drill.exec.vector.complex.MapVector [ > > > > >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ] > > > > >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ] > > > > >>>>>>> > > > > >>>>>>> > > > > >>>>>>> Error: exception while executing query: Failure while > executing > > > > >> query. > > > > >>>>>>> (state=,code=0) > > > > >>>>>>> > > > > >>>>>>> > > > > >>>>>>> { > > > > >>>>>>> "entities": { > > > > >>>>>>> "trends": [], > > > > >>>>>>> "symbols": [], > > > > >>>>>>> "urls": [], > > > > >>>>>>> "hashtags": [], > > > > >>>>>>> "user_mentions": [] > > > > >>>>>>> }, > > > > >>>>>>> "entities": { > > > > >>>>>>> "trends": [1,2,3], > > > > >>>>>>> "symbols": [4,5,6], > > > > >>>>>>> "urls": [7,8,9], > > > > >>>>>>> "hashtags": [ > > > > >>>>>>> { > > > > >>>>>>> "text": "GoPatriots", > > > > >>>>>>> "indices": [] > > > > >>>>>>> } > > > > >>>>>>> ], > > > > >>>>>>> "user_mentions": [] > > > > >>>>>>> } > > > > >>>>>>> } > > > > >>>>>>> > > > > >>>>>>> The issue seems to be that if some records have arrays with > > maps > > > in > > > > >> them > > > > >>>>>>> and others are empty. > > > > >>>>>>> > > > > >>>>>>> —Andries > > > > >>>>>>> > > > > >>>>>>> > > > > >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <[email protected]> > > wrote: > > > > >>>>>>> > > > > >>>>>>>> Seems it works for below json file: > > > > >>>>>>>> { > > > > >>>>>>>> "entities": { > > > > >>>>>>>> "trends": [], > > > > >>>>>>>> "symbols": [], > > > > >>>>>>>> "urls": [], > > > > >>>>>>>> "hashtags": [ > > > > >>>>>>>> { > > > > >>>>>>>> "text": "GoPatriots", > > > > >>>>>>>> "indices": [ > > > > >>>>>>>> 83, > > > > >>>>>>>> 94 > > > > >>>>>>>> ] > > > > >>>>>>>> } > > > > >>>>>>>> ], > > > > >>>>>>>> "user_mentions": [] > > > > >>>>>>>> }, > > > > >>>>>>>> "entities": { > > > > >>>>>>>> "trends": [1,2,3], > > > > >>>>>>>> "symbols": [4,5,6], > > > > >>>>>>>> "urls": [7,8,9], > > > > >>>>>>>> "hashtags": [ > > > > >>>>>>>> { > > > > >>>>>>>> "text": "GoPatriots", > > > > >>>>>>>> "indices": [] > > > > >>>>>>>> } > > > > >>>>>>>> ], > > > > >>>>>>>> "user_mentions": [] > > > > >>>>>>>> } > > > > >>>>>>>> } > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` > > as > > > t > > > > >> where > > > > >>>>>>>> t.entities.urls is not null; > > > > >>>>>>>> +------------+ > > > > >>>>>>>> | EXPR$0 | > > > > >>>>>>>> +------------+ > > > > >>>>>>>> | [7,8,9] | > > > > >>>>>>>> +------------+ > > > > >>>>>>>> 1 row selected (0.139 seconds) > > > > >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` > > as > > > t > > > > >> where > > > > >>>>>>>> t.entities.urls is null; > > > > >>>>>>>> +------------+ > > > > >>>>>>>> | EXPR$0 | > > > > >>>>>>>> +------------+ > > > > >>>>>>>> +------------+ > > > > >>>>>>>> No rows selected (0.158 seconds) > > > > >>>>>>>> > > > > >>>>>>>> Thanks, > > > > >>>>>>>> Hao > > > > >>>>>>>> > > > > >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya < > > > [email protected]> > > > > >>>>> wrote: > > > > >>>>>>>> > > > > >>>>>>>>> I believe that this works if the array contains homogeneous > > > > >> primitive > > > > >>>>>>>>> types. In your example, it appears from the error, the > array > > > > field > > > > >>>>>>> 'member' > > > > >>>>>>>>> contained maps for at least one record. > > > > >>>>>>>>> > > > > >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta < > > > > >> [email protected]> > > > > >>>>>>>>> wrote: > > > > >>>>>>>>> > > > > >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0): > > > > >>>>>>>>>> > > > > >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` > from > > > > >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0; > > > > >>>>>>>>>> Query failed: Query stopped., Failure while trying to > > > > materialize > > > > >>>>>>>>> incoming schema. Errors: > > > > >>>>>>>>>> > > > > >>>>>>>>>> Error in expression at index -1. Error: Missing function > > > > >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)]. Full > > > expression: > > > > >>>>>>> --UNKNOWN > > > > >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on > > > > >>>>>>> bullseye-3:31010 ] > > > > >>>>>>>>>> > > > > >>>>>>>>>> Error: exception while executing query: Failure while > > > executing > > > > >>>>> query. > > > > >>>>>>>>> (state=,code=0) > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> Chris Matta > > > > >>>>>>>>>> [email protected] > > > > >>>>>>>>>> 215-701-3146 > > > > >>>>>>>>>> > > > > >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya < > > > > [email protected] > > > > >>> > > > > >>>>>>> wrote: > > > > >>>>>>>>>> > > > > >>>>>>>>>>> repeated_count('entities.urls') > 0 > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht < > > > > >>>>>>>>>>> [email protected]> wrote: > > > > >>>>>>>>>>> > > > > >>>>>>>>>>>> How do you filter out records with an empty array in > > drill? > > > > >>>>>>>>>>>> i.e some records have "url":[] and some will have an > > array > > > > >> with > > > > >>>>> data > > > > >>>>>>>>> in > > > > >>>>>>>>>>>> it. When trying to read records with data in the array > > drill > > > > >> fails > > > > >>>>>>> due > > > > >>>>>>>>>>> to > > > > >>>>>>>>>>>> records missing any data in the array. Trying a filter > > > with/* > > > > >> where > > > > >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying > url > > > is > > > > >> not > > > > >>>>>>>>> null. > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> Note some of the arrays contains maps, using twitter > data > > as > > > > an > > > > >>>>>>>>> example > > > > >>>>>>>>>>>> below. Some records have an empty array with > “hashtags”:[] > > > > and > > > > >>>>>>> others > > > > >>>>>>>>>>> will > > > > >>>>>>>>>>>> look similar to what is listed below. > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> "entities": { > > > > >>>>>>>>>>>> "trends": [], > > > > >>>>>>>>>>>> "symbols": [], > > > > >>>>>>>>>>>> "urls": [], > > > > >>>>>>>>>>>> "hashtags": [ > > > > >>>>>>>>>>>> { > > > > >>>>>>>>>>>> "text": "GoPatriots", > > > > >>>>>>>>>>>> "indices": [ > > > > >>>>>>>>>>>> 83, > > > > >>>>>>>>>>>> 94 > > > > >>>>>>>>>>>> ] > > > > >>>>>>>>>>>> } > > > > >>>>>>>>>>>> ], > > > > >>>>>>>>>>>> "user_mentions": [] > > > > >>>>>>>>>>>> }, > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> Thanks > > > > >>>>>>>>>>>> —Andries > > > > >>>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>> > > > > >>>>>>> > > > > >>>>> > > > > >>>>> > > > > >>> > > > > >>> > > > > >>> <z.json> > > > > >> > > > > >> > > > > > > > > > > > > > > > > > > > -- > > Steven Phillips > > Software Engineer > > > > mapr.com > > >
