thank you, I got it :). Completely understanding the Having clause "limitation" and just missing the column aliasing a bit. When I looked at the null results I was so sure that there were no nulls in the data-set that I did not bother to double check and just wrote the email.
Thanks for the assistance, -Stefan On Thu, Jul 30, 2015 at 4:11 PM, Jinfeng Ni <[email protected]> wrote: > This is related to using alias in having clause [1]. Drill does not allow > using alias in having clause. > > > Q1. > select p.client_ip.country_code country_code, count(*) event_count from > dfs.tmp.`/events` as p where event_count > 30000 group by > p.client_ip.country_code; > > In Q1, the event-count in the where clause is not refer to the alias > "event_count". In stead, it is resolved to refer to a column from the table > "events". In the table, "event_count" does not exists, and is treated as > null. That's why you get 0 rows. > > Q2. > select p.client_ip.country_code country_code, count(*) event_count from > dfs.tmp.`/events` as p group by p.client_ip.country_code having event_count > > 30000; > > Again, "event_count" in having clause is not resolved to alias > "event_count" in select list. That's why SQL validation fail. > > > Q3 > select p.client_ip.country_code country_code, count(*) event_count from > dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) > > 30000; > > This is the right approach. > > Regarding the null value and the total count, I tried with a slightly > modified input ( adding two rows with country_code as null). > > {"client_ip":{"country_code":"US"}} > {"client_ip":{"country_code":"US"}} > {"client_ip":{"country_code":"US"}} > {"client_ip":{"country_code":"GB"}} > {"client_ip":{"country_code":"US"}} > {"client_ip":{}} > {"client_ip":{}} > > > select p.client_ip.country_code country_code, count(*) event_count > from dfs.tmp.`events.json` as p > group by p.client_ip.country_code > having count(*) > 1; > > +---------------+--------------+ > | country_code | event_count | > +---------------+--------------+ > | US | 4 | > | null | 2 | > +---------------+--------------+ > 2 rows selected (0.285 seconds) > > Looks like the query gets the correct answer. > > > [1] > > http://stackoverflow.com/questions/2068682/why-cant-i-use-alias-in-a-count-column-and-reference-it-in-a-having-clause > > > On Thu, Jul 30, 2015 at 8:40 AM, Stefán Baxter <[email protected]> > wrote: > > > Hi, > > > > event_count is a column alias for count(*) (without the as). > > > > Regards, > > -Stefan > > > > On Thu, Jul 30, 2015 at 3:32 PM, Jacques Nadeau <[email protected]> > > wrote: > > > > > select p.client_ip.country_code country_code, count(*) event_count from > > > dfs.tmp.`/events` as p where event_count > 30000 group by > > > p.client_ip.country_code; > > > +---------------+--------------+ > > > | country_code | event_count | > > > +---------------+--------------+ > > > +---------------+--------------+ > > > > > > In this situation, a fully-schemaed system would have stated that the > > > event_count column does not exist. In the case of json, we don't know > if > > > you're unlying data has an event_count column. When we actually try to > > get > > > it, we find it doesn't exist and is therefore null. Null is never > > greater > > > than 30000 so we exclude all records. Modify your data to look like > > this: > > > > > > {event_count: 30001, "client_ip":{"country_code":"US"}} > > > {event_count: 30001, "client_ip":{"country_code":"US"}} > > > {event_count: 30001, "client_ip":{"country_code":"US"}} > > > {event_count: 30001, "client_ip":{"country_code":"GB"}} > > > {event_count: 30001, "client_ip":{"country_code":"US"}} > > > > > > And you'll see what your query is actually doing. In SQL, an aggregate > > > function is applied after the WHERE clause. The only way to apply a > > filter > > > after an aggregate is with the HAVING clause. > > > > > > > > > > > > -- > > > Jacques Nadeau > > > CTO and Co-Founder, Dremio > > > > > > On Thu, Jul 30, 2015 at 8:20 AM, Andries Engelbrecht < > > > [email protected]> wrote: > > > > > > > Last I checked group by and having clauses do not support column > > aliases. > > > > > > > > —Andries > > > > > > > > > > > > > On Jul 30, 2015, at 7:19 AM, Stefán Baxter < > > [email protected]> > > > > wrote: > > > > > > > > > > Hi, > > > > > > > > > > That last case works as expected, sorry, this test data does have > > null > > > > > values for country_code. > > > > > > > > > > That means that I have a working solution but that it would be nice > > if > > > v1 > > > > > (above) would work. > > > > > > > > > > Thank you, > > > > > -Stefán > > > > > > > > > > On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter < > > > > [email protected]> > > > > > wrote: > > > > > > > > > >> make that "Count in where or having clause does not work as > > expected" > > > > >> > > > > >> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter < > > > > [email protected]> > > > > >> wrote: > > > > >> > > > > >>> Hi, > > > > >>> > > > > >>> I have data that can be reduced to this: > > > > >>> > > > > >>> - {"client_ip":{"country_code":"US"}} > > > > >>> - {"client_ip":{"country_code":"US"}} > > > > >>> - {"client_ip":{"country_code":"US"}} > > > > >>> - {"client_ip":{"country_code":"GB"}} > > > > >>> - {"client_ip":{"country_code":"US"}} > > > > >>> > > > > >>> This works fine: > > > > >>> > > > > >>> select p.client_ip.country_code country_code, count(*) > event_count > > > from > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code; > > > > >>> > > > > >>> > > > > >>> +---------+--------------+ > > > > >>> | country_code | event_count | > > > > >>> +---------+--------------+ > > > > >>> | US | 21516 | > > > > >>> | GB | 323594 | > > > > >>> +---------+--------------+ > > > > >>> > > > > >>> > > > > >>> This simple where clause returns nothing: > > > > >>> > > > > >>> select p.client_ip.country_code country_code, count(*) > event_count > > > from > > > > >>> dfs.tmp.`/events` as p where event_count > 30000 group by > > > > >>> p.client_ip.country_code; > > > > >>> +---------------+--------------+ > > > > >>> | country_code | event_count | > > > > >>> +---------------+--------------+ > > > > >>> +---------------+--------------+ > > > > >>> > > > > >>> > > > > >>> Using a having clause (v1): > > > > >>> > > > > >>> select p.client_ip.country_code country_code, count(*) > event_count > > > from > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having > > > > event_count > > > > >>>> 30000; > > > > >>> > > > > >>> Error: PARSE ERROR: From line 1, column 169 to line 1, column > 179: > > > > >>> Expression 'event_count' is not being grouped > > > > >>> > > > > >>> > > > > >>> Using a having clause (v2): > > > > >>> > > > > >>> select p.client_ip.country_code country_code, count(*) > event_count > > > from > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having > > > > count(*) > > > > > >>> 30000; > > > > >>> +---------------+--------------+ > > > > >>> | country_code | event_count | > > > > >>> +---------------+--------------+ > > > > >>> | GB | 323594 | > > > > >>> | null* | 566667 | > > > > >>> +---------------+--------------+ > > > > >>> > > > > >>> > > > > >>> * there are no null values in the database and this looks to be > the > > > > >>> "total for the rest" > > > > >>> > > > > >>> > > > > >>> Does anyone know how this can be made to work? > > > > >>> > > > > >>> Regards, > > > > >>> -Stefan > > > > >>> > > > > >> > > > > >> > > > > > > > > > > > > > >
