Good to know now it works. Regarding the support of alias in having clause, we had some discussion before. In a schema-aware system, user normally would see the query using the alias fails at validation, since the alias column does not exist in the table. This will prevent them from getting "surprising" result, just as what you got.
In a schema-less system like Drill, it is more likely to encounter this kind of "surprising" problem, since Drill will simply assume the "event_count" reference is a regular column from the table. We probably should enhance the SQL planner to add the support of alias in having clause. On Thu, Jul 30, 2015 at 9:16 AM, Stefán Baxter <[email protected]> wrote: > thank you, I got it :). > > Completely understanding the Having clause "limitation" and just missing > the column aliasing a bit. > When I looked at the null results I was so sure that there were no nulls in > the data-set that I did not bother to double check and just wrote the > email. > > Thanks for the assistance, > -Stefan > > > On Thu, Jul 30, 2015 at 4:11 PM, Jinfeng Ni <[email protected]> wrote: > > > This is related to using alias in having clause [1]. Drill does not > allow > > using alias in having clause. > > > > > > Q1. > > select p.client_ip.country_code country_code, count(*) event_count from > > dfs.tmp.`/events` as p where event_count > 30000 group by > > p.client_ip.country_code; > > > > In Q1, the event-count in the where clause is not refer to the alias > > "event_count". In stead, it is resolved to refer to a column from the > table > > "events". In the table, "event_count" does not exists, and is treated as > > null. That's why you get 0 rows. > > > > Q2. > > select p.client_ip.country_code country_code, count(*) event_count from > > dfs.tmp.`/events` as p group by p.client_ip.country_code having > event_count > > > 30000; > > > > Again, "event_count" in having clause is not resolved to alias > > "event_count" in select list. That's why SQL validation fail. > > > > > > Q3 > > select p.client_ip.country_code country_code, count(*) event_count from > > dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) > > > > 30000; > > > > This is the right approach. > > > > Regarding the null value and the total count, I tried with a slightly > > modified input ( adding two rows with country_code as null). > > > > {"client_ip":{"country_code":"US"}} > > {"client_ip":{"country_code":"US"}} > > {"client_ip":{"country_code":"US"}} > > {"client_ip":{"country_code":"GB"}} > > {"client_ip":{"country_code":"US"}} > > {"client_ip":{}} > > {"client_ip":{}} > > > > > > select p.client_ip.country_code country_code, count(*) event_count > > from dfs.tmp.`events.json` as p > > group by p.client_ip.country_code > > having count(*) > 1; > > > > +---------------+--------------+ > > | country_code | event_count | > > +---------------+--------------+ > > | US | 4 | > > | null | 2 | > > +---------------+--------------+ > > 2 rows selected (0.285 seconds) > > > > Looks like the query gets the correct answer. > > > > > > [1] > > > > > http://stackoverflow.com/questions/2068682/why-cant-i-use-alias-in-a-count-column-and-reference-it-in-a-having-clause > > > > > > On Thu, Jul 30, 2015 at 8:40 AM, Stefán Baxter < > [email protected]> > > wrote: > > > > > Hi, > > > > > > event_count is a column alias for count(*) (without the as). > > > > > > Regards, > > > -Stefan > > > > > > On Thu, Jul 30, 2015 at 3:32 PM, Jacques Nadeau <[email protected]> > > > wrote: > > > > > > > select p.client_ip.country_code country_code, count(*) event_count > from > > > > dfs.tmp.`/events` as p where event_count > 30000 group by > > > > p.client_ip.country_code; > > > > +---------------+--------------+ > > > > | country_code | event_count | > > > > +---------------+--------------+ > > > > +---------------+--------------+ > > > > > > > > In this situation, a fully-schemaed system would have stated that the > > > > event_count column does not exist. In the case of json, we don't > know > > if > > > > you're unlying data has an event_count column. When we actually try > to > > > get > > > > it, we find it doesn't exist and is therefore null. Null is never > > > greater > > > > than 30000 so we exclude all records. Modify your data to look like > > > this: > > > > > > > > {event_count: 30001, "client_ip":{"country_code":"US"}} > > > > {event_count: 30001, "client_ip":{"country_code":"US"}} > > > > {event_count: 30001, "client_ip":{"country_code":"US"}} > > > > {event_count: 30001, "client_ip":{"country_code":"GB"}} > > > > {event_count: 30001, "client_ip":{"country_code":"US"}} > > > > > > > > And you'll see what your query is actually doing. In SQL, an > aggregate > > > > function is applied after the WHERE clause. The only way to apply a > > > filter > > > > after an aggregate is with the HAVING clause. > > > > > > > > > > > > > > > > -- > > > > Jacques Nadeau > > > > CTO and Co-Founder, Dremio > > > > > > > > On Thu, Jul 30, 2015 at 8:20 AM, Andries Engelbrecht < > > > > [email protected]> wrote: > > > > > > > > > Last I checked group by and having clauses do not support column > > > aliases. > > > > > > > > > > —Andries > > > > > > > > > > > > > > > > On Jul 30, 2015, at 7:19 AM, Stefán Baxter < > > > [email protected]> > > > > > wrote: > > > > > > > > > > > > Hi, > > > > > > > > > > > > That last case works as expected, sorry, this test data does have > > > null > > > > > > values for country_code. > > > > > > > > > > > > That means that I have a working solution but that it would be > nice > > > if > > > > v1 > > > > > > (above) would work. > > > > > > > > > > > > Thank you, > > > > > > -Stefán > > > > > > > > > > > > On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter < > > > > > [email protected]> > > > > > > wrote: > > > > > > > > > > > >> make that "Count in where or having clause does not work as > > > expected" > > > > > >> > > > > > >> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter < > > > > > [email protected]> > > > > > >> wrote: > > > > > >> > > > > > >>> Hi, > > > > > >>> > > > > > >>> I have data that can be reduced to this: > > > > > >>> > > > > > >>> - {"client_ip":{"country_code":"US"}} > > > > > >>> - {"client_ip":{"country_code":"US"}} > > > > > >>> - {"client_ip":{"country_code":"US"}} > > > > > >>> - {"client_ip":{"country_code":"GB"}} > > > > > >>> - {"client_ip":{"country_code":"US"}} > > > > > >>> > > > > > >>> This works fine: > > > > > >>> > > > > > >>> select p.client_ip.country_code country_code, count(*) > > event_count > > > > from > > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code; > > > > > >>> > > > > > >>> > > > > > >>> +---------+--------------+ > > > > > >>> | country_code | event_count | > > > > > >>> +---------+--------------+ > > > > > >>> | US | 21516 | > > > > > >>> | GB | 323594 | > > > > > >>> +---------+--------------+ > > > > > >>> > > > > > >>> > > > > > >>> This simple where clause returns nothing: > > > > > >>> > > > > > >>> select p.client_ip.country_code country_code, count(*) > > event_count > > > > from > > > > > >>> dfs.tmp.`/events` as p where event_count > 30000 group by > > > > > >>> p.client_ip.country_code; > > > > > >>> +---------------+--------------+ > > > > > >>> | country_code | event_count | > > > > > >>> +---------------+--------------+ > > > > > >>> +---------------+--------------+ > > > > > >>> > > > > > >>> > > > > > >>> Using a having clause (v1): > > > > > >>> > > > > > >>> select p.client_ip.country_code country_code, count(*) > > event_count > > > > from > > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having > > > > > event_count > > > > > >>>> 30000; > > > > > >>> > > > > > >>> Error: PARSE ERROR: From line 1, column 169 to line 1, column > > 179: > > > > > >>> Expression 'event_count' is not being grouped > > > > > >>> > > > > > >>> > > > > > >>> Using a having clause (v2): > > > > > >>> > > > > > >>> select p.client_ip.country_code country_code, count(*) > > event_count > > > > from > > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having > > > > > count(*) > > > > > > >>> 30000; > > > > > >>> +---------------+--------------+ > > > > > >>> | country_code | event_count | > > > > > >>> +---------------+--------------+ > > > > > >>> | GB | 323594 | > > > > > >>> | null* | 566667 | > > > > > >>> +---------------+--------------+ > > > > > >>> > > > > > >>> > > > > > >>> * there are no null values in the database and this looks to be > > the > > > > > >>> "total for the rest" > > > > > >>> > > > > > >>> > > > > > >>> Does anyone know how this can be made to work? > > > > > >>> > > > > > >>> Regards, > > > > > >>> -Stefan > > > > > >>> > > > > > >> > > > > > >> > > > > > > > > > > > > > > > > > > > >
