This is related to using alias in having clause [1]. Drill does not allow
using alias in having clause.
Q1.
select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p where event_count > 30000 group by
p.client_ip.country_code;
In Q1, the event-count in the where clause is not refer to the alias
"event_count". In stead, it is resolved to refer to a column from the table
"events". In the table, "event_count" does not exists, and is treated as
null. That's why you get 0 rows.
Q2.
select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p group by p.client_ip.country_code having event_count
> 30000;
Again, "event_count" in having clause is not resolved to alias
"event_count" in select list. That's why SQL validation fail.
Q3
select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) >
30000;
This is the right approach.
Regarding the null value and the total count, I tried with a slightly
modified input ( adding two rows with country_code as null).
{"client_ip":{"country_code":"US"}}
{"client_ip":{"country_code":"US"}}
{"client_ip":{"country_code":"US"}}
{"client_ip":{"country_code":"GB"}}
{"client_ip":{"country_code":"US"}}
{"client_ip":{}}
{"client_ip":{}}
select p.client_ip.country_code country_code, count(*) event_count
from dfs.tmp.`events.json` as p
group by p.client_ip.country_code
having count(*) > 1;
+---------------+--------------+
| country_code | event_count |
+---------------+--------------+
| US | 4 |
| null | 2 |
+---------------+--------------+
2 rows selected (0.285 seconds)
Looks like the query gets the correct answer.
[1]
http://stackoverflow.com/questions/2068682/why-cant-i-use-alias-in-a-count-column-and-reference-it-in-a-having-clause
On Thu, Jul 30, 2015 at 8:40 AM, Stefán Baxter <[email protected]>
wrote:
> Hi,
>
> event_count is a column alias for count(*) (without the as).
>
> Regards,
> -Stefan
>
> On Thu, Jul 30, 2015 at 3:32 PM, Jacques Nadeau <[email protected]>
> wrote:
>
> > select p.client_ip.country_code country_code, count(*) event_count from
> > dfs.tmp.`/events` as p where event_count > 30000 group by
> > p.client_ip.country_code;
> > +---------------+--------------+
> > | country_code | event_count |
> > +---------------+--------------+
> > +---------------+--------------+
> >
> > In this situation, a fully-schemaed system would have stated that the
> > event_count column does not exist. In the case of json, we don't know if
> > you're unlying data has an event_count column. When we actually try to
> get
> > it, we find it doesn't exist and is therefore null. Null is never
> greater
> > than 30000 so we exclude all records. Modify your data to look like
> this:
> >
> > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > {event_count: 30001, "client_ip":{"country_code":"GB"}}
> > {event_count: 30001, "client_ip":{"country_code":"US"}}
> >
> > And you'll see what your query is actually doing. In SQL, an aggregate
> > function is applied after the WHERE clause. The only way to apply a
> filter
> > after an aggregate is with the HAVING clause.
> >
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Thu, Jul 30, 2015 at 8:20 AM, Andries Engelbrecht <
> > [email protected]> wrote:
> >
> > > Last I checked group by and having clauses do not support column
> aliases.
> > >
> > > —Andries
> > >
> > >
> > > > On Jul 30, 2015, at 7:19 AM, Stefán Baxter <
> [email protected]>
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > That last case works as expected, sorry, this test data does have
> null
> > > > values for country_code.
> > > >
> > > > That means that I have a working solution but that it would be nice
> if
> > v1
> > > > (above) would work.
> > > >
> > > > Thank you,
> > > > -Stefán
> > > >
> > > > On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter <
> > > [email protected]>
> > > > wrote:
> > > >
> > > >> make that "Count in where or having clause does not work as
> expected"
> > > >>
> > > >> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <
> > > [email protected]>
> > > >> wrote:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> I have data that can be reduced to this:
> > > >>>
> > > >>> - {"client_ip":{"country_code":"US"}}
> > > >>> - {"client_ip":{"country_code":"US"}}
> > > >>> - {"client_ip":{"country_code":"US"}}
> > > >>> - {"client_ip":{"country_code":"GB"}}
> > > >>> - {"client_ip":{"country_code":"US"}}
> > > >>>
> > > >>> This works fine:
> > > >>>
> > > >>> select p.client_ip.country_code country_code, count(*) event_count
> > from
> > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code;
> > > >>>
> > > >>>
> > > >>> +---------+--------------+
> > > >>> | country_code | event_count |
> > > >>> +---------+--------------+
> > > >>> | US | 21516 |
> > > >>> | GB | 323594 |
> > > >>> +---------+--------------+
> > > >>>
> > > >>>
> > > >>> This simple where clause returns nothing:
> > > >>>
> > > >>> select p.client_ip.country_code country_code, count(*) event_count
> > from
> > > >>> dfs.tmp.`/events` as p where event_count > 30000 group by
> > > >>> p.client_ip.country_code;
> > > >>> +---------------+--------------+
> > > >>> | country_code | event_count |
> > > >>> +---------------+--------------+
> > > >>> +---------------+--------------+
> > > >>>
> > > >>>
> > > >>> Using a having clause (v1):
> > > >>>
> > > >>> select p.client_ip.country_code country_code, count(*) event_count
> > from
> > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > > event_count
> > > >>>> 30000;
> > > >>>
> > > >>> Error: PARSE ERROR: From line 1, column 169 to line 1, column 179:
> > > >>> Expression 'event_count' is not being grouped
> > > >>>
> > > >>>
> > > >>> Using a having clause (v2):
> > > >>>
> > > >>> select p.client_ip.country_code country_code, count(*) event_count
> > from
> > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > > count(*) >
> > > >>> 30000;
> > > >>> +---------------+--------------+
> > > >>> | country_code | event_count |
> > > >>> +---------------+--------------+
> > > >>> | GB | 323594 |
> > > >>> | null* | 566667 |
> > > >>> +---------------+--------------+
> > > >>>
> > > >>>
> > > >>> * there are no null values in the database and this looks to be the
> > > >>> "total for the rest"
> > > >>>
> > > >>>
> > > >>> Does anyone know how this can be made to work?
> > > >>>
> > > >>> Regards,
> > > >>> -Stefan
> > > >>>
> > > >>
> > > >>
> > >
> > >
> >
>