max() returns NULL on partition column

Robin Morris Thu, 18 Apr 2013 13:15:04 -0700

partitioned_table is partitioned on year, month, day.

A query of the form
> select max(day) from partitioned_table where year=2013 and month=4;
spins up zero mappers, one reducer, and returns NULL


Hadoop job information for Stage-1: number of mappers: 0; number of
reducers: 1
2013-04-18 12:35:57,842 Stage-1 map = 0%,  reduce = 0%
2013-04-18 12:36:00,855 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU
1.36 sec
2013-04-18 12:36:01,860 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU
1.36 sec
2013-04-18 12:36:02,865 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_201302011159_205031
MapReduce Jobs Launched:
Job 0: Reduce: 1   Accumulative CPU: 1.36 sec   HDFS Read: 0 HDFS Write: 3
SUCESS
Total MapReduce CPU Time Spent: 1 seconds 360 msec
OK
NULL
Time taken: 8.351 seconds



Trying to work round it by doing something like
select max(day) from (select day from partitioned_table where year=2013
and month=4)foo;
doesn't work either, I'm guessing because the optimizer pushes down the
max() into the subquery, resulting in the same query as above.


Using an explicit intermediate table does work
> create table foo_max as select day from partitioned_table where
>year=2013 and month=4;  select max(day) from foo_max; drop table foo_max;
Several map-reduce jobs later, the correct answer (of 18) is given.


distinct() has similar behavior, except that rather than returning NULL,
it returns nothing.
min() returns NULL
avg() works!


If someone else can reproduce this, I'll submit a bug.

Robin

max() returns NULL on partition column

Reply via email to