partitioned_table is partitioned on year, month, day. A query of the form > select max(day) from partitioned_table where year=2013 and month=4; spins up zero mappers, one reducer, and returns NULL
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1 2013-04-18 12:35:57,842 Stage-1 map = 0%, reduce = 0% 2013-04-18 12:36:00,855 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.36 sec 2013-04-18 12:36:01,860 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.36 sec 2013-04-18 12:36:02,865 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.36 sec MapReduce Total cumulative CPU time: 1 seconds 360 msec Ended Job = job_201302011159_205031 MapReduce Jobs Launched: Job 0: Reduce: 1 Accumulative CPU: 1.36 sec HDFS Read: 0 HDFS Write: 3 SUCESS Total MapReduce CPU Time Spent: 1 seconds 360 msec OK NULL Time taken: 8.351 seconds Trying to work round it by doing something like select max(day) from (select day from partitioned_table where year=2013 and month=4)foo; doesn't work either, I'm guessing because the optimizer pushes down the max() into the subquery, resulting in the same query as above. Using an explicit intermediate table does work > create table foo_max as select day from partitioned_table where >year=2013 and month=4; select max(day) from foo_max; drop table foo_max; Several map-reduce jobs later, the correct answer (of 18) is given. distinct() has similar behavior, except that rather than returning NULL, it returns nothing. min() returns NULL avg() works! If someone else can reproduce this, I'll submit a bug. Robin