Robin,
I was interested so tried a quick repro.  Yes I see the behavior too.

Environment: hive-0.9.0 on windows.
Hive> CREATE TABLE ptable(val int) PARTITIONED BY(year int, month int, day int);
Hive> LOAD DATA LOCAL INPATH d:/data/2013-01-01.data.txt INTO TABLE ptable 
PARTITION(2013,1,1);
// the txt file has separate lines with values 1,2,3,4.
Hive> select max(day) from ptable;
..
OK
NULL
Hive> select avg(day) from ptable;
OK
2.5

Other normal queries work fine. I didn't try other aggregations.

-mike.
-----Original Message-----
From: Robin Morris [mailto:r...@baynote.com] 
Sent: Thursday, April 18, 2013 1:15 PM
To: user@hive.apache.org
Subject: max() returns NULL on partition column

partitioned_table is partitioned on year, month, day.

A query of the form
> select max(day) from partitioned_table where year=2013 and month=4;
spins up zero mappers, one reducer, and returns NULL

Hadoop job information for Stage-1: number of mappers: 0; number of
reducers: 1
2013-04-18 12:35:57,842 Stage-1 map = 0%,  reduce = 0%
2013-04-18 12:36:00,855 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU
1.36 sec
2013-04-18 12:36:01,860 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU
1.36 sec
2013-04-18 12:36:02,865 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec Ended Job = 
job_201302011159_205031 MapReduce Jobs Launched:
Job 0: Reduce: 1   Accumulative CPU: 1.36 sec   HDFS Read: 0 HDFS Write: 3
SUCESS
Total MapReduce CPU Time Spent: 1 seconds 360 msec OK NULL Time taken: 8.351 
seconds



Trying to work round it by doing something like select max(day) from (select 
day from partitioned_table where year=2013 and month=4)foo; doesn't work 
either, I'm guessing because the optimizer pushes down the
max() into the subquery, resulting in the same query as above.


Using an explicit intermediate table does work
> create table foo_max as select day from partitioned_table where
>year=2013 and month=4;  select max(day) from foo_max; drop table 
>foo_max;
Several map-reduce jobs later, the correct answer (of 18) is given.


distinct() has similar behavior, except that rather than returning NULL, it 
returns nothing.
min() returns NULL
avg() works!


If someone else can reproduce this, I'll submit a bug.

Robin




Reply via email to