Duplicate rows when using group by in subquery

Mikael Öhman Mon, 16 Sep 2013 01:25:15 -0700

Hello.

This is basically the same question I posted on Stack Overflow:
http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115


I know the query in that post is a bit noisy, but this simpler query also demonstrates the error:

select a.symbol from (select symbol, ordertype from cat group by symbol, 
ordertype) a group by a.symbol;

Now, this query may not make much sense, but since I have 24 symbols, I
expect a result of 24 rows. Instead, I get 48 rows back. A similar query:

select a.Symbol,count(*) from (select c.Symbol,c.catid from cat as c group by 
c.Symbol,c.catid) a group by a.Symbol;

returns 864 rows, where I still expect 24. If there are alternative ways to
write the original query in my SO post, I would much appreciate hearing them.
The examples in this mail are only there to demonstrate the problem with
easier-to-understand queries, so I don't need advice on them.
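
For reference, the expected semantics of the two queries can be checked outside Hive. The following is only a minimal sketch using Python's built-in sqlite3 module, not Hive, so it says nothing about Hive's own behavior. The data is hypothetical: 24 made-up symbols, 2 order types per symbol and 18 rows per pair, chosen so that the inner subqueries return 48 and 864 rows. The table keeps only the three columns the queries touch.

```python
import sqlite3

# Hypothetical toy data (not the real table): 24 symbols, 2 order types
# per symbol, 18 rows per (symbol, ordertype) pair = 864 rows in total.
SYMBOLS = ["SYM%02d" % i for i in range(24)]

conn = sqlite3.connect(":memory:")
conn.execute("create table cat (catid integer, ordertype integer, symbol text)")

catid = 0
for symbol in SYMBOLS:
    for ordertype in (1, 2):
        for _ in range(18):
            catid += 1
            conn.execute(
                "insert into cat values (?, ?, ?)", (catid, ordertype, symbol)
            )

# First query from the mail: the inner group by yields 48 rows,
# the outer group by should collapse them to one row per symbol.
first = conn.execute(
    "select a.symbol from (select symbol, ordertype from cat "
    "group by symbol, ordertype) a group by a.symbol"
).fetchall()

# Second query from the mail: the inner group by yields 864 rows,
# the outer group by should again give one row per symbol.
second = conn.execute(
    "select a.symbol, count(*) from (select c.symbol, c.catid from cat as c "
    "group by c.symbol, c.catid) a group by a.symbol"
).fetchall()

print(len(first), len(second))
```

With this data, both queries return one row per symbol in SQLite, which is the result expected from Hive as well.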

The .csv data and example come from a toy example. My real setup is 6 nodes,
and the table definition is:

create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, 
INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID 
string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, 
TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID int) row format 
delimited fields terminated by ',' stored as ORC;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;

insert...

Thank you so much for any input.

/Sincerely Mikael
