Hello. This is basically the same question I posted on stackoverflow: http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115
I know the query is a bit noisy. But this query also demonstrates the error:

    select a.symbol from (select symbol, ordertype from cat group by symbol, ordertype) a group by a.symbol;

Now, this query may not make much sense but in my case, because I have 24 symbols, I expect a result of 24 rows. But I get 48 rows back. A similar query:

    select a.Symbol,count(*) from (select c.Symbol,c.catid from cat as c group by c.Symbol,c.catid) a group by a.Symbol;

returns 864 rows, where I still expect 24 rows...

If there are alternatives as to how to write the original query in my SO post I would much appreciate hearing them. The examples given in this mail have just been provided to demonstrate the problem using easier to understand queries and I don't need advice on them.

The .csv data and example is from a toy example. My real setup is 6 nodes, and the table definition is:

    create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, TIMESTAMP timestamp)
    PARTITIONED BY (SYMBOL string, REPID int)
    row format delimited fields terminated by ','
    stored as ORC;

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.max.dynamic.partitions.pernode=1000;

    insert...

Thank you so much for any input.

/Sincerely
Mikael
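
PS: To make the expected result concrete, here is a minimal sketch in Python using the standard sqlite3 module instead of Hive. Everything in it is an assumption for illustration: the table has only three plain columns (symbol is not a partition column here), and the rows are made-up toy data with 3 symbols rather than my real 24. It only shows what I understand the semantics of the nested GROUP BY to be, namely one output row per symbol for both queries.

```python
import sqlite3

# Hypothetical toy data (not the real table): 3 symbols, 2 order types
# per symbol, 4 rows per (symbol, ordertype) pair, each with a unique catid.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cat (symbol TEXT, ordertype INTEGER, catid INTEGER)")

catid = 0
for symbol in ("AAA", "BBB", "CCC"):
    for ordertype in (1, 2):
        for _ in range(4):
            catid += 1
            conn.execute(
                "INSERT INTO cat VALUES (?, ?, ?)", (symbol, ordertype, catid)
            )

# First query from the mail: the outer GROUP BY collapses the
# (symbol, ordertype) pairs back down to one row per symbol.
first = conn.execute(
    "SELECT a.symbol FROM (SELECT symbol, ordertype FROM cat "
    "GROUP BY symbol, ordertype) a GROUP BY a.symbol"
).fetchall()

# Second query from the mail: one row per symbol, with the number of
# distinct (symbol, catid) pairs as the count.
second = conn.execute(
    "SELECT a.symbol, COUNT(*) FROM (SELECT c.symbol, c.catid FROM cat AS c "
    "GROUP BY c.symbol, c.catid) a GROUP BY a.symbol"
).fetchall()

print(len(first), len(second))
```

With this toy data both queries come back with one row per symbol, which is the behaviour I expected from Hive as well (24 rows for my 24 symbols).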