liyunzhang_intel created HIVE-17287:
---------------------------------------

             Summary: HoS can not deal with skewed data group by
                 Key: HIVE-17287
                 URL: https://issues.apache.org/jira/browse/HIVE-17287
             Project: Hive
          Issue Type: Bug
            Reporter: liyunzhang_intel


In 
[tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
the fact table {{store_sales}} is joined with the small tables {{date_dim}}, 
{{item}}, and {{store}}. After the join, the intermediate data is grouped.
The {{store_sales}} data in the 3TB TPC-DS dataset is skewed: there are 1824 
partitions, of which the largest is 25.7G while the others are about 715M each.
{code}
hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
....
715.0 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
713.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
714.1 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
712.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
25.7 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
{code}
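To put a number on the skew, the listing can be summarized with a small script. This is only a sketch: it assumes each line has the {{<size> <unit> <path>}} layout of the listing, and the helper names {{parse_du_line}} and {{skew_report}} are made up for illustration, not part of Hive or Hadoop.

```python
import re
from statistics import median

# Binary multipliers for the unit letters printed by `hadoop fs -du -h`.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_du_line(line):
    """Parse one '<size> <unit> <path>' line; return (bytes, path) or None."""
    m = re.match(r"\s*([\d.]+)\s+([KMGT])\s+(\S+)\s*$", line)
    if not m:
        return None  # e.g. the '....' elision line
    size, unit, path = m.groups()
    return float(size) * UNITS[unit], path

def skew_report(lines):
    """Return (name of largest partition, largest size / median size)."""
    parsed = [p for p in map(parse_du_line, lines) if p]
    biggest_size, biggest_path = max(parsed)
    sizes = [size for size, _ in parsed]
    return biggest_path.rsplit("/", 1)[-1], biggest_size / median(sizes)

BASE = "/user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales"
listing = [
    "....",
    f"715.0 M  {BASE}/ss_sold_date_sk=2452639",
    f"713.9 M  {BASE}/ss_sold_date_sk=2452640",
    f"714.1 M  {BASE}/ss_sold_date_sk=2452641",
    f"712.9 M  {BASE}/ss_sold_date_sk=2452642",
    f"25.7 G   {BASE}/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__",
]

name, ratio = skew_report(listing)
print(name, round(ratio, 1))
```

On the five partitions listed, {{__HIVE_DEFAULT_PARTITION__}} comes out at roughly 37 times the median partition size.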
The skew in {{store_sales}} caused the job to fail. Is there any way to 
handle a group by on a skewed table? I tried enabling 
{{hive.groupby.skewindata}} so that the data is first distributed more evenly 
before the group by runs, but the job still hangs.
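For reference, the "divide the data more evenly, then group by" idea can be sketched as a two-stage aggregation. This is a toy illustration of the general technique only, not Hive's or Spark's implementation; the function names, the CRC32 routing, and the sample keys are all assumptions made for the example.

```python
import random
import zlib
from collections import Counter

def bucket(key, num_reducers):
    """Deterministic key -> reducer routing (stand-in for a hash partitioner)."""
    return zlib.crc32(str(key).encode()) % num_reducers

def two_stage_count(rows, num_reducers, seed=0):
    """Count rows per key in two stages; return (counts, stage1_load, stage2_load)."""
    rng = random.Random(seed)
    # Stage 1: scatter rows at random so a hot key is spread over all
    # reducers; each reducer keeps partial counts for the keys it sees.
    partials = [Counter() for _ in range(num_reducers)]
    for key in rows:
        partials[rng.randrange(num_reducers)][key] += 1
    stage1_load = [sum(p.values()) for p in partials]
    # Stage 2: route the (much smaller) partial results by key and merge.
    finals = [Counter() for _ in range(num_reducers)]
    stage2_load = [0] * num_reducers
    for partial in partials:
        for key, count in partial.items():
            b = bucket(key, num_reducers)
            finals[b][key] += count
            stage2_load[b] += 1
    counts = Counter()
    for f in finals:
        counts.update(f)
    return counts, stage1_load, stage2_load

# Hypothetical skewed input: one hot key plus many rare keys.
rows = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
one_stage_load = Counter(bucket(k, 8) for k in rows)
counts, stage1_load, stage2_load = two_stage_count(rows, 8)
print(max(one_stage_load.values()), max(stage1_load), max(stage2_load))
```

With key-based routing alone, one reducer receives every row of the hot key; with the random first stage, no reducer sees more than a fraction of them, and the second stage only handles one partial record per key per first-stage reducer. The sketch does not model Spark tasks and does not explain why the job above still hangs.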



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
