[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126973#comment-16126973 ] Rui Li commented on HIVE-17287: --- [~kellyzly], group by w/ rollup and group by w/o rollup are different queries and it's normal that the shuffle read is different for different queries. I guess some of the group keys are skewed. And it'll also be good to verify whether HoS can properly handle group by w/ rollup. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: compare_groupby_groupby_rollup.png, > not_stages_completed_but_job_completed.PNG, query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126964#comment-16126964 ] liyunzhang_intel commented on HIVE-17287: - [~lirui] ,[~gopalv]: some update about skewed data group by. the problem happens when using group by with rollup. in [attached pic|https://issues.apache.org/jira/secure/attachment/12881892/compare_groupby_groupby_rollup.png], you can see the difference, if using group by with roll up, the shuffle read data is 96.8g while using group by, the shuffle read data is 68.1g. when i only using group by, the shuffle read metrics is very even. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: compare_groupby_groupby_rollup.png, > not_stages_completed_but_job_completed.PNG, query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125415#comment-16125415 ] liyunzhang_intel commented on HIVE-17287: - [~lirui]: When i viewed all tasks, saw that 1 task was still running. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: not_stages_completed_but_job_completed.PNG, > query67-fail-at-groupby.png, query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125371#comment-16125371 ] Rui Li commented on HIVE-17287: --- [~kellyzly], disabling {{spark.shuffle.reduceLocality.enabled}} can make reduce tasks more evenly distributed among executors. But as you said, it can't solve the skew in data. As to the screenshot, I don't know the cause either. What if you look at the metrics of all the tasks? Are they all in complete state? > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: not_stages_completed_but_job_completed.PNG, > query67-fail-at-groupby.png, query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125326#comment-16125326 ] liyunzhang_intel commented on HIVE-17287: - [~lirui]: in current case, i have not set {{hive.spark.use.groupby.shuffle}}, but i think the value is true because the default value is true. After disabling {{spark.shuffle.reduceLocality.enabled}}, i reran the query67. It showed passed. But one strange thing i found is [not all stages finished but the result of spark job is completed|https://issues.apache.org/jira/secure/attachment/12881692/not_stages_completed_but_job_completed.PNG].I don't know whether this is bug of spark. Meanwhile in spark history server, the shuffle read metrics are still skewed. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: not_stages_completed_but_job_completed.PNG, > query67-fail-at-groupby.png, query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125156#comment-16125156 ] Rui Li commented on HIVE-17287: --- The groupByKey shuffle uses unbounded memory. You can set {{hive.spark.use.groupby.shuffle=false}} to use MR shuffle instead. By default the config is true. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125155#comment-16125155 ] liyunzhang_intel commented on HIVE-17287: - [~lirui]: bq.Have you tried hive.spark.use.groupby.shuffle? I think it can avoid unbounded mem usage. I have not enabled {{hive.spark.use.groupby.shuffle}} in my cluster. Will try this configuration later. But why in HiveConf it says "Spark groupByKey transformation has better performance but uses unbounded memory". Will this use unbounded memory? bq.For the error you mentioned, I usually disable yarn.nodemanager.pmem-check-enabled as a workaround. have disabled this configuration in my cluster but error still occurred. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125130#comment-16125130 ] Rui Li commented on HIVE-17287: --- Have you tried {{hive.spark.use.groupby.shuffle}}? I think it can avoid unbounded mem usage. For the error you mentioned, I usually disable {{yarn.nodemanager.pmem-check-enabled}} as a workaround. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125125#comment-16125125 ] liyunzhang_intel commented on HIVE-17287: - [~xuefuz]: the memory related error is {noformat} Container killed by YARN for exceeding memory limits. 36.1 GB of 33 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {noformat} It showed it exceeded the memory assigned to the task. I can increase the value of spark.yarn.executor.memoryOverhead. But i guess even i increase the value, the error will appear again as the problem is the key is not even for some task in group by operation. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124263#comment-16124263 ] Xuefu Zhang commented on HIVE-17287: [~kellyzly], just curious, what error did you get for the failed tasks? Memory related? > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123204#comment-16123204 ] Rui Li commented on HIVE-17287: --- [~kellyzly], I mean you can check each of the group keys to see how they are skewed. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123123#comment-16123123 ] liyunzhang_intel commented on HIVE-17287: - [~lirui] : bq.You can run some statistics on the group key to confirm not very understand, you mean "add select count(i_category), i_category, to see the number of every key"? bq. what will the metrics look like if you enable hive.groupby.skewindata? before i enabled {{hive.groupby.skewindata}}, still hangs on the group by stage after sending the data randomly. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123036#comment-16123036 ] Rui Li commented on HIVE-17287: --- OK that seems a skewed shuffle to me. You can run some statistics on the group key to confirm, in case there's some issue like HIVE-17114. Besides, what will the metrics look like if you enable {{hive.groupby.skewindata}}? That optimization will shuffle twice for the group by. The 1st shuffle is partitioned randomly. You can verify it in the explain: {noformat} Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: rand() (type: double) Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE {noformat} > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123021#comment-16123021 ] liyunzhang_intel commented on HIVE-17287: - [~lirui]: attached is the [query67-groupby_shuffle_metric.png|https://issues.apache.org/jira/secure/attachment/12881418/query67-groupby_shuffle_metric.png] of group by stage. It seems that the shuffle read metrics is not even. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png, > query67-groupby_shuffle_metric.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122994#comment-16122994 ] Rui Li commented on HIVE-17287: --- To determine whether a shuffle is skewed, you need to look at the shuffle input metrics of tasks in that stage. In your case, that should be the tasks following the map-join. bq. the 5 tasks without any records will also be sent to next stage If the tasks have no output, then the downstream tasks won't fetch anything from these tasks. I'm not sure what you mean by "sent to next stage"? > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122984#comment-16122984 ] Rui Li commented on HIVE-17287: --- That config is enabled by default in 2.0: https://github.com/apache/spark/blob/v2.0.0/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L274 > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: query67-fail-at-groupby.png > > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122955#comment-16122955 ] liyunzhang_intel commented on HIVE-17287: - [~lirui]: thanks for comments {{spark.shuffle.reduceLocality.enabled}} is [disabled|https://issues.apache.org/jira/browse/SPARK-10087] in spark1.5 so I guess this value is false in my cluster because i use spark2.0. {quote} 11 map-join tasks will output data to be shuffled again. Only 6 tasks have data to output, and the other 5 tasks don't output because no records are generated by the map join.However, this doesn't mean the following shuffle is necessarily skewed. {quote} Yes, this does not mean the group by key is skewed. 1 thing i need to confirm with you is that the 5 tasks without any records will also be sent to next stage(groupby stage) even there is no record. That's why in spark history server some tasks spends nearly 0 seconds to finish while others spends several minutes to finish in groupby stage. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122929#comment-16122929 ] Rui Li commented on HIVE-17287: --- Hi [~kellyzly], I'm trying to understand how the group by is skewed. If you do group by after the map join, then the 11 map-join tasks will output data to be shuffled again. Only 6 tasks have data to output, and the other 5 tasks don't output because no records are generated by the map join. However, this doesn't mean the following shuffle is necessarily skewed. E.g. if you have 100 downstream tasks, they all can fetch from the 6 upstream tasks, as long as the grouping key is evenly distributed. So have you verified whether the group key is skewed? It'll be strange if the key is not skewed but the shuffle is. One possible reason is spark can somehow give reduce tasks location preference which may affect the case you described, you can trying setting {{spark.shuffle.reduceLocality.enabled=false}} to disable it. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122897#comment-16122897 ] liyunzhang_intel commented on HIVE-17287: - [~gopalv],[~lirui]: the result why the output of join is skewed is because I convert all join to map join. In following query, fact table is store_sales and dimension tables are date_dim,store and item. The total size of date_dim, store and item is smaller than the {{hive.auto.convert.join.noconditionaltask.size}}. Hive starts 11 map works to read store_sales and do map join. There is possibility that there is no records in one map work because no match data in other dimension tables with store_sales. {code} select i_category ,i_class ,i_brand ,i_product_name ,d_year ,d_qoy ,d_moy ,s_store_id ,store_sales.ss_sold_date_sk ,store_sales.ss_item_sk ,store_sales.ss_store_sk from store_sales ,date_dim ,store ,item where store_sales.ss_sold_date_sk=date_dim.d_date_sk and store_sales.ss_item_sk=item.i_item_sk and store_sales.ss_store_sk = store.s_store_sk and d_month_seq between 1193 and 1193+11; {code} It is reasonable that the result of map join is not even but is there any way to make it even? because it will cause the data assigned to the group by tasks is not even if group by operation follows the map join. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122840#comment-16122840 ] liyunzhang_intel commented on HIVE-17287: - [~gopalv] or [~lirui]: after enable "hive.optimize.ppd", the default partition /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ has been filtered, so will not load this part of data. But the group by is still skewed. Modify tpcds/query67.sql to output the result of join to view the result of join(before group by) is skewed or not {code} set hive.optimize.ppd=true; set spark.app.name="query677.ppd.true"; drop table if exists result_677; create table result_677 stored as TEXTFILE as select i_category ,i_class ,i_brand ,i_product_name ,d_year ,d_qoy ,d_moy ,s_store_id ,store_sales.ss_sold_date_sk ,store_sales.ss_item_sk ,store_sales.ss_store_sk from store_sales ,date_dim ,store ,item where store_sales.ss_sold_date_sk=date_dim.d_date_sk and store_sales.ss_item_sk=item.i_item_sk and store_sales.ss_store_sk = store.s_store_sk and d_month_seq between 1193 and 1193+11; {code} the result is {code} hadoop fs -du -h hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/ 105.5 M hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/00_0 46.8 M hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/01_0 4.0 M hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/02_0 47.4 M hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/03_0 215.1 M hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/04_0 77.7 M hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/05_0 0 hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/06_0 0 hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/07_0 0 hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/08_0 0 hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/09_0 0 hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/10_0 {code} The result of join is skewed. The biggest is 215M while the smallest is 0M. Is there way to make the output of join is even so that the following groupby will not skewed? > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121026#comment-16121026 ] Gopal V commented on HIVE-17287: {code} store_sales.ss_sold_date_sk=date_dim.d_date_sk {code} Implies ss_sold_date_sk is not null. There's another optimizer in Hive called PCR, which removes any Filter which is no-op. So there's some possibility that "explain extended " won't list the default partition at all. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121023#comment-16121023 ] liyunzhang_intel commented on HIVE-17287: - [~gopalv]: thanks for your comments {quote} As little as I know about Hive-on-Spark, Query67 does not read any row from the default partition in MR or Tez. {quote} the default_partition stores the data(25.7G) which ss_sold_date_sk is null. In MR/Tez, these part of data will not be load because there is no match data in join? {quote} I suspect HoS is loading each partition as an independent RDD, which removes the effect of SemanticAnalyzer::genNotNullFilterForJoinSourcePlan()? {quote} I also see the filter to filter null data in the explain, although hos load partition as an independent RDD, i think the filter should work in theory. part of explain of query67 {code} Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_store_sk is not null and ss_item_sk is not null) (type: boolean) Statistics: Num rows: 8251124389 Data size: 181524736558 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (ss_store_sk is not null and ss_item_sk is not null) (type: boolean) Statistics: Num rows: 8251124389 Data size: 181524736558 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: ss_item_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_sales_price (type: double), ss_sold_date_sk (type: bigint) outputColumnNames: _col0, _col1, _col2, _col3, _col4 Statistics: Num rows: 8251124389 Data size: 181524736558 Basic stats: COMPLETE Column stats: NONE Map Join Operator {code} > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121021#comment-16121021 ] Gopal V commented on HIVE-17287: bq. these part of data will not be load because there is no match data in join? Yes. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121018#comment-16121018 ] hsj commented on HIVE-17287: [~gopalv]: thanks for your comments {quote} As little as I know about Hive-on-Spark, Query67 does not read any row from the default partition in MR or Tez. {quote} the default_partition stores the data(25.7G) which ss_sold_date_sk is null. In MR/Tez, these part of data will not be load because there is no match data in join? > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121004#comment-16121004 ] Gopal V commented on HIVE-17287: [~ liyunzhang_intel]: I suspect HoS is loading each partition as an independent RDD, which removes the effect of SemanticAnalyzer::genNotNullFilterForJoinSourcePlan()? As little as I know about Hive-on-Spark, Query67 does not read any row from the default partition in MR or Tez. > HoS can not deal with skewed data group by > -- > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)