[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451620#comment-17451620 ]

zhangyangyang commented on SPARK-35332:
---------------------------------------

Hello, could someone please help with the last comment on https://issues.apache.org/jira/browse/MAPREDUCE-6944? Thank you very much.

> Not Coalesce shuffle partitions when cache table
> ------------------------------------------------
>
>                 Key: SPARK-35332
>                 URL: https://issues.apache.org/jira/browse/SPARK-35332
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>   Affects Versions: 3.0.1, 3.1.0, 3.1.1
>         Environment: latest spark version
>            Reporter: Xianghao Lu
>            Assignee: XiDuo You
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data:_
> for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > data.text
> _sql to reproduce the problem:_
> * create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
> * load data local inpath '/path/to/data.text' into table data_table;
> * CACHE TABLE test_cache_table AS
> SELECT str
> FROM
> (SELECT id,str FROM data_table
> )group by str;
> In the end you will see a stage with 200 tasks whose shuffle partitions
> are not coalesced, which wastes resources when the data size is small.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351426#comment-17351426 ]

Xianghao Lu commented on SPARK-35332:
-------------------------------------

Great, thank you very much for your work [~ulysses]
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344501#comment-17344501 ]

XiDuo You commented on SPARK-35332:
-----------------------------------

[~luxianghao] Now you can `set spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true` to make AQE optimize the cached plan.
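The setting named in the comment above could be combined with the reproduction query roughly as follows. This is only a sketch: the file name `cache_repro.sql` is hypothetical, it assumes `data_table` has already been created and loaded as in the issue description, and it assumes AQE itself also has to be switched on (`spark.sql.adaptive.enabled`) for the coalescing to take effect. The script only writes the SQL file; running it needs a Spark build that contains the fix (3.2.0 according to the issue).

```shell
#!/bin/sh
# Hypothetical file name; any writable path works.
sql_file="cache_repro.sql"

# Quoted heredoc delimiter: the SQL is written verbatim, with no shell expansion.
cat > "$sql_file" <<'EOF'
-- Assumption: AQE must be enabled for shuffle partitions to be coalesced at all.
SET spark.sql.adaptive.enabled=true;
-- Config named in the comment: lets AQE change the cached plan's output partitioning.
SET spark.sql.optimizer.canChangeCachedPlanOutputPartitioning=true;

-- Same query as in the issue description.
CACHE TABLE test_cache_table AS
SELECT str
FROM (SELECT id, str FROM data_table)
GROUP BY str;
EOF

echo "Wrote $sql_file; run it with: spark-sql -f $sql_file"
```

After running the file with `spark-sql -f`, the stage for the cached query can be checked in the Spark UI to see whether the task count still equals the configured number of shuffle partitions.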
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344218#comment-17344218 ]

Apache Spark commented on SPARK-35332:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32543
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341455#comment-17341455 ]

Apache Spark commented on SPARK-35332:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32482
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341313#comment-17341313 ]

Takeshi Yamamuro commented on SPARK-35332:
------------------------------------------

okay, sgtm. cc: [~cloud_fan] could you make a PR for that? Let's keep discussing it by referring to an implementation.
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341244#comment-17341244 ]

XiDuo You commented on SPARK-35332:
-----------------------------------

Adding a new cache-specific option to the CACHE statement seems a little complex, both to support and to use, and it would not work for the Dataset API (which means we would need to change code in two places). As with other session-wide configs, if users want different behavior for different queries within one session, they need to modify the config manually. So I think a session-wide config is easier to support and use.
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341233#comment-17341233 ]

Takeshi Yamamuro commented on SPARK-35332:
------------------------------------------

Yea, right. As [~ulysses] said, that's because the cache mechanism forcibly disables some optimisations that can implicitly change the output partitioning. I'm not sure that adding a new session-wide SQL config is a good option, because how a cache is referenced depends on the user's use case; for example, the output partitioning may not matter for some caches but may matter a lot for others. As another idea, how about adding a new cache-specific option to the CACHE statement?
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L227-L228
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341107#comment-17341107 ]

XiDuo You commented on SPARK-35332:
-----------------------------------

The reason is that Spark forcibly disables AQE while executing the cached plan, so that the output partitioning of the cached plan can be reused. IMO we can put this behavior behind a new SQL config so that users have a choice. Also cc [~maropu]
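To put a rough number on the waste reported in this issue, here is a back-of-the-envelope sketch. It assumes the shuffle is hash-partitioned on the GROUP BY column `str`, and that the 200 tasks come from the default value of `spark.sql.shuffle.partitions`. It does not run Spark; it only regenerates the sample rows from the issue and counts the distinct grouping keys.

```shell
#!/bin/sh
# Task count reported in the issue (also the default of spark.sql.shuffle.partitions).
shuffle_partitions=200

# Regenerate the 20 sample rows and count distinct values of the
# grouping column, which is the second comma-separated field.
distinct_keys=$(for i in $(seq 20); do echo "$((i+10)),name$i,$((i*10))"; done \
  | cut -d, -f2 | sort -u | wc -l | tr -d ' ')

# Each key is sent to exactly one shuffle partition, so at most
# distinct_keys partitions can receive any rows.
min_empty_tasks=$((shuffle_partitions - distinct_keys))

echo "distinct keys: $distinct_keys"
echo "tasks with no input rows: at least $min_empty_tasks of $shuffle_partitions"
```

With no coalescing, most tasks in that stage are scheduled only to find nothing to process, which is the overhead the reporter describes for small inputs.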
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340596#comment-17340596 ]

Xianghao Lu commented on SPARK-35332:
-------------------------------------

cc [~maryannxue] [~cloud_fan]