[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-12-01 Thread zhangyangyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451620#comment-17451620
 ] 

zhangyangyang commented on SPARK-35332:
---

Hello, could you please help with my last comment on 
https://issues.apache.org/jira/browse/MAPREDUCE-6944? Thank you very much.

> Not Coalesce shuffle partitions when cache table
> 
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
> Reporter: Xianghao Lu
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.2.0
>
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _Linux shell command to prepare the data:_
>  for i in $(seq 20); do echo "$(($i+10)),name$i,$(($i*10))"; done > data.text
> _SQL to reproduce the problem:_
>  * create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
>  * load data local inpath '/path/to/data.text' into table data_table;
>  * CACHE TABLE test_cache_table AS
>  SELECT str
>  FROM
>  (SELECT id,str FROM data_table
>  )group by str;
> The resulting stage runs with 200 tasks and its shuffle partitions are not 
> coalesced, which wastes resources when the data size is small.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-25 Thread Xianghao Lu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351426#comment-17351426
 ] 

Xianghao Lu commented on SPARK-35332:
-

Great, thank you very much for your work [~ulysses]




[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-14 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344501#comment-17344501
 ] 

XiDuo You commented on SPARK-35332:
---

[~luxianghao] You can now 
`set spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true` 
to let AQE optimize the cached plan.
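
For Dataset API users, the same setting can be applied from code. This is a minimal sketch, assuming a PySpark `SparkSession` on Spark 3.2.0 or later (the Fix For version of this issue). The helper name `cache_with_aqe` is made up for illustration, and AQE is switched on explicitly in case the session has disabled it.

```python
# Hypothetical helper, not part of Spark: applies the config from this issue
# to the current session and then runs a CACHE statement.
CONF_KEY = "spark.sql.optimizer.canChangeCachedPlanOutputPartitioning"


def cache_with_aqe(spark, cache_sql):
    """Let AQE optimize the cached plan, then execute `cache_sql`.

    `spark` is expected to be a SparkSession (anything exposing
    `conf.set(key, value)` and `sql(query)` works).
    """
    # AQE must be on for shuffle partitions to be coalesced at all.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Allow the cached plan's output partitioning to change (Spark 3.2.0+).
    spark.conf.set(CONF_KEY, "true")
    return spark.sql(cache_sql)


# Usage with a real session (assumes pyspark is installed and data_table exists):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   cache_with_aqe(
#       spark,
#       "CACHE TABLE test_cache_table AS "
#       "SELECT str FROM (SELECT id, str FROM data_table) GROUP BY str",
#   )
```

Because the config is session-wide, it stays in effect for later CACHE statements in the same session until it is set back to `false`.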




[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344218#comment-17344218
 ] 

Apache Spark commented on SPARK-35332:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32543




[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341455#comment-17341455
 ] 

Apache Spark commented on SPARK-35332:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32482




[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-08 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341313#comment-17341313
 ] 

Takeshi Yamamuro commented on SPARK-35332:
--

Okay, sgtm. cc: [~cloud_fan]

Could you make a PR for that? Let's continue the discussion with a concrete 
implementation to refer to.




[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-08 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341244#comment-17341244
 ] 

XiDuo You commented on SPARK-35332:
---

Adding a new cache-specific option to the CACHE statement seems a bit complex, 
both to support and to use, and it would not work for the Dataset API, which 
means we would need to change the code in two places. As with other 
session-wide configs, users who want different behavior for different queries 
in one session need to modify the config manually. So I think a session-wide 
config is easier to support and to use.




[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-08 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341233#comment-17341233
 ] 

Takeshi Yamamuro commented on SPARK-35332:
--

Yea, right. As [~ulysses] said, that's because the cache mechanism forcibly 
disables some optimisations that can change output partitions implicitly. I'm 
not sure that adding a new session-wide SQL config is a good option, because 
how a cache is referenced depends on the user's use case; for example, the 
output partitioning of some caches may not matter, while that of others may 
matter a lot. As another idea, how about adding a new cache-specific option to 
the CACHE statement? 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L227-L228




[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-07 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341107#comment-17341107
 ] 

XiDuo You commented on SPARK-35332:
---

The reason is that Spark forcibly disables AQE while executing the cached plan, 
so that the output partitioning of the cached plan can be reused. IMO we can 
put this behavior behind a new SQL config so that users have a choice. Also cc 
[~maropu]
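
To illustrate what is lost when AQE is disabled for the cached plan, here is a simplified model of partition coalescing. It is only a sketch, not Spark's actual implementation: the function `coalesce_partitions`, the 16-byte row size, and the way rows are spread over partitions are all assumptions made for the example. The 200 partitions and the 64 MB target mirror the defaults of `spark.sql.shuffle.partitions` and `spark.sql.adaptive.advisoryPartitionSizeInBytes`.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily merge adjacent partitions while a group stays within target_size.

    Returns groups of original partition indices; each group would run as one task.
    A single partition larger than target_size is kept as its own group.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group if this partition would push it past the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Model of the reproduction: 20 small rows spread over 200 shuffle partitions,
# so most partitions are empty (row size and placement are made up).
sizes = [0] * 200
for row in range(20):
    sizes[(row * 7) % 200] += 16

groups = coalesce_partitions(sizes, target_size=64 * 1024 * 1024)
print(len(sizes), "->", len(groups))  # prints "200 -> 1"
```

In this model the 200 mostly empty partitions collapse into a single task, which is the saving the reporter expected for such a small table. It also shows why the cached plan's output partitioning changes once coalescing is allowed.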




[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-07 Thread Xianghao Lu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340596#comment-17340596
 ] 

Xianghao Lu commented on SPARK-35332:
-

cc [~maryannxue]  [~cloud_fan]
