[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2023-06-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736041#comment-17736041
 ] 

ASF GitHub Bot commented on SPARK-38230:


User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/41628

> InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38230
>                 URL: https://issues.apache.org/jira/browse/SPARK-38230
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.2, 3.3.0, 3.4.0, 3.5.0
>            Reporter: Coal Chan
>            Priority: Major
>
> In 
> `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`,
> `sparkSession.sessionState.catalog.listPartitions` calls the method 
> `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` of the Hive 
> metastore client, which issues multiple queries per partition against the 
> Hive metastore database. So when you insert into a table that has many 
> partitions (e.g. 10k), it produces a very large number of queries on the 
> metastore database (e.g. n queries per partition * 10k partitions = 10k * n 
> queries), which puts a lot of strain on the database.
> The command calls `listPartitions` only to get the locations of partitions 
> and compute `customPartitionLocations`. In most cases, however, there are no 
> custom partition locations and the partition names are enough, so it can 
> call `listPartitionNames` instead.
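
The arithmetic in the description can be illustrated with a toy model. This is only a sketch: `ToyMetastore`, its method names, the value n = 3, and the assumption that a names-only listing costs a single query are all hypothetical and are not taken from Spark or Hive.

```python
from dataclasses import dataclass


@dataclass
class ToyMetastore:
    """In-memory stand-in for a metastore that counts backing-database queries."""

    partition_locations: dict          # partition name -> location
    queries_per_partition: int = 3     # hypothetical value of "n" from the description
    query_count: int = 0

    def list_partitions(self):
        # Full details: modeled as n database queries for every partition.
        self.query_count += self.queries_per_partition * len(self.partition_locations)
        return dict(self.partition_locations)

    def list_partition_names(self):
        # Names only: modeled as one query (an assumption of this sketch).
        self.query_count += 1
        return sorted(self.partition_locations)


def make_store(num_partitions):
    """Build a toy metastore whose partitions all use default-style locations."""
    locations = {
        f"dt={i:05d}": f"/warehouse/t/dt={i:05d}" for i in range(num_partitions)
    }
    return ToyMetastore(locations)


full = make_store(10_000)
full.list_partitions()

names_only = make_store(10_000)
names_only.list_partition_names()

print(full.query_count)        # 30000, i.e. n * 10k with n = 3
print(names_only.query_count)  # 1
```

Under these assumptions the cost of the full listing grows linearly with the number of partitions, while the names-only listing does not.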



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2023-06-20 Thread jeanlyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735519#comment-17735519
 ] 

jeanlyn commented on SPARK-38230:
-

We found that the Hive metastore crashed frequently after we upgraded Spark 
from 2.4.7 to 3.3.2. After investigating, I found that 
`InsertIntoHadoopFsRelationCommand` pulls all partitions when 
dynamicPartitionOverwrite is used. I only came across this issue after we had 
already solved the problem in our environment by deriving the partitions from 
the generated paths. So I have submitted a new pull request, hoping it helps.
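
As a rough illustration of deriving partitions from generated paths, the sketch below parses Hive-style `column=value` directory segments into a partition spec. The function name, the example paths, and the use of percent-decoding for escaped characters are assumptions made for this example; it is not the code from the pull request.

```python
from urllib.parse import unquote


def partition_spec_from_path(output_path, table_root):
    """Return the partition spec encoded in a partition directory path.

    Both arguments are assumed to be directory paths using "/" separators.
    """
    root = table_root.rstrip("/")
    path = output_path.rstrip("/")
    if path != root and not path.startswith(root + "/"):
        raise ValueError(f"{output_path!r} is not under {table_root!r}")

    relative = path[len(root):].strip("/")
    spec = {}
    for segment in relative.split("/"):
        if "=" not in segment:
            continue  # not a column=value directory, so skip it
        column, _, value = segment.partition("=")
        # Percent-decoding approximates path-name unescaping (assumption).
        spec[unquote(column)] = unquote(value)
    return spec


written = [
    "/warehouse/sales/dt=2023-06-20/hour=01",
    "/warehouse/sales/dt=2023-06-20/hour=02",
]
specs = [partition_spec_from_path(p, "/warehouse/sales") for p in written]
print(specs[0])  # {'dt': '2023-06-20', 'hour': '01'}
```

With specs obtained this way, only the partitions that the job actually wrote need to be considered, rather than every partition of the table.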




[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2023-01-17 Thread Gabor Roczei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677683#comment-17677683
 ] 

Gabor Roczei commented on SPARK-38230:
--

Hi [~ximz],

> [~roczei] Can you please review the PR and let me know if I missed anything? 
> Thank you.

I will try to allocate some time for this next week. 




[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2023-01-15 Thread Xiaomin Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677162#comment-17677162
 ] 

Xiaomin Zhang commented on SPARK-38230:
---

Hello [~coalchan], thanks for working on this. I created a PR based on your 
work with some improvements as per [~Jackey Lee]'s comment. We no longer need 
a new parameter, and Spark will only invoke listPartitions when overwriting 
Hive static partitions.
[~roczei] Can you please review the PR and let me know if I missed anything? 
Thank you.
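
One possible reading of the rule described above, written as a small decision function. This is a hypothetical sketch: the names `Listing` and `choose_listing` and the exact condition (an overwrite with a non-empty static partition spec) are an interpretation of the comment, not the logic of the actual PR.

```python
from enum import Enum


class Listing(Enum):
    """Which metastore listing the insert command would use."""

    NAMES_ONLY = "listPartitionNames"
    FULL_DETAILS = "listPartitions"


def choose_listing(overwrite, static_partitions):
    """Pick the cheaper names-only listing unless partition details are needed.

    Assumption: details (locations) are needed only when the insert
    overwrites data and names at least one static partition value.
    """
    if overwrite and static_partitions:
        return Listing.FULL_DETAILS
    return Listing.NAMES_ONLY


# Overwriting a static partition: full details are fetched.
print(choose_listing(True, {"dt": "2023-01-15"}).value)
# Plain append into the same partition: names are enough.
print(choose_listing(False, {"dt": "2023-01-15"}).value)
```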




[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2023-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677159#comment-17677159
 ] 

Apache Spark commented on SPARK-38230:
--

User 'czxm' has created a pull request for this issue:
https://github.com/apache/spark/pull/39595




[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-11-22 Thread Gabor Roczei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637096#comment-17637096
 ] 

Gabor Roczei commented on SPARK-38230:
--

Hi [~coalchan],

[Your pull request|https://github.com/apache/spark/pull/35549] has been 
automatically closed by the GitHub action. If you agree, I would like to create 
a new pull request based on yours and continue working on this.
 




[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493791#comment-17493791
 ] 

Apache Spark commented on SPARK-38230:
--

User 'coalchan' has created a pull request for this issue:
https://github.com/apache/spark/pull/35549




[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493790#comment-17493790
 ] 

Apache Spark commented on SPARK-38230:
--

User 'coalchan' has created a pull request for this issue:
https://github.com/apache/spark/pull/35549
