[jira] [Issue Comment Deleted] (SPARK-33584) Partition predicate pushdown into Hive metastore support cast string type to date type

2020-11-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33584:

Comment: was deleted

(was: This change should come after SPARK-33581. I have prepared the PR: 
https://github.com/wangyum/spark/tree/SPARK-33584)

> Partition predicate pushdown into Hive metastore support cast string type to 
> date type
> --
>
> Key: SPARK-33584
> URL: https://issues.apache.org/jira/browse/SPARK-33584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("create table t1(id string) partitioned by (part string) stored as 
> parquet")
> spark.sql("insert into t1 values('1', '2019-01-01')")
> spark.sql("insert into t1 values('2', '2019-01-02')")
> spark.sql("select * from t1 where  part = date '2019-01-01' ").show
> {code}
> We can push down {{cast(part as date) = date '2019-01-01'}} to the Hive metastore.
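As a hedged illustration only (this is not Spark's actual HiveShim code; the mini expression type, 
helper name, and quoting below are invented for the sketch), the conversion amounts to dropping the 
cast and comparing the raw string partition value against the date literal rendered in yyyy-MM-dd form:

{code:scala}
import java.time.LocalDate

// Hypothetical mini-AST just for this sketch; Spark works on Catalyst expressions instead.
sealed trait PartitionPredicate
case class CastedDateEquals(column: String, value: LocalDate) extends PartitionPredicate

// Render the predicate as a metastore-compatible filter string by dropping the cast and
// comparing against the literal's yyyy-MM-dd string form (quoting style is illustrative).
def toMetastoreFilter(p: PartitionPredicate): Option[String] = p match {
  case CastedDateEquals(column, value) => Some(s"$column = \"$value\"")
}

// toMetastoreFilter(CastedDateEquals("part", LocalDate.parse("2019-01-01")))
//   => Some(part = "2019-01-01")
{code}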



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-33582) Partition predicate pushdown into Hive metastore support not-equals

2020-11-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33582:

Comment: was deleted

(was: This change should come after SPARK-33581. I have prepared the PR: 
https://github.com/wangyum/spark/tree/SPARK-33582)

> Partition predicate pushdown into Hive metastore support not-equals
> ---
>
> Key: SPARK-33582
> URL: https://issues.apache.org/jira/browse/SPARK-33582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207
> https://issues.apache.org/jira/browse/HIVE-2702
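For context, here is the kind of query this would help, reusing the example table from the sibling 
ticket SPARK-33584 (the table and data are illustrative only): with not-equals support, a filter 
string along the lines of part <> "2019-01-01" could be sent to the metastore instead of fetching 
all partitions.

{code:scala}
// Illustrative table, borrowed from SPARK-33584's example.
spark.sql("create table t1(id string) partitioned by (part string) stored as parquet")
spark.sql("insert into t1 values('1', '2019-01-01')")
spark.sql("insert into t1 values('2', '2019-01-02')")

// With not-equals pushdown, partition pruning for this query could happen
// in the Hive metastore rather than on the Spark side.
spark.sql("select * from t1 where part != '2019-01-01'").show()
{code}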



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33584) Partition predicate pushdown into Hive metastore support cast string type to date type

2020-11-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240174#comment-17240174
 ] 

Apache Spark commented on SPARK-33584:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30535

> Partition predicate pushdown into Hive metastore support cast string type to 
> date type
> --
>
> Key: SPARK-33584
> URL: https://issues.apache.org/jira/browse/SPARK-33584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("create table t1(id string) partitioned by (part string) stored as 
> parquet")
> spark.sql("insert into t1 values('1', '2019-01-01')")
> spark.sql("insert into t1 values('2', '2019-01-02')")
> spark.sql("select * from t1 where  part = date '2019-01-01' ").show
> {code}
> We can push down {{cast(part as date) = date '2019-01-01'}} to the Hive metastore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33582) Partition predicate pushdown into Hive metastore support not-equals

2020-11-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240173#comment-17240173
 ] 

Apache Spark commented on SPARK-33582:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30534

> Partition predicate pushdown into Hive metastore support not-equals
> ---
>
> Key: SPARK-33582
> URL: https://issues.apache.org/jira/browse/SPARK-33582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207
> https://issues.apache.org/jira/browse/HIVE-2702



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-28 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240171#comment-17240171
 ] 

Yuming Wang commented on SPARK-33581:
-

Issue resolved by pull request 30525
https://github.com/apache/spark/pull/30525

> Refactor HivePartitionFilteringSuite
> 
>
> Key: SPARK-33581
> URL: https://issues.apache.org/jira/browse/SPARK-33581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Refactor HivePartitionFilteringSuite to make it easier to maintain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-33581.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30525

> Refactor HivePartitionFilteringSuite
> 
>
> Key: SPARK-33581
> URL: https://issues.apache.org/jira/browse/SPARK-33581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Refactor HivePartitionFilteringSuite to make it easier to maintain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33581:

Comment: was deleted

(was: User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30525)

> Refactor HivePartitionFilteringSuite
> 
>
> Key: SPARK-33581
> URL: https://issues.apache.org/jira/browse/SPARK-33581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Refactor HivePartitionFilteringSuite to make it easier to maintain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-33581:
---

Assignee: Yuming Wang

> Refactor HivePartitionFilteringSuite
> 
>
> Key: SPARK-33581
> URL: https://issues.apache.org/jira/browse/SPARK-33581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Refactor HivePartitionFilteringSuite to make it easier to maintain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33380) Incorrect output from example script pi.py

2020-11-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33380:


Assignee: (was: Apache Spark)

> Incorrect output from example script pi.py
> --
>
> Key: SPARK-33380
> URL: https://issues.apache.org/jira/browse/SPARK-33380
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.4.6
>Reporter: Milind V Damle
>Priority: Minor
>
>  
> I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 
> worker nodes. To test the installation, I ran the 
> $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. 
> Three runs produced the following output:
>  
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.149880
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.137760
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.155640
>  
> I noted that the computed value of Pi varies with each run.
> Next, I ran the same script 3 more times with a higher number of partitions 
> (16). The following output was noted.
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.141100
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.137720
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.145660
>  
> Again, I noted that the computed value of Pi varies with each run. 
>  
> IMO, there are 2 issues with this example script:
> 1. The output (value of pi) is non-deterministic because the script uses 
> random.random(). 
> 2. Specifying the number of partitions (accepted as a command-line argument) 
> has no observable positive impact on the accuracy or precision. 
>  
> It may be argued that the intent of these example scripts is simply to 
> demonstrate how to use Spark as well as offer a means to quickly verify an 
> installation. However, we can achieve that objective without compromising on 
> the accuracy or determinism of the computed value. Unless the user examines 
> the script and understands that the use of random.random() (to generate random 
> points within the top-right quadrant of the circle) is the reason behind the 
> non-determinism, it seems confusing at first that the value varies per run 
> and is also inaccurate. Someone may (incorrectly) infer that this is a 
> limitation of the framework!
>  
> To mitigate this, I wrote an alternate version to compute pi using a partial 
> sum of terms from an infinite series. This script is both deterministic and 
> can produce more accurate output if the user configures it to use more terms. 
> To me, that behavior feels intuitive and logical. I will be happy to share it 
> if it is appropriate.
>  
> Best regards,
> Milind
>  
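As an aside, a deterministic variant along the lines the reporter describes can be sketched in a 
few lines. The snippet below is a spark-shell style Scala sketch (it is not the reporter's script, 
which is not included in this thread, and the number of terms is chosen arbitrarily): it sums the 
first N terms of the Leibniz series, so for a fixed partitioning every run prints the same value, 
and more terms give a more accurate result.

{code:scala}
// pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...): a partial sum of the Leibniz series.
// Deterministic for a fixed partitioning, unlike the Monte Carlo estimate in pi.py.
val terms = 10000000L  // more terms => better accuracy; value chosen arbitrarily here
val pi = 4.0 * spark.sparkContext
  .range(0L, terms)
  .map(k => (if (k % 2 == 0) 1.0 else -1.0) / (2 * k + 1))
  .reduce(_ + _)
println(s"Pi is roughly $pi")
{code}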



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33380) Incorrect output from example script pi.py

2020-11-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240169#comment-17240169
 ] 

Apache Spark commented on SPARK-33380:
--

User 'milindvdamle' has created a pull request for this issue:
https://github.com/apache/spark/pull/30533

> Incorrect output from example script pi.py
> --
>
> Key: SPARK-33380
> URL: https://issues.apache.org/jira/browse/SPARK-33380
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.4.6
>Reporter: Milind V Damle
>Priority: Minor
>
>  
> I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 
> worker nodes. To test the installation, I ran the 
> $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. 
> Three runs produced the following output:
>  
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.149880
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.137760
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.155640
>  
> I noted that the computed value of Pi varies with each run.
> Next, I ran the same script 3 more times with a higher number of partitions 
> (16). The following output was noted.
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.141100
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.137720
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.145660
>  
> Again, I noted that the computed value of Pi varies with each run. 
>  
> IMO, there are 2 issues with this example script:
> 1. The output (value of pi) is non-deterministic because the script uses 
> random.random(). 
> 2. Specifying the number of partitions (accepted as a command-line argument) 
> has no observable positive impact on the accuracy or precision. 
>  
> It may be argued that the intent of these example scripts is simply to 
> demonstrate how to use Spark as well as offer a means to quickly verify an 
> installation. However, we can achieve that objective without compromising on 
> the accuracy or determinism of the computed value. Unless the user examines 
> the script and understands that the use of random.random() (to generate random 
> points within the top-right quadrant of the circle) is the reason behind the 
> non-determinism, it seems confusing at first that the value varies per run 
> and is also inaccurate. Someone may (incorrectly) infer that this is a 
> limitation of the framework!
>  
> To mitigate this, I wrote an alternate version to compute pi using a partial 
> sum of terms from an infinite series. This script is both deterministic and 
> can produce more accurate output if the user configures it to use more terms. 
> To me, that behavior feels intuitive and logical. I will be happy to share it 
> if it is appropriate.
>  
> Best regards,
> Milind
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33380) Incorrect output from example script pi.py

2020-11-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33380:


Assignee: Apache Spark

> Incorrect output from example script pi.py
> --
>
> Key: SPARK-33380
> URL: https://issues.apache.org/jira/browse/SPARK-33380
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.4.6
>Reporter: Milind V Damle
>Assignee: Apache Spark
>Priority: Minor
>
>  
> I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 
> worker nodes. To test the installation, I ran the 
> $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. 
> Three runs produced the following output:
>  
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.149880
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.137760
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.155640
>  
> I noted that the computed value of Pi varies with each run.
> Next, I ran the same script 3 more times with a higher number of partitions 
> (16). The following output was noted.
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.141100
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.137720
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.145660
>  
> Again, I noted that the computed value of Pi varies with each run. 
>  
> IMO, there are 2 issues with this example script:
> 1. The output (value of pi) is non-deterministic because the script uses 
> random.random(). 
> 2. Specifying the number of partitions (accepted as a command-line argument) 
> has no observable positive impact on the accuracy or precision. 
>  
> It may be argued that the intent of these example scripts is simply to 
> demonstrate how to use Spark as well as offer a means to quickly verify an 
> installation. However, we can achieve that objective without compromising on 
> the accuracy or determinism of the computed value. Unless the user examines 
> the script and understands that the use of random.random() (to generate random 
> points within the top-right quadrant of the circle) is the reason behind the 
> non-determinism, it seems confusing at first that the value varies per run 
> and is also inaccurate. Someone may (incorrectly) infer that this is a 
> limitation of the framework!
>  
> To mitigate this, I wrote an alternate version to compute pi using a partial 
> sum of terms from an infinite series. This script is both deterministic and 
> can produce more accurate output if the user configures it to use more terms. 
> To me, that behavior feels intuitive and logical. I will be happy to share it 
> if it is appropriate.
>  
> Best regards,
> Milind
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-11-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33564.
---
Resolution: Invalid

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> http://192.168.0.6:4040}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"}
>  0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the 
> following problem:
> {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}}
> {{  [... stripped HTML of the Spark Master web UI page ("Spark Master at}}
> {{  spark://MacBook-Pro-de-Paulo-2.local:7077", version 3.0.0) instead of metrics ...]}}
>  ...
> {quote}
> Instead of the metrics I'm getting an HTML page.  The same happens for all of 
> those here:
> {quote}{{$ curl -s [http://localhost:8080/metrics/applications/prometheus/]}}
>  {{$ curl -s [http://localhost:8081/metrics/prometheus/]}}
> {quote}
> Instead, *I expected metrics in Prometheus format*. All related 

[jira] [Commented] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-11-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240098#comment-17240098
 ] 

Dongjoon Hyun commented on SPARK-33564:
---

The following is the correct way on Apache Spark 3.1. I'd like to recommend two 
things: (1) use `/` at the end of the URL, and (2) have `conf/metrics.properties` in 
place before running `sbin/start-master.sh`.

*Apache Spark 3.0.1*
{code}
$ cat conf/metrics.properties
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus

$ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to 
/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/logs/spark-dongjoon-org.apache.spark.deploy.master.Master-1-AppleMBP19.local.out

$ curl -s http://localhost:8080/metrics/master/prometheus/ | head -n3
metrics_master_aliveWorkers_Number{type="gauges"} 0
metrics_master_aliveWorkers_Value{type="gauges"} 0
metrics_master_apps_Number{type="gauges"} 0
{code}

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> http://192.168.0.6:4040}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"}
>  0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the 
> following problem:
> {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}}
> {{  [... stripped HTML of the Spark Master web UI page instead of metrics ...]}}

[jira] [Commented] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-11-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240096#comment-17240096
 ] 

Dongjoon Hyun commented on SPARK-33564:
---

[~paulo.castro], it looks like you missed the trailing `/`, didn't you?

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> http://192.168.0.6:4040}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"}
>  0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the 
> following problem:
> {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}}
> {{  [... stripped HTML of the Spark Master web UI page ("Spark Master at}}
> {{  spark://MacBook-Pro-de-Paulo-2.local:7077", version 3.0.0) instead of metrics ...]}}
>  ...
> {quote}
> Instead of the metrics I'm getting an HTML page.  The same happens for all of 
> those here:
> {quote}{{$ curl -s [http://localhost:8080/metrics/applications/prometheus/]}}
>  {{$ curl -s 

[jira] [Updated] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column

2020-11-28 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33585:
---
Affects Version/s: (was: 2.4.7)
   (was: 3.0.1)
   3.0.2
   2.4.8

> The comment for SQLContext.tables() doesn't mention the `database` column
> -
>
> Key: SPARK-33585
> URL: https://issues.apache.org/jira/browse/SPARK-33585
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The comment says: "The returned DataFrame has two columns, tableName and 
> isTemporary":
> https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664
> but actually the dataframe has 3 columns:
> {code:scala}
> scala> spark.range(10).createOrReplaceTempView("view1")
> scala> val tables = spark.sqlContext.tables()
> tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string 
> ... 1 more field]
> scala> tables.printSchema
> root
>  |-- database: string (nullable = false)
>  |-- tableName: string (nullable = false)
>  |-- isTemporary: boolean (nullable = false)
> scala> tables.show
> +--------+---------+-----------+
> |database|tableName|isTemporary|
> +--------+---------+-----------+
> | default|       t1|      false|
> | default|       t2|      false|
> | default|      ymd|      false|
> |        |    view1|       true|
> +--------+---------+-----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33580) resolveDependencyPaths should use classifier attribute of artifact

2020-11-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33580.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30524
[https://github.com/apache/spark/pull/30524]

> resolveDependencyPaths should use classifier attribute of artifact
> --
>
> Key: SPARK-33580
> URL: https://issues.apache.org/jira/browse/SPARK-33580
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> `resolveDependencyPaths` currently uses the artifact type to decide whether to add 
> the "-tests" postfix. However, the Ivy path pattern in `resolveMavenCoordinates` is 
> "[organization]_[artifact]-[revision](-[classifier]).[ext]", so we should use the 
> classifier instead of the type to construct the file path.
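To make the description concrete, here is a small standalone sketch (not Spark's actual code; the 
case class and helper below are invented for illustration) of building the local jar path from the 
Ivy retrieve pattern using the classifier rather than the artifact type:

{code:scala}
// Ivy retrieve pattern from the description:
//   [organization]_[artifact]-[revision](-[classifier]).[ext]
case class IvyArtifact(organization: String, name: String, revision: String,
                       classifier: Option[String], ext: String = "jar")

def dependencyPath(root: String, a: IvyArtifact): String = {
  // Use the classifier (e.g. "tests"), not the artifact type, to build the optional
  // suffix, so the constructed path matches the file Ivy actually wrote.
  val suffix = a.classifier.map("-" + _).getOrElse("")
  s"$root/${a.organization}_${a.name}-${a.revision}$suffix.${a.ext}"
}

// dependencyPath("/tmp/ivy/jars", IvyArtifact("org.foo", "bar", "1.0", Some("tests")))
//   => /tmp/ivy/jars/org.foo_bar-1.0-tests.jar
{code}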



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`

2020-11-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33588:


Assignee: Apache Spark

> Partition spec in SHOW TABLE EXTENDED doesn't respect 
> `spark.sql.caseSensitive`
> ---
>
> Key: SPARK-33588
> URL: https://issues.apache.org/jira/browse/SPARK-33588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> For example:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > partitioned by (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
> Error in query: Partition spec is invalid. The spec (YEAR, Month) must match 
> the partition spec (year, month) defined in table '`default`.`tbl1`';
> {code}
> The spark.sql.caseSensitive flag is false by default, so the partition spec 
> is valid.
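A possible shape of the fix, shown only as a hedged sketch (the function below is invented for 
illustration and is not the actual Spark change): resolve each user-supplied partition key against 
the table's partition columns with case-insensitive matching when spark.sql.caseSensitive is false.

{code:scala}
// Normalize a user-supplied partition spec against the table's partition columns.
def normalizePartitionSpec(
    userSpec: Map[String, String],
    partitionCols: Seq[String],
    caseSensitive: Boolean): Map[String, String] = {
  userSpec.map { case (key, value) =>
    val resolved = partitionCols
      .find(col => if (caseSensitive) col == key else col.equalsIgnoreCase(key))
      .getOrElse(throw new IllegalArgumentException(
        s"$key is not a partition column in (${partitionCols.mkString(", ")})"))
    resolved -> value
  }
}

// normalizePartitionSpec(Map("YEAR" -> "2015", "Month" -> "1"), Seq("year", "month"), caseSensitive = false)
//   => Map(year -> 2015, month -> 1)
{code}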



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`

2020-11-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33588:


Assignee: (was: Apache Spark)

> Partition spec in SHOW TABLE EXTENDED doesn't respect 
> `spark.sql.caseSensitive`
> ---
>
> Key: SPARK-33588
> URL: https://issues.apache.org/jira/browse/SPARK-33588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> For example:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > partitioned by (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
> Error in query: Partition spec is invalid. The spec (YEAR, Month) must match 
> the partition spec (year, month) defined in table '`default`.`tbl1`';
> {code}
> The spark.sql.caseSensitive flag is false by default, so the partition spec 
> is valid.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`

2020-11-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240045#comment-17240045
 ] 

Apache Spark commented on SPARK-33588:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30529

> Partition spec in SHOW TABLE EXTENDED doesn't respect 
> `spark.sql.caseSensitive`
> ---
>
> Key: SPARK-33588
> URL: https://issues.apache.org/jira/browse/SPARK-33588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> For example:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > partitioned by (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
> Error in query: Partition spec is invalid. The spec (YEAR, Month) must match 
> the partition spec (year, month) defined in table '`default`.`tbl1`';
> {code}
> The spark.sql.caseSensitive flag is false by default, so the partition spec 
> is valid.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`

2020-11-28 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33588:
--

 Summary: Partition spec in SHOW TABLE EXTENDED doesn't respect 
`spark.sql.caseSensitive`
 Key: SPARK-33588
 URL: https://issues.apache.org/jira/browse/SPARK-33588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.8, 3.0.2, 3.1.0
Reporter: Maxim Gekk


For example:
{code:sql}
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
 > USING parquet
 > partitioned by (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
Error in query: Partition spec is invalid. The spec (YEAR, Month) must match 
the partition spec (year, month) defined in table '`default`.`tbl1`';
{code}
The spark.sql.caseSensitive flag is false by default, so the partition spec is 
valid.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33587) Kill the executor on nested fatal errors

2020-11-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33587:


Assignee: Apache Spark

> Kill the executor on nested fatal errors
> 
>
> Key: SPARK-33587
> URL: https://issues.apache.org/jira/browse/SPARK-33587
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> Currently we kill the executor when hitting a fatal error. However, if the 
> fatal error is wrapped by another exception, such as:
> - java.util.concurrent.ExecutionException, 
> com.google.common.util.concurrent.UncheckedExecutionException, or 
> com.google.common.util.concurrent.ExecutionError when using Guava caches and 
> Java thread pools;
> - SparkException thrown from this line: 
> https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231
> we will still keep the executor running. Fatal errors are usually 
> unrecoverable (such as OutOfMemoryError), and some components may be in a 
> broken state after hitting one. Hence, it's better to detect such nested 
> fatal errors as well and kill the executor, so that we can rely on Spark's 
> fault tolerance to recover.
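One way to express the idea, as a rough sketch only (the helper below is invented here and is not 
the actual patch): walk the exception's cause chain up to a bounded depth and treat the whole 
failure as fatal if any link in the chain is a fatal error.

{code:scala}
import scala.annotation.tailrec
import scala.util.control.NonFatal

// Returns true if `t`, or any of its causes up to `depth` levels deep, is fatal.
// scala.util.control.NonFatal treats VirtualMachineError (e.g. OutOfMemoryError),
// ThreadDeath, LinkageError, InterruptedException and ControlThrowable as fatal.
@tailrec
def containsFatalError(t: Throwable, depth: Int = 5): Boolean = {
  if (t == null || depth <= 0) false
  else if (!NonFatal(t)) true
  else containsFatalError(t.getCause, depth - 1)
}
{code}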



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33587) Kill the executor on nested fatal errors

2020-11-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33587:


Assignee: (was: Apache Spark)

> Kill the executor on nested fatal errors
> 
>
> Key: SPARK-33587
> URL: https://issues.apache.org/jira/browse/SPARK-33587
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> Currently we kill the executor when hitting a fatal error. However, if the 
> fatal error is wrapped by another exception, such as:
> - java.util.concurrent.ExecutionException, 
> com.google.common.util.concurrent.UncheckedExecutionException, or 
> com.google.common.util.concurrent.ExecutionError when using Guava caches and 
> Java thread pools;
> - SparkException thrown from this line: 
> https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231
> we will still keep the executor running. Fatal errors are usually 
> unrecoverable (such as OutOfMemoryError), and some components may be in a 
> broken state after hitting one. Hence, it's better to detect such nested 
> fatal errors as well and kill the executor, so that we can rely on Spark's 
> fault tolerance to recover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33587) Kill the executor on nested fatal errors

2020-11-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240029#comment-17240029
 ] 

Apache Spark commented on SPARK-33587:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/30528

> Kill the executor on nested fatal errors
> 
>
> Key: SPARK-33587
> URL: https://issues.apache.org/jira/browse/SPARK-33587
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> Currently we kill the executor when hitting a fatal error. However, if the 
> fatal error is wrapped by another exception, such as:
> - java.util.concurrent.ExecutionException, 
> com.google.common.util.concurrent.UncheckedExecutionException, or 
> com.google.common.util.concurrent.ExecutionError when using Guava caches and 
> Java thread pools;
> - SparkException thrown from this line: 
> https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231
> we will still keep the executor running. Fatal errors are usually 
> unrecoverable (such as OutOfMemoryError), and some components may be in a 
> broken state after hitting one. Hence, it's better to detect such nested 
> fatal errors as well and kill the executor, so that we can rely on Spark's 
> fault tolerance to recover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33587) Kill the executor on nested fatal errors

2020-11-28 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-33587:


 Summary: Kill the executor on nested fatal errors
 Key: SPARK-33587
 URL: https://issues.apache.org/jira/browse/SPARK-33587
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Shixiong Zhu


Currently we kill the executor when hitting a fatal error. However, if the 
fatal error is wrapped by another exception, such as:
- java.util.concurrent.ExecutionException, 
com.google.common.util.concurrent.UncheckedExecutionException, or 
com.google.common.util.concurrent.ExecutionError when using Guava caches and 
Java thread pools;
- SparkException thrown from this line: 
https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231

we will still keep the executor running. Fatal errors are usually unrecoverable 
(such as OutOfMemoryError), and some components may be in a broken state after 
hitting one. Hence, it's better to detect such nested fatal errors as well and 
kill the executor, so that we can rely on Spark's fault tolerance to recover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33586) BisectingKMeansModel save and load implementation in pyspark

2020-11-28 Thread Iman Kermani (Jira)
Iman Kermani created SPARK-33586:


 Summary: BisectingKMeansModel save and load implementation in 
pyspark
 Key: SPARK-33586
 URL: https://issues.apache.org/jira/browse/SPARK-33586
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 3.0.1
 Environment: Spark 3.0.1 with Hadoop 2.7
Reporter: Iman Kermani


BisectingKMeansModel save and load functions are implemented in Java and Scala.

It would be nice if they were implemented in PySpark too.
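For reference, a spark-shell style sketch of the existing Scala API that this ticket asks to mirror 
in pyspark.mllib (the data and the save path below are made up for the example):

{code:scala}
import org.apache.spark.mllib.clustering.{BisectingKMeans, BisectingKMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Train a small model, persist it, and load it back -- the round trip that is
// currently only reachable from Scala/Java.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))
val model = new BisectingKMeans().setK(2).run(data)
model.save(sc, "/tmp/bisecting-kmeans-model")   // path is arbitrary for the example
val reloaded = BisectingKMeansModel.load(sc, "/tmp/bisecting-kmeans-model")
{code}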

Thanks in advance



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite

2020-11-28 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239994#comment-17239994
 ] 

Kousuke Saruta commented on SPARK-33570:


Issue resolved by pull request 30515

https://github.com/apache/spark/pull/30515

> Set the proper version of gssapi plugin automatically for 
> MariaDBKrbIntegrationSuite
> 
>
> Key: SPARK-33570
> URL: https://issues.apache.org/jira/browse/SPARK-33570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server 
> is currently set to 10.5.5 in mariadb_docker_entrypoint.sh, but that version is 
> no longer available in the official apt repository, so 
> MariaDBKrbIntegrationSuite no longer passes.
> It seems that only the most recent three versions are available, and they are 
> 10.5.6, 10.5.7 and 10.5.8 for now.
> Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months), 
> so I don't think it's a good idea to pin mariadb-plugin-gssapi-server to a 
> specific version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite

2020-11-28 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-33570.

Fix Version/s: 3.1.0
   Resolution: Fixed

> Set the proper version of gssapi plugin automatically for 
> MariaDBKrbIntegrationSuite
> 
>
> Key: SPARK-33570
> URL: https://issues.apache.org/jira/browse/SPARK-33570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server 
> is currently set to 10.5.5 in mariadb_docker_entrypoint.sh, but that version is 
> no longer available in the official apt repository, so 
> MariaDBKrbIntegrationSuite no longer passes.
> It seems that only the most recent three versions are available, and they are 
> 10.5.6, 10.5.7 and 10.5.8 for now.
> Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months), 
> so I don't think it's a good idea to pin mariadb-plugin-gssapi-server to a 
> specific version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite

2020-11-28 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-33570:
---
Summary: Set the proper version of gssapi plugin automatically for 
MariaDBKrbIntegrationSuite  (was: Set the proper version of gssapi plugin 
automatically for MariaDBKrbIntegrationsuite)

> Set the proper version of gssapi plugin automatically for 
> MariaDBKrbIntegrationSuite
> 
>
> Key: SPARK-33570
> URL: https://issues.apache.org/jira/browse/SPARK-33570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server 
> is currently set to 10.5.5 in mariadb_docker_entrypoint.sh, but that version is 
> no longer available in the official apt repository, so 
> MariaDBKrbIntegrationSuite no longer passes.
> It seems that only the most recent three versions are available, and they are 
> 10.5.6, 10.5.7 and 10.5.8 for now.
> Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months), 
> so I don't think it's a good idea to pin mariadb-plugin-gssapi-server to a 
> specific version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column

2020-11-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33585:


Assignee: (was: Apache Spark)

> The comment for SQLContext.tables() doesn't mention the `database` column
> -
>
> Key: SPARK-33585
> URL: https://issues.apache.org/jira/browse/SPARK-33585
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The comment says: "The returned DataFrame has two columns, tableName and 
> isTemporary":
> https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664
> but actually the dataframe has 3 columns:
> {code:scala}
> scala> spark.range(10).createOrReplaceTempView("view1")
> scala> val tables = spark.sqlContext.tables()
> tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string 
> ... 1 more field]
> scala> tables.printSchema
> root
>  |-- database: string (nullable = false)
>  |-- tableName: string (nullable = false)
>  |-- isTemporary: boolean (nullable = false)
> scala> tables.show
> +--------+---------+-----------+
> |database|tableName|isTemporary|
> +--------+---------+-----------+
> | default|       t1|      false|
> | default|       t2|      false|
> | default|      ymd|      false|
> |        |    view1|       true|
> +--------+---------+-----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column

2020-11-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239991#comment-17239991
 ] 

Apache Spark commented on SPARK-33585:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30526

> The comment for SQLContext.tables() doesn't mention the `database` column
> -
>
> Key: SPARK-33585
> URL: https://issues.apache.org/jira/browse/SPARK-33585
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The comment says: "The returned DataFrame has two columns, tableName and 
> isTemporary":
> https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664
> but actually the DataFrame has three columns:
> {code:scala}
> scala> spark.range(10).createOrReplaceTempView("view1")
> scala> val tables = spark.sqlContext.tables()
> tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string 
> ... 1 more field]
> scala> tables.printSchema
> root
>  |-- database: string (nullable = false)
>  |-- tableName: string (nullable = false)
>  |-- isTemporary: boolean (nullable = false)
> scala> tables.show
> +--------+---------+-----------+
> |database|tableName|isTemporary|
> +--------+---------+-----------+
> | default|       t1|      false|
> | default|       t2|      false|
> | default|      ymd|      false|
> |        |    view1|       true|
> +--------+---------+-----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column

2020-11-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33585:


Assignee: Apache Spark

> The comment for SQLContext.tables() doesn't mention the `database` column
> -
>
> Key: SPARK-33585
> URL: https://issues.apache.org/jira/browse/SPARK-33585
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The comment says: "The returned DataFrame has two columns, tableName and 
> isTemporary":
> https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664
> but actually the DataFrame has three columns:
> {code:scala}
> scala> spark.range(10).createOrReplaceTempView("view1")
> scala> val tables = spark.sqlContext.tables()
> tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string 
> ... 1 more field]
> scala> tables.printSchema
> root
>  |-- database: string (nullable = false)
>  |-- tableName: string (nullable = false)
>  |-- isTemporary: boolean (nullable = false)
> scala> tables.show
> +--------+---------+-----------+
> |database|tableName|isTemporary|
> +--------+---------+-----------+
> | default|       t1|      false|
> | default|       t2|      false|
> | default|      ymd|      false|
> |        |    view1|       true|
> +--------+---------+-----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column

2020-11-28 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33585:
--

 Summary: The comment for SQLContext.tables() doesn't mention the 
`database` column
 Key: SPARK-33585
 URL: https://issues.apache.org/jira/browse/SPARK-33585
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.1, 2.4.7, 3.1.0
Reporter: Maxim Gekk


The comment says: "The returned DataFrame has two columns, tableName and 
isTemporary":
https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664

but actually the DataFrame has three columns:
{code:scala}
scala> spark.range(10).createOrReplaceTempView("view1")
scala> val tables = spark.sqlContext.tables()
tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string 
... 1 more field]

scala> tables.printSchema
root
 |-- database: string (nullable = false)
 |-- tableName: string (nullable = false)
 |-- isTemporary: boolean (nullable = false)


scala> tables.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|       t1|      false|
| default|       t2|      false|
| default|      ymd|      false|
|        |    view1|       true|
+--------+---------+-----------+
{code}
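
For reference, a small usage sketch (my own example, not from the open pull 
request) that relies on the `database` column the current comment doesn't 
mention, given the three-column schema shown above:

{code:scala}
// In spark-shell: list permanent tables of the `default` database using the
// database and isTemporary columns in addition to tableName.
import spark.implicits._

val defaultDbTables = spark.sqlContext.tables()
  .where($"database" === "default" && !$"isTemporary")
  .select("tableName")

defaultDbTables.show()
{code}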



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33584) Partition predicate pushdown into Hive metastore support cast string type to date type

2020-11-28 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239935#comment-17239935
 ] 

Yuming Wang commented on SPARK-33584:
-

This change should come after SPARK-33581. I have prepared the PR: 
https://github.com/wangyum/spark/tree/SPARK-33584

> Partition predicate pushdown into Hive metastore support cast string type to 
> date type
> --
>
> Key: SPARK-33584
> URL: https://issues.apache.org/jira/browse/SPARK-33584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("create table t1(id string) partitioned by (part string) stored as 
> parquet")
> spark.sql("insert into t1 values('1', '2019-01-01')")
> spark.sql("insert into t1 values('2', '2019-01-02')")
> spark.sql("select * from t1 where  part = date '2019-01-01' ").show
> {code}
> We can pushdown {{cast(part as date) = date '2019-01-01' }} to Hive metastore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33584) Partition predicate pushdown into Hive metastore support cast string type to date type

2020-11-28 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33584:
---

 Summary: Partition predicate pushdown into Hive metastore support 
cast string type to date type
 Key: SPARK-33584
 URL: https://issues.apache.org/jira/browse/SPARK-33584
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang
Assignee: Yuming Wang


{code:scala}
spark.sql("create table t1(id string) partitioned by (part string) stored as 
parquet")
spark.sql("insert into t1 values('1', '2019-01-01')")
spark.sql("insert into t1 values('2', '2019-01-02')")
spark.sql("select * from t1 where  part = date '2019-01-01' ").show
{code}

We can pushdown {{cast(part as date) = date '2019-01-01' }} to Hive metastore.
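
As a rough illustration of the idea (a hypothetical helper, not Spark's 
actual implementation), the date-literal comparison can be rendered as a Hive 
metastore partition filter string, since the yyyy-MM-dd form of the string 
partition value preserves ordering:

{code:scala}
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Render `partCol = <date literal>` as a metastore filter string over a
// string-typed partition column.
def toMetastoreFilter(partCol: String, value: LocalDate): String = {
  val formatted = value.format(DateTimeFormatter.ISO_LOCAL_DATE) // yyyy-MM-dd
  val quoted = "\"" + formatted + "\""  // metastore filters quote string literals
  s"$partCol = $quoted"
}

toMetastoreFilter("part", LocalDate.parse("2019-01-01"))
// returns: part = "2019-01-01"
{code}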



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33582) Partition predicate pushdown into Hive metastore support not-equals

2020-11-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33582:

Summary: Partition predicate pushdown into Hive metastore support 
not-equals  (was: Hive partition pruning support not-equals)

> Partition predicate pushdown into Hive metastore support not-equals
> ---
>
> Key: SPARK-33582
> URL: https://issues.apache.org/jira/browse/SPARK-33582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207
> https://issues.apache.org/jira/browse/HIVE-2702
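
For illustration only (reusing the t1 table from SPARK-33584's description, 
not code from the prepared branch), a query whose partition predicate is a 
not-equals comparison that this change would allow to be pushed to the 
metastore instead of listing every partition client-side:

{code:scala}
// Both spellings of not-equals carry the same partition predicate; with this
// change the filter on `part` could be sent to the Hive metastore.
spark.sql("select * from t1 where part != '2019-01-02'").show()
spark.sql("select * from t1 where part <> '2019-01-02'").explain(true)
{code}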



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33583) Query on large dataset with forEachPartitionAsync performance needs to improve

2020-11-28 Thread Miron (Jira)
Miron created SPARK-33583:
-

 Summary: Query on large dataset with forEachPartitionAsync 
performance needs to improve
 Key: SPARK-33583
 URL: https://issues.apache.org/jira/browse/SPARK-33583
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
 Environment: Spark 2.4.4

Scala 2.11.10
Reporter: Miron


Repro steps:

Load 300 GB of data from a JSON file into a table.

Note that this table has a field ID, and rows sharing an ID form reasonably 
well sized sets, some 50,000 rows per set.

Issue a query against this table that returns a DataFrame instance.

Harvest the rows with df.rdd.foreachPartitionAsync.

Place a logging line as the first statement of the outer lambda expression, 
the one iterating over partitions.

Let's say it will read "Line #1 (some timestamp with milliseconds)".

Place a logging line into the nested lambda expression that reads rows, such 
that it runs only when the first row is accessed.

Let's say it will read "Line #2 (some timestamp with milliseconds)".

Once the query completes, take the time difference in milliseconds between 
the timestamps logged by line #1 and line #2 above.

It would be fairly reasonable to assume that this difference should be as 
close to 0 as possible. In reality the difference is more than 1 second, 
usually more than 2.

This really hurts query performance.
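
A minimal sketch of the instrumentation described above (the SparkSession 
name `spark` and the table name `events` are placeholders, not from the 
original report); the gap between the two printed timestamps approximates the 
delay between entering a partition and reading its first row:

{code:scala}
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Query over the large table, returning a DataFrame instance.
val df = spark.sql("select * from events where id = 42")

// Asynchronous per-partition row harvesting with the two timing log lines.
val action = df.rdd.foreachPartitionAsync { rows =>
  println(s"Line #1 (${System.currentTimeMillis()})")      // entered the partition
  var first = true
  rows.foreach { row =>
    if (first) {
      println(s"Line #2 (${System.currentTimeMillis()})")  // first row materialized
      first = false
    }
    // ... per-row processing goes here ...
  }
}

Await.result(action, Duration.Inf)  // wait for the async job to finish
{code}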



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33583) Query on large dataset with foreachPartitionAsync performance needs to improve

2020-11-28 Thread Miron (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miron updated SPARK-33583:
--
Summary: Query on large dataset with foreachPartitionAsync performance 
needs to improve  (was: Query on large dataset with forEachPartitionAsync 
performance needs to improve)

> Query on large dataset with foreachPartitionAsync performance needs to improve
> --
>
> Key: SPARK-33583
> URL: https://issues.apache.org/jira/browse/SPARK-33583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4
> Scala 2.11.10
>Reporter: Miron
>Priority: Major
>
> Repro steps:
> Load 300 GB of data from a JSON file into a table.
> Note that this table has a field ID, and rows sharing an ID form reasonably 
> well sized sets, some 50,000 rows per set.
> Issue a query against this table that returns a DataFrame instance.
> Harvest the rows with df.rdd.foreachPartitionAsync.
> Place a logging line as the first statement of the outer lambda expression, 
> the one iterating over partitions.
> Let's say it will read "Line #1 (some timestamp with milliseconds)".
> Place a logging line into the nested lambda expression that reads rows, such 
> that it runs only when the first row is accessed.
> Let's say it will read "Line #2 (some timestamp with milliseconds)".
> Once the query completes, take the time difference in milliseconds between 
> the timestamps logged by line #1 and line #2 above.
> It would be fairly reasonable to assume that this difference should be as 
> close to 0 as possible. In reality the difference is more than 1 second, 
> usually more than 2.
> This really hurts query performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org