[jira] [Issue Comment Deleted] (SPARK-33584) Partition predicate pushdown into Hive metastore support cast string type to date type
[ https://issues.apache.org/jira/browse/SPARK-33584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33584: Comment: was deleted (was: This change should after SPARK-33581, I have prepared the pr: https://github.com/wangyum/spark/tree/SPARK-33584) > Partition predicate pushdown into Hive metastore support cast string type to > date type > -- > > Key: SPARK-33584 > URL: https://issues.apache.org/jira/browse/SPARK-33584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > {code:scala} > spark.sql("create table t1(id string) partitioned by (part string) stored as > parquet") > spark.sql("insert into t1 values('1', '2019-01-01')") > spark.sql("insert into t1 values('2', '2019-01-02')") > spark.sql("select * from t1 where part = date '2019-01-01' ").show > {code} > We can pushdown {{cast(part as date) = date '2019-01-01' }} to Hive metastore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-33582) Partition predicate pushdown into Hive metastore support not-equals
[ https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33582: Comment: was deleted (was: This change should after SPARK-33581, I have prepared the pr: https://github.com/wangyum/spark/tree/SPARK-33582) > Partition predicate pushdown into Hive metastore support not-equals > --- > > Key: SPARK-33582 > URL: https://issues.apache.org/jira/browse/SPARK-33582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207 > https://issues.apache.org/jira/browse/HIVE-2702 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33584) Partition predicate pushdown into Hive metastore support cast string type to date type
[ https://issues.apache.org/jira/browse/SPARK-33584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240174#comment-17240174 ] Apache Spark commented on SPARK-33584: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30535 > Partition predicate pushdown into Hive metastore support cast string type to > date type > -- > > Key: SPARK-33584 > URL: https://issues.apache.org/jira/browse/SPARK-33584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > {code:scala} > spark.sql("create table t1(id string) partitioned by (part string) stored as > parquet") > spark.sql("insert into t1 values('1', '2019-01-01')") > spark.sql("insert into t1 values('2', '2019-01-02')") > spark.sql("select * from t1 where part = date '2019-01-01' ").show > {code} > We can pushdown {{cast(part as date) = date '2019-01-01' }} to Hive metastore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33582) Partition predicate pushdown into Hive metastore support not-equals
[ https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240173#comment-17240173 ] Apache Spark commented on SPARK-33582: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30534 > Partition predicate pushdown into Hive metastore support not-equals > --- > > Key: SPARK-33582 > URL: https://issues.apache.org/jira/browse/SPARK-33582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207 > https://issues.apache.org/jira/browse/HIVE-2702 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
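A minimal sketch of the kind of predicate this sub-task targets. The table and data are borrowed from the SPARK-33584 example elsewhere in this digest and are illustrative only; whether such a filter is actually pruned by the Hive metastore depends on this change.

{code:scala}
// A not-equals filter on a string partition column. Without this change the
// predicate is not sent to the Hive metastore as a partition filter, so all
// partitions are listed; with it, only matching partitions should be fetched.
spark.sql("create table t1(id string) partitioned by (part string) stored as parquet")
spark.sql("insert into t1 values('1', '2019-01-01')")
spark.sql("insert into t1 values('2', '2019-01-02')")
spark.sql("select * from t1 where part <> '2019-01-01'").show()
{code}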
[jira] [Commented] (SPARK-33581) Refactor HivePartitionFilteringSuite
[ https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240171#comment-17240171 ] Yuming Wang commented on SPARK-33581: - Issue resolved by pull request 30525 https://github.com/apache/spark/pull/30525 > Refactor HivePartitionFilteringSuite > > > Key: SPARK-33581 > URL: https://issues.apache.org/jira/browse/SPARK-33581 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.1.0 > > > Refactor HivePartitionFilteringSuite, to make it easy to maintain. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33581) Refactor HivePartitionFilteringSuite
[ https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-33581. - Fix Version/s: 3.1.0 Resolution: Fixed User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30525 > Refactor HivePartitionFilteringSuite > > > Key: SPARK-33581 > URL: https://issues.apache.org/jira/browse/SPARK-33581 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.1.0 > > > Refactor HivePartitionFilteringSuite, to make it easy to maintain. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-33581) Refactor HivePartitionFilteringSuite
[ https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33581: Comment: was deleted (was: User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30525) > Refactor HivePartitionFilteringSuite > > > Key: SPARK-33581 > URL: https://issues.apache.org/jira/browse/SPARK-33581 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.1.0 > > > Refactor HivePartitionFilteringSuite, to make it easy to maintain. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33581) Refactor HivePartitionFilteringSuite
[ https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-33581: --- Assignee: Yuming Wang > Refactor HivePartitionFilteringSuite > > > Key: SPARK-33581 > URL: https://issues.apache.org/jira/browse/SPARK-33581 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Refactor HivePartitionFilteringSuite, to make it easy to maintain. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33380) Incorrect output from example script pi.py
[ https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33380: Assignee: (was: Apache Spark) > Incorrect output from example script pi.py > -- > > Key: SPARK-33380 > URL: https://issues.apache.org/jira/browse/SPARK-33380 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.4.6 >Reporter: Milind V Damle >Priority: Minor > > > I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 > worker nodes. To test the installation, I ran the > $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. > Three runs produced the following output: > > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.149880 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.137760 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.155640 > > I noted that the computed value of Pi varies with each run. > Next, I ran the same script 3 more times with a higher number of partitions > (16). The following output was noted. > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.141100 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.137720 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.145660 > > Again, I noted that the computed value of Pi varies with each run. > > IMO, there are 2 issues with this example script: > 1. The output (value of pi) is non-deterministic because the script uses > random.random(). > 2. Specifying the number of partitions (accepted as a command-line argument) > has no observable positive impact on the accuracy or precision. > > It may be argued that the intent of these examples scripts is simply to > demonstrate how to use Spark as well as offer a means to quickly verify an > installation. However, we can achieve that objective without compromising on > the accuracy or determinism of the computed value. Unless the user examines > the script and understands that use of random.random() (to generate random > points within the top right quadrant of the circle) as the reason behind the > non-determinism, it seems confusing at first that the value varies per run > and also that it is inaccurate. Someone may (incorrectly) infer that as a > limitation of the framework! > > To mitigate this, I wrote an alternate version to compute pi using a partial > sum of terms from an infinite series. This script is both deterministic and > can produce more accurate output if the user configures it to use more terms. > To me, that behavior feels intuitive and logical. I will be happy to share it > if it is appropriate. > > Best regards, > Milind > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33380) Incorrect output from example script pi.py
[ https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240169#comment-17240169 ] Apache Spark commented on SPARK-33380: -- User 'milindvdamle' has created a pull request for this issue: https://github.com/apache/spark/pull/30533 > Incorrect output from example script pi.py > -- > > Key: SPARK-33380 > URL: https://issues.apache.org/jira/browse/SPARK-33380 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.4.6 >Reporter: Milind V Damle >Priority: Minor > > > I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 > worker nodes. To test the installation, I ran the > $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. > Three runs produced the following output: > > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.149880 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.137760 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.155640 > > I noted that the computed value of Pi varies with each run. > Next, I ran the same script 3 more times with a higher number of partitions > (16). The following output was noted. > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.141100 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.137720 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.145660 > > Again, I noted that the computed value of Pi varies with each run. > > IMO, there are 2 issues with this example script: > 1. The output (value of pi) is non-deterministic because the script uses > random.random(). > 2. Specifying the number of partitions (accepted as a command-line argument) > has no observable positive impact on the accuracy or precision. > > It may be argued that the intent of these examples scripts is simply to > demonstrate how to use Spark as well as offer a means to quickly verify an > installation. However, we can achieve that objective without compromising on > the accuracy or determinism of the computed value. Unless the user examines > the script and understands that use of random.random() (to generate random > points within the top right quadrant of the circle) as the reason behind the > non-determinism, it seems confusing at first that the value varies per run > and also that it is inaccurate. Someone may (incorrectly) infer that as a > limitation of the framework! > > To mitigate this, I wrote an alternate version to compute pi using a partial > sum of terms from an infinite series. This script is both deterministic and > can produce more accurate output if the user configures it to use more terms. > To me, that behavior feels intuitive and logical. I will be happy to share it > if it is appropriate. > > Best regards, > Milind > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33380) Incorrect output from example script pi.py
[ https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33380: Assignee: Apache Spark > Incorrect output from example script pi.py > -- > > Key: SPARK-33380 > URL: https://issues.apache.org/jira/browse/SPARK-33380 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.4.6 >Reporter: Milind V Damle >Assignee: Apache Spark >Priority: Minor > > > I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 > worker nodes. To test the installation, I ran the > $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. > Three runs produced the following output: > > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.149880 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.137760 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py > Pi is roughly 3.155640 > > I noted that the computed value of Pi varies with each run. > Next, I ran the same script 3 more times with a higher number of partitions > (16). The following output was noted. > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.141100 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.137720 > m4-nn:~:spark-submit --master > spark://[10.0.0.20:7077|http://10.0.0.20:7077/] > /usr/local/spark/examples/src/main/python/pi.py 16 > Pi is roughly 3.145660 > > Again, I noted that the computed value of Pi varies with each run. > > IMO, there are 2 issues with this example script: > 1. The output (value of pi) is non-deterministic because the script uses > random.random(). > 2. Specifying the number of partitions (accepted as a command-line argument) > has no observable positive impact on the accuracy or precision. > > It may be argued that the intent of these examples scripts is simply to > demonstrate how to use Spark as well as offer a means to quickly verify an > installation. However, we can achieve that objective without compromising on > the accuracy or determinism of the computed value. Unless the user examines > the script and understands that use of random.random() (to generate random > points within the top right quadrant of the circle) as the reason behind the > non-determinism, it seems confusing at first that the value varies per run > and also that it is inaccurate. Someone may (incorrectly) infer that as a > limitation of the framework! > > To mitigate this, I wrote an alternate version to compute pi using a partial > sum of terms from an infinite series. This script is both deterministic and > can produce more accurate output if the user configures it to use more terms. > To me, that behavior feels intuitive and logical. I will be happy to share it > if it is appropriate. > > Best regards, > Milind > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
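A minimal sketch of the deterministic alternative the reporter describes — a partial sum of an infinite series spread over the requested number of partitions. The Leibniz series used here is an assumption; the reporter's actual script is not shown.

{code:scala}
// Deterministic pi via a partial sum of the Leibniz series:
//   pi = 4 * sum_{k=0}^{n-1} (-1)^k / (2k + 1)
// The result is identical on every run, and more terms give more accuracy.
val n = 100000000L      // number of terms (an illustrative default)
val partitions = 16     // taken from the command line in the real script
val pi = 4.0 * spark.sparkContext
  .range(0L, n, numSlices = partitions)
  .map(k => (if (k % 2 == 0) 1.0 else -1.0) / (2.0 * k + 1.0))
  .sum()
println(f"Pi is roughly $pi%f")
{code}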
[jira] [Resolved] (SPARK-33564) Prometheus metrics for Master and Worker isn't working
[ https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33564. --- Resolution: Invalid > Prometheus metrics for Master and Worker isn't working > --- > > Key: SPARK-33564 > URL: https://issues.apache.org/jira/browse/SPARK-33564 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 3.0.0, 3.0.1 >Reporter: Paulo Roberto de Oliveira Castro >Priority: Major > Labels: Metrics, metrics, prometheus > > Following the [PR|https://github.com/apache/spark/pull/25769] that introduced > the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}} > (also tested with 3.0.0), uncompressed the tgz and created a file called > {{metrics.properties}} adding this content: > {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}} > {{*.sink.prometheusServlet.path=/metrics/prometheus}} > master.sink.prometheusServlet.path=/metrics/master/prometheus > applications.sink.prometheusServlet.path=/metrics/applications/prometheus > {quote} > Then I ran: > {quote}{{$ sbin/start-master.sh}} > {{$ sbin/start-slave.sh spark://`hostname`:7077}} > {{$ bin/spark-shell --master spark://`hostname`:7077 > --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}} > {quote} > {{The Spark shell opens without problems:}} > {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable}} > {{Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties}} > {{Setting default log level to "WARN".}} > {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).}} > {{Spark context Web UI available at > [http://192.168.0.6:4040|http://192.168.0.6:4040/]}} > {{Spark context available as 'sc' (master = > spark://MacBook-Pro-de-Paulo-2.local:7077, app id = > app-20201125173618-0002).}} > {{Spark session available as 'spark'.}} > {{Welcome to}} > {{ __}} > {{ / __/_ _/ /__}} > {{ _\ \/ _ \/ _ `/ __/ '_/}} > {{ /___/ .__/_,_/_/ /_/_\ version 3.0.0}} > {{ /_/}} > {{ }} > {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}} > {{Type in expressions to have them evaluated.}} > {{Type :help for more information. 
}} > {{scala>}} > {quote} > {{And when I try to fetch prometheus metrics for driver, everything works > fine:}} > {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5 > metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"} 0 > metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"} 0 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"} 732 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"} 732 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"} 0 > {quote} > *The problem appears when I try accessing master metrics*, and I get the > following problem: > {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}} > [HTML markup of the "Spark Master at spark://MacBook-Pro-de-Paulo-2.local:7077" web UI page (Spark 3.0.0) is returned instead of metrics; tag fragments omitted] > ... > {quote} > Instead of the metrics I'm getting an HTML page. The same happens for all of > those here: > {quote}{{$ curl -s [http://localhost:8080/metrics/applications/prometheus/]}} > {{$ curl -s [http://localhost:8081/metrics/prometheus/]}} > {quote} > Instead, *I expected metrics in prometheus metrics*. All related
[jira] [Commented] (SPARK-33564) Prometheus metrics for Master and Worker isn't working
[ https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240098#comment-17240098 ] Dongjoon Hyun commented on SPARK-33564: --- The following is the correct way on Apache Spark 3.1. I'd like to recommend you two things: (1) Use `/` at the end of URL, (2) Have `conf/metrics.properties` before running `sbin/start-master.sh`. *Apache Spark 3.0.1* {code} $ cat conf/metrics.properties *.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet *.sink.prometheusServlet.path=/metrics/prometheus master.sink.prometheusServlet.path=/metrics/master/prometheus applications.sink.prometheusServlet.path=/metrics/applications/prometheus $ sbin/start-master.sh starting org.apache.spark.deploy.master.Master, logging to /Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/logs/spark-dongjoon-org.apache.spark.deploy.master.Master-1-AppleMBP19.local.out $ curl -s http://localhost:8080/metrics/master/prometheus/ | head -n3 metrics_master_aliveWorkers_Number{type="gauges"} 0 metrics_master_aliveWorkers_Value{type="gauges"} 0 metrics_master_apps_Number{type="gauges"} 0 {code} > Prometheus metrics for Master and Worker isn't working > --- > > Key: SPARK-33564 > URL: https://issues.apache.org/jira/browse/SPARK-33564 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 3.0.0, 3.0.1 >Reporter: Paulo Roberto de Oliveira Castro >Priority: Major > Labels: Metrics, metrics, prometheus > > Following the [PR|https://github.com/apache/spark/pull/25769] that introduced > the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}} > (also tested with 3.0.0), uncompressed the tgz and created a file called > {{metrics.properties}} adding this content: > {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}} > {{*.sink.prometheusServlet.path=/metrics/prometheus}} > master.sink.prometheusServlet.path=/metrics/master/prometheus > applications.sink.prometheusServlet.path=/metrics/applications/prometheus > {quote} > Then I ran: > {quote}{{$ sbin/start-master.sh}} > {{$ sbin/start-slave.sh spark://`hostname`:7077}} > {{$ bin/spark-shell --master spark://`hostname`:7077 > --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}} > {quote} > {{The Spark shell opens without problems:}} > {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable}} > {{Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties}} > {{Setting default log level to "WARN".}} > {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).}} > {{Spark context Web UI available at > [http://192.168.0.6:4040|http://192.168.0.6:4040/]}} > {{Spark context available as 'sc' (master = > spark://MacBook-Pro-de-Paulo-2.local:7077, app id = > app-20201125173618-0002).}} > {{Spark session available as 'spark'.}} > {{Welcome to}} > {{ __}} > {{ / __/_ _/ /__}} > {{ _\ \/ _ \/ _ `/ __/ '_/}} > {{ /___/ .__/_,_/_/ /_/_\ version 3.0.0}} > {{ /_/}} > {{ }} > {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}} > {{Type in expressions to have them evaluated.}} > {{Type :help for more information. 
}} > {{scala>}} > {quote} > {{And when I try to fetch prometheus metrics for driver, everything works > fine:}} > {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5 > metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"} 0 > metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"} 0 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"} 732 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"} 732 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"} 0 > {quote} > *The problem appears when I try accessing master metrics*, and I get the > following problem: > {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}} > [HTML markup of the Spark Master web UI page is returned instead of metrics; tag fragments omitted, quote truncated]
[jira] [Commented] (SPARK-33564) Prometheus metrics for Master and Worker isn't working
[ https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240096#comment-17240096 ] Dongjoon Hyun commented on SPARK-33564: --- [~paulo.castro]. It looks like you missed the ending `/`, doesn't it? > Prometheus metrics for Master and Worker isn't working > --- > > Key: SPARK-33564 > URL: https://issues.apache.org/jira/browse/SPARK-33564 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 3.0.0, 3.0.1 >Reporter: Paulo Roberto de Oliveira Castro >Priority: Major > Labels: Metrics, metrics, prometheus > > Following the [PR|https://github.com/apache/spark/pull/25769] that introduced > the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}} > (also tested with 3.0.0), uncompressed the tgz and created a file called > {{metrics.properties}} adding this content: > {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}} > {{*.sink.prometheusServlet.path=/metrics/prometheus}} > master.sink.prometheusServlet.path=/metrics/master/prometheus > applications.sink.prometheusServlet.path=/metrics/applications/prometheus > {quote} > Then I ran: > {quote}{{$ sbin/start-master.sh}} > {{$ sbin/start-slave.sh spark://`hostname`:7077}} > {{$ bin/spark-shell --master spark://`hostname`:7077 > --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}} > {quote} > {{The Spark shell opens without problems:}} > {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable}} > {{Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties}} > {{Setting default log level to "WARN".}} > {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).}} > {{Spark context Web UI available at > [http://192.168.0.6:4040|http://192.168.0.6:4040/]}} > {{Spark context available as 'sc' (master = > spark://MacBook-Pro-de-Paulo-2.local:7077, app id = > app-20201125173618-0002).}} > {{Spark session available as 'spark'.}} > {{Welcome to}} > {{ __}} > {{ / __/_ _/ /__}} > {{ _\ \/ _ \/ _ `/ __/ '_/}} > {{ /___/ .__/_,_/_/ /_/_\ version 3.0.0}} > {{ /_/}} > {{ }} > {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}} > {{Type in expressions to have them evaluated.}} > {{Type :help for more information. 
}} > {{scala>}} > {quote} > {{And when I try to fetch prometheus metrics for driver, everything works > fine:}} > {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5 > metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"} 0 > metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"} 0 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"} 732 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"} 732 > metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"} 0 > {quote} > *The problem appears when I try accessing master metrics*, and I get the > following problem: > {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}} > [HTML markup of the "Spark Master at spark://MacBook-Pro-de-Paulo-2.local:7077" web UI page (Spark 3.0.0) is returned instead of metrics; tag fragments omitted] > ... > {quote} > Instead of the metrics I'm getting an HTML page. The same happens for all of > those here: > {quote}{{$ curl -s [http://localhost:8080/metrics/applications/prometheus/]}} > {{$ curl -s [http://localhost:8081/metrics/prometheus/]}}
[jira] [Updated] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
[ https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33585: --- Affects Version/s: (was: 2.4.7) (was: 3.0.1) 3.0.2 2.4.8 > The comment for SQLContext.tables() doesn't mention the `database` column > - > > Key: SPARK-33585 > URL: https://issues.apache.org/jira/browse/SPARK-33585 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > The comment says: "The returned DataFrame has two columns, tableName and > isTemporary": > https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 > but actually the dataframe has 3 columns: > {code:scala} > scala> spark.range(10).createOrReplaceTempView("view1") > scala> val tables = spark.sqlContext.tables() > tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string > ... 1 more field] > scala> tables.printSchema > root > |-- database: string (nullable = false) > |-- tableName: string (nullable = false) > |-- isTemporary: boolean (nullable = false) > scala> tables.show > ++-+---+ > |database|tableName|isTemporary| > ++-+---+ > | default| t1| false| > | default| t2| false| > | default| ymd| false| > ||view1| true| > ++-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33580) resolveDependencyPaths should use classifier attribute of artifact
[ https://issues.apache.org/jira/browse/SPARK-33580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33580. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30524 [https://github.com/apache/spark/pull/30524] > resolveDependencyPaths should use classifier attribute of artifact > -- > > Key: SPARK-33580 > URL: https://issues.apache.org/jira/browse/SPARK-33580 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > `resolveDependencyPaths` now takes artifact type to decide to add "-tests" > postfix. However, the path pattern of ivy in `resolveMavenCoordinates` is > "[organization]_[artifact]-[revision](-[classifier]).[ext]". We should use > classifier instead of type to construct file path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
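A small illustration of the distinction, assuming only the ivy path pattern quoted above; the function name is hypothetical and is not the actual Spark code.

{code:scala}
// Ivy lays files out as [organization]_[artifact]-[revision](-[classifier]).[ext],
// so an optional suffix such as "-tests" comes from the classifier, not from the
// artifact type.
def dependencyFileName(org: String, artifact: String, revision: String,
    classifier: Option[String], ext: String = "jar"): String = {
  val suffix = classifier.map("-" + _).getOrElse("")
  s"${org}_$artifact-$revision$suffix.$ext"
}

dependencyFileName("org.example", "lib", "1.0.0", None)          // org.example_lib-1.0.0.jar
dependencyFileName("org.example", "lib", "1.0.0", Some("tests")) // org.example_lib-1.0.0-tests.jar
{code}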
[jira] [Assigned] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
[ https://issues.apache.org/jira/browse/SPARK-33588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33588: Assignee: Apache Spark > Partition spec in SHOW TABLE EXTENDED doesn't respect > `spark.sql.caseSensitive` > --- > > Key: SPARK-33588 > URL: https://issues.apache.org/jira/browse/SPARK-33588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > > USING parquet > > partitioned by (year, month); > spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; > spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); > Error in query: Partition spec is invalid. The spec (YEAR, Month) must match > the partition spec (year, month) defined in table '`default`.`tbl1`'; > {code} > The spark.sql.caseSensitive flag is false by default, so, the partition spec > is valid. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
[ https://issues.apache.org/jira/browse/SPARK-33588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33588: Assignee: (was: Apache Spark) > Partition spec in SHOW TABLE EXTENDED doesn't respect > `spark.sql.caseSensitive` > --- > > Key: SPARK-33588 > URL: https://issues.apache.org/jira/browse/SPARK-33588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > > USING parquet > > partitioned by (year, month); > spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; > spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); > Error in query: Partition spec is invalid. The spec (YEAR, Month) must match > the partition spec (year, month) defined in table '`default`.`tbl1`'; > {code} > The spark.sql.caseSensitive flag is false by default, so, the partition spec > is valid. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
[ https://issues.apache.org/jira/browse/SPARK-33588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240045#comment-17240045 ] Apache Spark commented on SPARK-33588: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30529 > Partition spec in SHOW TABLE EXTENDED doesn't respect > `spark.sql.caseSensitive` > --- > > Key: SPARK-33588 > URL: https://issues.apache.org/jira/browse/SPARK-33588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > > USING parquet > > partitioned by (year, month); > spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; > spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); > Error in query: Partition spec is invalid. The spec (YEAR, Month) must match > the partition spec (year, month) defined in table '`default`.`tbl1`'; > {code} > The spark.sql.caseSensitive flag is false by default, so, the partition spec > is valid. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
Maxim Gekk created SPARK-33588: -- Summary: Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive` Key: SPARK-33588 URL: https://issues.apache.org/jira/browse/SPARK-33588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.8, 3.0.2, 3.1.0 Reporter: Maxim Gekk For example: {code:sql} spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > USING parquet > partitioned by (year, month); spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); Error in query: Partition spec is invalid. The spec (YEAR, Month) must match the partition spec (year, month) defined in table '`default`.`tbl1`'; {code} The spark.sql.caseSensitive flag is false by default, so, the partition spec is valid. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
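For comparison, a hedged sketch of the expected behaviour, reusing the table above: with the default spark.sql.caseSensitive=false both spellings of the partition columns should be accepted, but today only the exact-case spec works.

{code:scala}
// Assumes the tbl1 table created in the example above.
// Exact-case partition spec: succeeds today.
spark.sql("SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(year = 2015, month = 1)").show(false)
// Mixed-case spec: should also succeed while spark.sql.caseSensitive is false,
// but currently fails with "Partition spec is invalid".
spark.sql("SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1)").show(false)
{code}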
[jira] [Assigned] (SPARK-33587) Kill the executor on nested fatal errors
[ https://issues.apache.org/jira/browse/SPARK-33587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33587: Assignee: Apache Spark > Kill the executor on nested fatal errors > > > Key: SPARK-33587 > URL: https://issues.apache.org/jira/browse/SPARK-33587 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major > > Currently we kill the executor when hitting a fatal error. However, if the > fatal error is wrapped by another exception, such as > - java.util.concurrent.ExecutionException, > com.google.common.util.concurrent.UncheckedExecutionException, > com.google.common.util.concurrent.ExecutionError when using Guava cache and > java thread pool. > - SparkException thrown from this line: > https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231 > We will still keep the executor running. Fatal errors are usually > unrecoverable (such as OutOfMemoryError), some components may be in a broken > state when hitting a fatal error. Hence, it's better to detect the nested > fatal error as well and kill the executor. Then we can rely on Spark's fault > tolerance to recover. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33587) Kill the executor on nested fatal errors
[ https://issues.apache.org/jira/browse/SPARK-33587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33587: Assignee: (was: Apache Spark) > Kill the executor on nested fatal errors > > > Key: SPARK-33587 > URL: https://issues.apache.org/jira/browse/SPARK-33587 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Shixiong Zhu >Priority: Major > > Currently we kill the executor when hitting a fatal error. However, if the > fatal error is wrapped by another exception, such as > - java.util.concurrent.ExecutionException, > com.google.common.util.concurrent.UncheckedExecutionException, > com.google.common.util.concurrent.ExecutionError when using Guava cache and > java thread pool. > - SparkException thrown from this line: > https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231 > We will still keep the executor running. Fatal errors are usually > unrecoverable (such as OutOfMemoryError), some components may be in a broken > state when hitting a fatal error. Hence, it's better to detect the nested > fatal error as well and kill the executor. Then we can rely on Spark's fault > tolerance to recover. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33587) Kill the executor on nested fatal errors
[ https://issues.apache.org/jira/browse/SPARK-33587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240029#comment-17240029 ] Apache Spark commented on SPARK-33587: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/30528 > Kill the executor on nested fatal errors > > > Key: SPARK-33587 > URL: https://issues.apache.org/jira/browse/SPARK-33587 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Shixiong Zhu >Priority: Major > > Currently we kill the executor when hitting a fatal error. However, if the > fatal error is wrapped by another exception, such as > - java.util.concurrent.ExecutionException, > com.google.common.util.concurrent.UncheckedExecutionException, > com.google.common.util.concurrent.ExecutionError when using Guava cache and > java thread pool. > - SparkException thrown from this line: > https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231 > We will still keep the executor running. Fatal errors are usually > unrecoverable (such as OutOfMemoryError), some components may be in a broken > state when hitting a fatal error. Hence, it's better to detect the nested > fatal error as well and kill the executor. Then we can rely on Spark's fault > tolerance to recover. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33587) Kill the executor on nested fatal errors
Shixiong Zhu created SPARK-33587: Summary: Kill the executor on nested fatal errors Key: SPARK-33587 URL: https://issues.apache.org/jira/browse/SPARK-33587 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Shixiong Zhu Currently we kill the executor when hitting a fatal error. However, if the fatal error is wrapped by another exception, such as - java.util.concurrent.ExecutionException, com.google.common.util.concurrent.UncheckedExecutionException, com.google.common.util.concurrent.ExecutionError when using Guava cache and java thread pool. - SparkException thrown from this line: https://github.com/apache/spark/blob/cf98a761de677c733f3c33230e1c63ddb785d5c5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L231 We will still keep the executor running. Fatal errors are usually unrecoverable (such as OutOfMemoryError), some components may be in a broken state when hitting a fatal error. Hence, it's better to detect the nested fatal error as well and kill the executor. Then we can rely on Spark's fault tolerance to recover. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
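A minimal sketch of the idea (not the actual patch): walk the cause chain a few levels deep and treat an exception that wraps a fatal error the same as the fatal error itself.

{code:scala}
import scala.util.control.NonFatal

// Sketch only: true if `t` is fatal or wraps a fatal error within `depth`
// levels of getCause, e.g. an OutOfMemoryError inside an ExecutionException.
// In Spark this check would live in the executor's uncaught-error handling.
def hasNestedFatalError(t: Throwable, depth: Int = 5): Boolean = {
  if (t == null || depth == 0) false
  else if (!NonFatal(t)) true
  else hasNestedFatalError(t.getCause, depth - 1)
}

val wrapped = new java.util.concurrent.ExecutionException(new OutOfMemoryError("boom"))
assert(hasNestedFatalError(wrapped))                         // wrapped fatal error detected
assert(!hasNestedFatalError(new RuntimeException("benign"))) // ordinary exception ignored
{code}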
[jira] [Created] (SPARK-33586) BisectingKMeansModel save and load implementation in pyspark
Iman Kermani created SPARK-33586: Summary: BisectingKMeansModel save and load implementation in pyspark Key: SPARK-33586 URL: https://issues.apache.org/jira/browse/SPARK-33586 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 3.0.1 Environment: Spark 3.0.1 with Hadoop 2.7 Reporter: Iman Kermani BisectingKMeansModel save and load functions are implemented in Java and Scala. It would be nice if they were implemented in PySpark too. Thanks in advance -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
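For reference, a minimal usage sketch of the existing Scala (RDD-based MLlib) save/load that the request asks to mirror in PySpark; the output path is hypothetical.

{code:scala}
import org.apache.spark.mllib.clustering.{BisectingKMeans, BisectingKMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Train a tiny model and round-trip it through save/load.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))
val model = new BisectingKMeans().setK(2).run(data)
model.save(sc, "/tmp/bisecting-kmeans-model")
val loaded = BisectingKMeansModel.load(sc, "/tmp/bisecting-kmeans-model")
println(loaded.clusterCenters.mkString(", "))
{code}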
[jira] [Commented] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239994#comment-17239994 ] Kousuke Saruta commented on SPARK-33570: Issue resolved by pull request 30515 https://github.com/apache/spark/pull/30515 > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationSuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months) > so I don't think it's a good idea to set to an specific version for > mariadb-plugin-gssapi-server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-33570. Fix Version/s: 3.1.0 Resolution: Fixed > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationSuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months) > so I don't think it's a good idea to set to an specific version for > mariadb-plugin-gssapi-server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-33570: --- Summary: Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite (was: Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationsuite) > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationSuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months) > so I don't think it's a good idea to set to an specific version for > mariadb-plugin-gssapi-server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
[ https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33585: Assignee: (was: Apache Spark) > The comment for SQLContext.tables() doesn't mention the `database` column > - > > Key: SPARK-33585 > URL: https://issues.apache.org/jira/browse/SPARK-33585 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > The comment says: "The returned DataFrame has two columns, tableName and > isTemporary": > https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 > but actually the dataframe has 3 columns: > {code:scala} > scala> spark.range(10).createOrReplaceTempView("view1") > scala> val tables = spark.sqlContext.tables() > tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string > ... 1 more field] > scala> tables.printSchema > root > |-- database: string (nullable = false) > |-- tableName: string (nullable = false) > |-- isTemporary: boolean (nullable = false) > scala> tables.show > ++-+---+ > |database|tableName|isTemporary| > ++-+---+ > | default| t1| false| > | default| t2| false| > | default| ymd| false| > ||view1| true| > ++-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
[ https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239991#comment-17239991 ] Apache Spark commented on SPARK-33585: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30526 > The comment for SQLContext.tables() doesn't mention the `database` column > - > > Key: SPARK-33585 > URL: https://issues.apache.org/jira/browse/SPARK-33585 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > The comment says: "The returned DataFrame has two columns, tableName and > isTemporary": > https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 > but actually the dataframe has 3 columns: > {code:scala} > scala> spark.range(10).createOrReplaceTempView("view1") > scala> val tables = spark.sqlContext.tables() > tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string > ... 1 more field] > scala> tables.printSchema > root > |-- database: string (nullable = false) > |-- tableName: string (nullable = false) > |-- isTemporary: boolean (nullable = false) > scala> tables.show > ++-+---+ > |database|tableName|isTemporary| > ++-+---+ > | default| t1| false| > | default| t2| false| > | default| ymd| false| > ||view1| true| > ++-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
[ https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33585: Assignee: Apache Spark > The comment for SQLContext.tables() doesn't mention the `database` column > - > > Key: SPARK-33585 > URL: https://issues.apache.org/jira/browse/SPARK-33585 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The comment says: "The returned DataFrame has two columns, tableName and > isTemporary": > https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 > but actually the dataframe has 3 columns: > {code:scala} > scala> spark.range(10).createOrReplaceTempView("view1") > scala> val tables = spark.sqlContext.tables() > tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string > ... 1 more field] > scala> tables.printSchema > root > |-- database: string (nullable = false) > |-- tableName: string (nullable = false) > |-- isTemporary: boolean (nullable = false) > scala> tables.show > ++-+---+ > |database|tableName|isTemporary| > ++-+---+ > | default| t1| false| > | default| t2| false| > | default| ymd| false| > ||view1| true| > ++-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
Maxim Gekk created SPARK-33585: -- Summary: The comment for SQLContext.tables() doesn't mention the `database` column Key: SPARK-33585 URL: https://issues.apache.org/jira/browse/SPARK-33585 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.0.1, 2.4.7, 3.1.0 Reporter: Maxim Gekk The comment says: "The returned DataFrame has two columns, tableName and isTemporary": https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 but actually the dataframe has 3 columns:
{code:scala}
scala> spark.range(10).createOrReplaceTempView("view1")
scala> val tables = spark.sqlContext.tables()
tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
scala> tables.printSchema
root
 |-- database: string (nullable = false)
 |-- tableName: string (nullable = false)
 |-- isTemporary: boolean (nullable = false)
scala> tables.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|       t1|      false|
| default|       t2|      false|
| default|      ymd|      false|
|        |    view1|       true|
+--------+---------+-----------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
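To make the documentation gap concrete, here is a small usage sketch; it is an illustration rather than part of the ticket, assumes an active SparkSession named {{spark}}, and uses only the public SQLContext.tables() API (the view name is arbitrary). It shows why callers need to know about the {{database}} column: temporary views carry an empty database value, while persistent tables carry the name of their database.
{code:scala}
// Minimal sketch, assuming an active SparkSession named `spark`.
spark.range(10).createOrReplaceTempView("view1")

val tables = spark.sqlContext.tables()

// Persistent tables in the current database carry its name in `database`.
tables.filter(tables("database") === "default").show()

// Temporary views have an empty `database` value, so the column is needed
// to tell them apart from persistent tables.
tables.filter(tables("isTemporary")).select("database", "tableName").show()
{code}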
[jira] [Commented] (SPARK-33584) Partition predicate pushdown into Hive metastore support cast string type to date type
[ https://issues.apache.org/jira/browse/SPARK-33584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239935#comment-17239935 ] Yuming Wang commented on SPARK-33584: - This change should come after SPARK-33581; I have prepared the PR: https://github.com/wangyum/spark/tree/SPARK-33584 > Partition predicate pushdown into Hive metastore support cast string type to > date type > -- > > Key: SPARK-33584 > URL: https://issues.apache.org/jira/browse/SPARK-33584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > >
> {code:scala}
> spark.sql("create table t1(id string) partitioned by (part string) stored as parquet")
> spark.sql("insert into t1 values('1', '2019-01-01')")
> spark.sql("insert into t1 values('2', '2019-01-02')")
> spark.sql("select * from t1 where part = date '2019-01-01' ").show
> {code}
> We can push down {{cast(part as date) = date '2019-01-01'}} to the Hive metastore.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33584) Partition predicate pushdown into Hive metastore support cast string type to date type
Yuming Wang created SPARK-33584: --- Summary: Partition predicate pushdown into Hive metastore support cast string type to date type Key: SPARK-33584 URL: https://issues.apache.org/jira/browse/SPARK-33584 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang Assignee: Yuming Wang
{code:scala}
spark.sql("create table t1(id string) partitioned by (part string) stored as parquet")
spark.sql("insert into t1 values('1', '2019-01-01')")
spark.sql("insert into t1 values('2', '2019-01-02')")
spark.sql("select * from t1 where part = date '2019-01-01' ").show
{code}
We can push down {{cast(part as date) = date '2019-01-01'}} to the Hive metastore.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
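To spell out the intended rewrite, here is a small self-contained sketch of the idea only; the helper name and shape below are hypothetical and are not the actual Spark/Hive shim code. The point is that a predicate of the form {{cast(part as date) = date 'D'}} on a string partition column can be handed to the metastore as the string filter {{part = "D"}}, since partition values are stored as strings.
{code:scala}
import java.time.LocalDate
import scala.util.Try

// Hypothetical illustration only, not the real pushdown implementation.
// Builds a metastore-style filter string for a string partition column
// compared against a date literal; returns None if the literal is not a
// valid ISO date, in which case pruning would stay on the client side.
def datePredicateToMetastoreFilter(column: String, dateLiteral: String): Option[String] =
  Try(LocalDate.parse(dateLiteral)).toOption.map(d => s"$column = \"$d\"")

// Example: datePredicateToMetastoreFilter("part", "2019-01-01")
// returns Some(part = "2019-01-01"), matching the partition created above.
{code}
One caveat of such a string-based filter is that it only matches partition values already written in canonical yyyy-MM-dd form, which is the case for the example table above.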
[jira] [Updated] (SPARK-33582) Partition predicate pushdown into Hive metastore support not-equals
[ https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33582: Summary: Partition predicate pushdown into Hive metastore support not-equals (was: Hive partition pruning support not-equals) > Partition predicate pushdown into Hive metastore support not-equals > --- > > Key: SPARK-33582 > URL: https://issues.apache.org/jira/browse/SPARK-33582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207 > https://issues.apache.org/jira/browse/HIVE-2702 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
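For illustration, this is the query shape the sub-task targets, in the same style as the SPARK-33584 example; the table here is hypothetical, and whether the {{<>}} predicate actually reaches the metastore depends on the final patch and on metastore-side support (see HIVE-2702 above).
{code:scala}
// Hypothetical partitioned table; only the not-equals predicate shape matters.
spark.sql("create table t2(id string) partitioned by (part string) stored as parquet")
spark.sql("insert into t2 values('1', '2019-01-01')")
spark.sql("insert into t2 values('2', '2019-01-02')")

// With not-equals pushdown, `part <> '2019-01-01'` could be sent to the Hive
// metastore as a partition filter instead of listing every partition first.
spark.sql("select * from t2 where part <> '2019-01-01'").show()
{code}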
[jira] [Created] (SPARK-33583) Query on large dataset with forEachPartitionAsync performance needs to improve
Miron created SPARK-33583: - Summary: Query on large dataset with forEachPartitionAsync performance needs to improve Key: SPARK-33583 URL: https://issues.apache.org/jira/browse/SPARK-33583 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Environment: Spark 2.4.4 Scala 2.11.10 Reporter: Miron
Repro steps:
Load 300GB of data from a JSON file into a table. Note that this table has a field ID that groups rows into reasonably sized sets of roughly 50,000 rows each.
Issue a query against this table, returning a DataFrame instance.
Harvest rows with df.rdd.foreachPartitionAsync.
Place a logging line as the first statement of the outer lambda expression that iterates over partitions; say it reads "Line #1 ( some timestamp with milliseconds )".
Place a logging line into the nested lambda expression that reads rows, such that it runs only when the first row is accessed; say it reads "Line #2 ( some timestamp with milliseconds )".
Once the query has completed, take the difference in milliseconds between the timestamps logged by line #1 and line #2 above.
It would be fairly reasonable to expect that difference to be as close to 0 as possible. In reality it is more than 1 second, usually more than 2. This really hurts query performance.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33583) Query on large dataset with foreachPartitionAsync performance needs to improve
[ https://issues.apache.org/jira/browse/SPARK-33583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miron updated SPARK-33583: -- Summary: Query on large dataset with foreachPartitionAsync performance needs to improve (was: Query on large dataset with forEachPartitionAsync performance needs to improve) > Query on large dataset with foreachPartitionAsync performance needs to improve > -- > > Key: SPARK-33583 > URL: https://issues.apache.org/jira/browse/SPARK-33583 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 > Scala 2.11.10 >Reporter: Miron >Priority: Major > >
> Repro steps:
> Load 300GB of data from a JSON file into a table. Note that this table has a field ID that groups rows into reasonably sized sets of roughly 50,000 rows each.
> Issue a query against this table, returning a DataFrame instance.
> Harvest rows with df.rdd.foreachPartitionAsync.
> Place a logging line as the first statement of the outer lambda expression that iterates over partitions; say it reads "Line #1 ( some timestamp with milliseconds )".
> Place a logging line into the nested lambda expression that reads rows, such that it runs only when the first row is accessed; say it reads "Line #2 ( some timestamp with milliseconds )".
> Once the query has completed, take the difference in milliseconds between the timestamps logged by line #1 and line #2 above.
> It would be fairly reasonable to expect that difference to be as close to 0 as possible. In reality it is more than 1 second, usually more than 2. This really hurts query performance.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
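A compact sketch of the measurement described in the repro steps above; it assumes a SparkSession named {{spark}} and a DataFrame {{df}} produced by the query, and the timing and logging code is illustrative rather than the reporter's exact code.
{code:scala}
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// `df` is assumed to be the DataFrame returned by the query over the large table.
val action = df.rdd.foreachPartitionAsync { rows =>
  // Line #1: the partition lambda has been entered.
  val t1 = System.currentTimeMillis()
  println(s"Line #1 ($t1)")

  var first = true
  rows.foreach { _ =>
    if (first) {
      // Line #2: the first row of this partition is actually being read.
      val t2 = System.currentTimeMillis()
      println(s"Line #2 ($t2), delta = ${t2 - t1} ms")
      first = false
    }
  }
}
// Note: the println calls above run on executors, so the lines land in executor logs.
Await.result(action, Duration.Inf)
{code}
The delta between Line #1 and Line #2 is the quantity the report measures: how long a task waits before the first row of its partition becomes available.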