[jira] [Created] (SPARK-34024) datasourceV1 VS dataSourceV2
Zhenglin luo created SPARK-34024:

Summary: datasourceV1 VS dataSourceV2
Key: SPARK-34024
URL: https://issues.apache.org/jira/browse/SPARK-34024
Project: Spark
Issue Type: Question
Components: Input/Output
Affects Versions: 3.0.0
Reporter: Zhenglin luo

I found that DataSourceV2 has gone through many revisions, so why is DataSourceV2 still not used by default in the latest version? I would like to know whether there is a big difference in execution efficiency between V1 and V2, or whether there are other reasons. Thanks a lot.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
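For context on the question above: Spark 3.0 routes its built-in file sources through the V1 path via a configuration allow-list, `spark.sql.sources.useV1SourceList`. The sketch below shows how that list is typically set; treat the exact default value as an assumption to verify against the documentation for your Spark version.

```
# spark-defaults.conf -- illustrative sketch; verify the default for your version.
# Sources named here keep the V1 read/write path; removing a name opts that
# source into the V2 path where a V2 implementation exists.
spark.sql.sources.useV1SourceList  avro,csv,json,kafka,orc,parquet,text
```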
[jira] [Assigned] (SPARK-34023) Updated to Kryo 5.0.3
[ https://issues.apache.org/jira/browse/SPARK-34023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34023:

Assignee: (was: Apache Spark)

> Updated to Kryo 5.0.3
> Key: SPARK-34023
> URL: https://issues.apache.org/jira/browse/SPARK-34023
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Nick Nezis
> Priority: Major
[jira] [Assigned] (SPARK-34023) Updated to Kryo 5.0.3
[ https://issues.apache.org/jira/browse/SPARK-34023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34023:

Assignee: Apache Spark

> Updated to Kryo 5.0.3
> Key: SPARK-34023
> URL: https://issues.apache.org/jira/browse/SPARK-34023
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Nick Nezis
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-34023) Updated to Kryo 5.0.3
[ https://issues.apache.org/jira/browse/SPARK-34023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259469#comment-17259469 ]

Apache Spark commented on SPARK-34023:

User 'nicknezis' has created a pull request for this issue:
https://github.com/apache/spark/pull/31059

> Updated to Kryo 5.0.3
> Key: SPARK-34023
> URL: https://issues.apache.org/jira/browse/SPARK-34023
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Nick Nezis
> Priority: Major
[jira] [Commented] (SPARK-34011) ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259468#comment-17259468 ]

Apache Spark commented on SPARK-34011:

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31060

> ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
> Key: SPARK-34011
> URL: https://issues.apache.org/jira/browse/SPARK-34011
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1, 3.1.0, 3.2.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Labels: correctness
> Fix For: 3.2.0
>
> Here is an example that reproduces the issue. The second SELECT should reflect the renamed partition (row 0 2), but the stale cached result is returned instead:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
> spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
> spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
> spark-sql> CACHE TABLE tbl1;
> spark-sql> SELECT * FROM tbl1;
> 0	0
> 1	1
> spark-sql> ALTER TABLE tbl1 PARTITION (part0 = 0) RENAME TO PARTITION (part0 = 2);
> spark-sql> SELECT * FROM tbl1;
> 0	0
> 1	1
> {code}
[jira] [Updated] (SPARK-34023) Updated to Kryo 5.0.3
[ https://issues.apache.org/jira/browse/SPARK-34023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-34023:

Reporter: Nick Nezis (was: Dongjoon Hyun)

> Updated to Kryo 5.0.3
> Key: SPARK-34023
> URL: https://issues.apache.org/jira/browse/SPARK-34023
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Nick Nezis
> Priority: Major
[jira] [Created] (SPARK-34023) Updated to Kryo 5.0.3
Dongjoon Hyun created SPARK-34023:

Summary: Updated to Kryo 5.0.3
Key: SPARK-34023
URL: https://issues.apache.org/jira/browse/SPARK-34023
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-34022) Support latest mkdocs in SQL functions doc
[ https://issues.apache.org/jira/browse/SPARK-34022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-34022:

Description: The list on the sidebar does not show the list of functions.

> Support latest mkdocs in SQL functions doc
> Key: SPARK-34022
> URL: https://issues.apache.org/jira/browse/SPARK-34022
> Project: Spark
> Issue Type: Task
> Components: Documentation, SQL
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Attachments: Screen Shot 2021-01-06 at 4.22.50 PM.png
>
> The list on the sidebar does not show the list of functions.
[jira] [Updated] (SPARK-34022) Support latest mkdocs in SQL functions doc
[ https://issues.apache.org/jira/browse/SPARK-34022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-34022:

Attachment: Screen Shot 2021-01-06 at 4.22.50 PM.png

> Support latest mkdocs in SQL functions doc
> Key: SPARK-34022
> URL: https://issues.apache.org/jira/browse/SPARK-34022
> Project: Spark
> Issue Type: Task
> Components: Documentation, SQL
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Attachments: Screen Shot 2021-01-06 at 4.22.50 PM.png
[jira] [Created] (SPARK-34022) Support latest mkdocs in SQL functions doc
Hyukjin Kwon created SPARK-34022:

Summary: Support latest mkdocs in SQL functions doc
Key: SPARK-34022
URL: https://issues.apache.org/jira/browse/SPARK-34022
Project: Spark
Issue Type: Task
Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon
Attachments: Screen Shot 2021-01-06 at 4.22.50 PM.png
[jira] [Resolved] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-33948.

Fix Version/s: 3.1.0
Resolution: Fixed

Issue resolved by pull request 31055
[https://github.com/apache/spark/pull/31055]

> ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
> Key: SPARK-33948
> URL: https://issues.apache.org/jira/browse/SPARK-33948
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 3.1.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Fix For: 3.1.0
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink
>
> Failing tests, all in org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite
> (https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/):
> * encode/decode for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (codegen path)
> * encode/decode for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (interpreted path)
> * encode/decode for Tuple2: (ArrayBuffer[(Int, Int)],ArrayBuffer((1,2))) (codegen path)
> * encode/decode for Tuple2: (ArrayBuffer[(Int, Int)],ArrayBuffer((1,2))) (interpreted path)
> * encode/decode for Tuple2: (ArrayBuffer[(Long, Long)],ArrayBuffer((1,2))) (codegen path)
> * encode/decode for Tuple2: (ArrayBuffer[(Long, Long)],ArrayBuffer((1,2))) (interpreted path)
> * encode/decode for Tuple2: (ArrayBuffer[(Float, Float)],ArrayBuffer((1.0,2.0))) (codegen path)
> * encode/decode for Tuple2: (ArrayBuffer[(Float, Float)],ArrayBuffer((1.0,2.0))) (interpreted path)
> * encode/decode for Tuple2: (ArrayBuffer[(Double,
[jira] [Assigned] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-33948:

Assignee: Yang Jie

> ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
> Key: SPARK-33948
> URL: https://issues.apache.org/jira/browse/SPARK-33948
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 3.1.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
[jira] [Commented] (SPARK-33966) Two-tier encryption key management
[ https://issues.apache.org/jira/browse/SPARK-33966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259454#comment-17259454 ]

DB Tsai commented on SPARK-33966:

cc [~dongjoon] [~chaosun] and [~viirya]

> Two-tier encryption key management
> Key: SPARK-33966
> URL: https://issues.apache.org/jira/browse/SPARK-33966
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Gidon Gershinsky
> Priority: Major
>
> Columnar data formats (Parquet and ORC) have recently added a column encryption capability. The data protection follows the practice of envelope encryption, where the Data Encryption Key (DEK) is freshly generated for each file/column and is encrypted with a master key (or with an intermediate key that is in turn encrypted with a master key). The master keys are kept in a centralized Key Management Service (KMS), meaning that each Spark worker needs to interact with a (typically slow) KMS server.
>
> This Jira (and its sub-tasks) introduces an alternative approach that, on one hand, preserves the best practice of generating fresh encryption keys for each data file/column and, on the other hand, allows Spark clusters to interact with a KMS server scalably by delegating that interaction to the application driver. This is done via two-tier management of the keys: a random Key Encryption Key (KEK) is generated by the driver, encrypted with the master key in the KMS, and distributed by the driver to the workers, so they can use it to encrypt the DEKs generated there by the Parquet or ORC libraries. In the workers, the KEKs are distributed to the executors/threads in the write path. In the read path, the encrypted KEKs are fetched by the workers from file metadata, decrypted via interaction with the driver, and shared among the executors/threads.
>
> The KEK layer further improves the scalability of key management, because neither the driver nor the workers need to interact with the KMS for each file/column.
>
> Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks (e.g., Presto, pandas) must be able to read/decrypt the files written/encrypted by this Spark-driven key-management mechanism, and vice versa (of course, only if both sides have proper authorisation for using the master keys in the KMS).
>
> A link to a discussion/design doc is attached.
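The two-tier scheme described above (master key in a KMS, driver-generated KEKs, per-file DEKs) is an instance of envelope encryption. Below is a minimal toy sketch of the key hierarchy in Python. The XOR "cipher" is a deterministic stand-in for real AES and KMS calls, and every name here is hypothetical; this illustrates only the wrap/unwrap flow, not anything usable for actual encryption.

```python
import hashlib
import secrets

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Stand-in for a real cipher: XOR data with a SHA-256-derived keystream."""
    stream = hashlib.sha256(key).digest()
    # Extend the keystream to cover the data (toy construction only).
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

toy_decrypt = toy_encrypt  # XOR with the same keystream is its own inverse

# Tier 0: master key, held only by the KMS; only the driver talks to the KMS.
master_key = secrets.token_bytes(32)

# Tier 1: the driver generates a random KEK and has the KMS wrap it once.
kek = secrets.token_bytes(32)
wrapped_kek = toy_encrypt(master_key, kek)   # stored in file metadata

# Tier 2: workers generate fresh DEKs per file/column and wrap them with the
# KEK, so no per-file round trip to the KMS is needed.
dek = secrets.token_bytes(32)
wrapped_dek = toy_encrypt(kek, dek)

# Read path: unwrap the KEK via the driver/KMS, then unwrap the DEK locally.
recovered_dek = toy_decrypt(toy_decrypt(master_key, wrapped_kek), wrapped_dek)
assert recovered_dek == dek
```

The point of the middle tier is visible in the structure: the expensive `master_key` operation happens once per KEK, while DEK wrapping and unwrapping stays local to the workers.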
[jira] [Commented] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259446#comment-17259446 ]

Jungtaek Lim commented on SPARK-33635:

One more point: although the root cause is actually in changes on the "Kafka" side, you'll find a huge difference in log-file size between Spark 2.4 and 3.0. Some "debug" log messages in Kafka 2.0 (which Spark 2.4 uses) were relabeled as "info" log messages in later versions of Kafka (at least including Kafka 2.4, which Spark 3.0 uses). I found such changes in KafkaConsumer.seek(), and there could be more. If you feel these messages are flooding the log and likely affecting performance (though that would be unlikely), you can change your log4j configuration to suppress them:

log4j.logger.org.apache.kafka.clients.consumer.KafkaConsumer=WARN

> Performance regression in Kafka read
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5-node system. A simple data row of CSV data in Kafka, evenly distributed between the partitions.
> OpenJDK 1.8.0.252
> Spark in stand-alone mode: 5 nodes, 10 workers (2 workers per node, each locked to a distinct NUMA group)
> Kafka (v2.3.1) cluster: 5 nodes (1 broker per node)
> CentOS 7.7.1908
> 1 topic, 10 partitions, 1-hour queue life
> (This is just one of the clusters we have; I have tested on all of them and they all exhibit the same performance degradation.)
> Reporter: David Wyles
> Assignee: Jungtaek Lim
> Priority: Blocker
> Fix For: 3.0.2, 3.1.0
>
> I have observed a slowdown in the reading of data from Kafka on all of our systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
>
> I have created a sample project to isolate the problem as much as possible, which just reads all data from a Kafka topic (see [https://github.com/codegorillauk/spark-kafka-read]).
>
> With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 632,000 (0.632 million) rows per second.
> That represents a *44% loss in performance*, which is a lot.
>
> I have been working through the spark-sql-kafka-0-10 code base, but changes for Spark 3 have been ongoing for over a year and it is difficult to pinpoint an exact change or reason for the degradation.
>
> I am happy to help fix this problem, but will need some assistance as I am unfamiliar with the spark-sql-kafka-0-10 project.
>
> A sample of the data my test reads (note: it is not parsing CSV; this is just test data):
>
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) [C],2017/12/15 00:00:00,
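The one-line log4j override suggested in the comment above goes into the log4j.properties file that Spark loads; the `conf/log4j.properties` path below is the usual location for a stand-alone deployment and is an assumption for your setup.

```
# conf/log4j.properties (excerpt) -- suppresses the KafkaConsumer messages
# that moved from DEBUG to INFO in newer Kafka clients.
log4j.logger.org.apache.kafka.clients.consumer.KafkaConsumer=WARN
```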
[jira] [Updated] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33635:

Target Version/s: 3.0.2, 3.1.0

> Performance regression in Kafka read
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: David Wyles
> Assignee: Jungtaek Lim
> Priority: Blocker
> Fix For: 3.0.2, 3.1.0
[jira] [Resolved] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-33635.

Fix Version/s: 3.0.2, 3.1.0
Resolution: Fixed

Issue resolved by pull request 31056
[https://github.com/apache/spark/pull/31056]

> Performance regression in Kafka read
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: David Wyles
> Assignee: Jungtaek Lim
> Priority: Blocker
> Fix For: 3.1.0, 3.0.2
[jira] [Assigned] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-33635:

Assignee: Jungtaek Lim

> Performance regression in Kafka read
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: David Wyles
> Assignee: Jungtaek Lim
> Priority: Blocker
[jira] [Commented] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259432#comment-17259432 ]

Apache Spark commented on SPARK-34021:

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31058

> Fix hyper links in SparkR documentation for CRAN submission
> Key: SPARK-34021
> URL: https://issues.apache.org/jira/browse/SPARK-34021
> Project: Spark
> Issue Type: Task
> Components: SparkR
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Blocker
>
> CRAN submission fails due to:
> {code}
> Found the following (possibly) invalid URLs:
>   URL: http://jsonlines.org/ (moved to https://jsonlines.org/)
>     From: man/read.json.Rd
>           man/write.json.Rd
>     Status: 200
>     Message: OK
>   URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to https://dl.acm.org/doi/10.1109/MC.2009.263)
>     From: inst/doc/sparkr-vignettes.html
>     Status: 200
>     Message: OK
> {code}
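The CRAN check above flags URLs that now redirect; the fix is to update them at their source. A hedged sketch of a bulk replacement follows. The paths and file names are illustrative (the real SparkR `man/*.Rd` files are generated from roxygen comments, so the actual fix belongs in the R sources), and only the two URL pairs come from the check output above.

```python
# Rewrite the moved URLs flagged by the CRAN check across a tree of .R files.
import pathlib

REPLACEMENTS = {
    "http://jsonlines.org/": "https://jsonlines.org/",
    "https://dl.acm.org/citation.cfm?id=1608614":
        "https://dl.acm.org/doi/10.1109/MC.2009.263",
}

def fix_urls(root: pathlib.Path) -> int:
    """Apply REPLACEMENTS in every .R file under root; return files changed."""
    changed = 0
    for path in root.rglob("*.R"):
        text = original = path.read_text()
        for old, new in REPLACEMENTS.items():
            text = text.replace(old, new)
        if text != original:
            path.write_text(text)
            changed += 1
    return changed

# Demo on a scratch directory with a hypothetical file name.
demo = pathlib.Path("demo_r_pkg")
demo.mkdir(exist_ok=True)
(demo / "read.json.R").write_text("See http://jsonlines.org/ for the format.\n")
fix_urls(demo)
```

Running the function a second time changes nothing, which makes it safe to re-run after regenerating the docs.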
[jira] [Assigned] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34021: Assignee: (was: Apache Spark) > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34021: Assignee: Apache Spark > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259430#comment-17259430 ] Apache Spark commented on SPARK-34021: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/31058 > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34008) Upgrade derby to 10.14.2.0
[ https://issues.apache.org/jira/browse/SPARK-34008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34008. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31032 [https://github.com/apache/spark/pull/31032] > Upgrade derby to 10.14.2.0 > -- > > Key: SPARK-34008 > URL: https://issues.apache.org/jira/browse/SPARK-34008 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > Derby 10.14.2.0 seems to be the final release that supports JDK 8 as the > minimum required version, so let's upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] Maple edited comment on SPARK-33981 at 1/6/21, 5:48 AM: I know what the problem is. On the Spark history server UI (default port 18080), the storage page is empty even though the application is running. On the YARN UI (default port 8088), [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty while the application is running; once the application completes, the storage page becomes empty. was (Author: 995582386): I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty,even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application completes,storage page becomes empty. > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] Maple edited comment on SPARK-33981 at 1/6/21, 5:33 AM: I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty,even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application completes,storage page becomes empty. was (Author: 995582386): I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty.even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application completes,storage page becomes empty. > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34017) Pass json column information via pruneColumns()
[ https://issues.apache.org/jira/browse/SPARK-34017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259423#comment-17259423 ] Ted Yu commented on SPARK-34017: For PushDownUtils#pruneColumns, I am experimenting with the following:
{code}
case r: SupportsPushDownRequiredColumns if SQLConf.get.nestedSchemaPruningEnabled =>
  val JSONCapture = "get_json_object\\((.*), *(.*)\\)".r
  var jsonRootFields: ArrayBuffer[RootField] = ArrayBuffer()
  projects.map { _.map { f => f.toString match {
    case JSONCapture(column, field) =>
      jsonRootFields += RootField(StructField(column, f.dataType, f.nullable),
        derivedFromAtt = false, prunedIfAnyChildAccessed = true)
    case _ => logDebug("else " + f)
  }}}
  val rootFields = SchemaPruning.identifyRootFields(projects, filters) ++ jsonRootFields
{code}
> Pass json column information via pruneColumns() > --- > > Key: SPARK-34017 > URL: https://issues.apache.org/jira/browse/SPARK-34017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushDownUtils#pruneColumns only passes root fields to > SupportsPushDownRequiredColumns implementation(s). > {code} > 2021-01-05 19:36:07,437 (Time-limited test) [DEBUG - > org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] nested schema > projection List(id#33, address#34, phone#36, get_json_object(phone#36, > $.code) AS get_json_object(phone, $.code)#37) > 2021-01-05 19:36:07,438 (Time-limited test) [DEBUG - > org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] nested schema > StructType(StructField(id,IntegerType,false), > StructField(address,StringType,true), StructField(phone,StringType,true)) > {code} > The first line shows projections and the second line shows the pruned schema. > We can see that get_json_object(phone#36, $.code) is filtered. This > expression retrieves field 'code' from phone json column. > We should allow json column information to be passed via pruneColumns().
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
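Ted Yu's snippet above recovers the column reference and JSON path from a projection's string form, e.g. get_json_object(phone#36, $.code), with a regular expression. The same capture step, sketched standalone in Python (the regex mirrors the Scala one; the RootField bookkeeping is omitted and the function name is illustrative):

```python
import re

# Mirror of the Scala regex, applied to the string form of a projection
# such as "get_json_object(phone#36, $.code)".
JSON_CAPTURE = re.compile(r"get_json_object\((.*), *(.*)\)")

def capture_json_field(expr: str):
    """Return (column, json_path) for a get_json_object expression, else None."""
    m = JSON_CAPTURE.fullmatch(expr)  # Scala's Regex extractor also requires a full match
    return m.groups() if m else None

print(capture_json_field("get_json_object(phone#36, $.code)"))  # ('phone#36', '$.code')
print(capture_json_field("address#34"))                         # None
```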
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maple updated SPARK-33981: -- Attachment: (was: image-2021-01-06-11-20-49-804.png) > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maple updated SPARK-33981: -- Attachment: (was: screenshot-1.png) > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maple updated SPARK-33981: -- Attachment: (was: image-2021-01-06-11-19-09-849.png) > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] Maple edited comment on SPARK-33981 at 1/6/21, 5:31 AM: I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty.even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application completes,storage page becomes empty. was (Author: 995582386): I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty.even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application complete,storage page becomes empty. > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png, > image-2021-01-06-11-19-09-849.png, image-2021-01-06-11-20-49-804.png, > screenshot-1.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] Maple edited comment on SPARK-33981 at 1/6/21, 5:30 AM: I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty.even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application complete,storage page becomes empty. was (Author: 995582386): I try in Spark 2.4.7,it still exists. !image-2021-01-06-11-20-49-804.png! !screenshot-1.png!!image-2021-01-06-11-19-09-849.png! > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png, > image-2021-01-06-11-19-09-849.png, image-2021-01-06-11-20-49-804.png, > screenshot-1.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34021: - Target Version/s: 3.1.0 > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34021: - Priority: Blocker (was: Critical) > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34020) IndexOutOfBoundsException on merge of two pyspark frames
[ https://issues.apache.org/jira/browse/SPARK-34020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darshat resolved SPARK-34020. - Resolution: Invalid The call stack pointed to an issue with apply, not the join. I think this may be the udf limit issue again. > IndexOutOfBoundsException on merge of two pyspark frames > > > Key: SPARK-34020 > URL: https://issues.apache.org/jira/browse/SPARK-34020 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Darshat >Priority: Major > > We are using Databricks on Azure, with Apache Spark 3.0.0 and Scala 2.12. > When two tables are joined - one with 36 million rows, the other with 4k rows - we > get an IndexOutOfBoundsException with Arrow on the call stack. > The cluster has 72 nodes and 288 cores. Workers have 16 GB memory overall. The > spark.sql.shuffle.partitions is set to 288. > Since the join key has an uneven distribution, we also tried repartitioning it into > 1000 partitions of the join key, but this results in the same error. > Any pointers on what can be causing this issue would be very helpful.
Thanks, > Darshat > {{21/01/06 04:05:06 ERROR ArrowPythonRunner: Python worker exited > unexpectedly (crashed)}} > {{org.apache.spark.api.python.PythonException: Traceback (most recent call > last):}} > {{ File "/databricks/spark/python/pyspark/worker.py", line 640, in main}} > {{ eval_type = read_int(infile)}} > {{ File "/databricks/spark/python/pyspark/serializers.py", line 603, in > read_int}} > {{ raise EOFError}} > {{EOFError}}{{at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:585)}} > {{ at > org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99)}} > {{ at > org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)}} > {{ at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:538)}} > {{ at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)}} > {{ at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)}} > {{ at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)}} > {{ at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage16.processNext(Unknown > Source)}} > {{ at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)}} > {{ at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)}} > {{ at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)}} > {{ at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)}} > {{ at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)}} > {{ at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)}} > {{ at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)}} > {{ at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)}} > {{ at 
org.apache.spark.scheduler.Task.run(Task.scala:117)}} > {{ at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)}} > {{ at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)}} > {{ at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)}} > {{ at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}} > {{ at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}} > {{ at java.lang.Thread.run(Thread.java:748)}} > {{Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: > 1073741824 (expected: range(0, 0))}} > {{ at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)}} > {{ at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)}} > {{ at > org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)}} > {{ at > org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)}} > {{ at > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)}} > {{ at > org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:278)}} > {{ at > org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:139)}} > {{ at > org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:93)}} > {{ at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)}} > {{ at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)}} > {{ at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)}} > {{ at >
[jira] [Created] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
Hyukjin Kwon created SPARK-34021: Summary: Fix hyper links in SparkR documentation for CRAN submission Key: SPARK-34021 URL: https://issues.apache.org/jira/browse/SPARK-34021 Project: Spark Issue Type: Task Components: SparkR Affects Versions: 3.1.0 Reporter: Hyukjin Kwon CRAN submission fails due to: {code} Found the following (possibly) invalid URLs: URL: http://jsonlines.org/ (moved to https://jsonlines.org/) From: man/read.json.Rd man/write.json.Rd Status: 200 Message: OK URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to https://dl.acm.org/doi/10.1109/MC.2009.263) From: inst/doc/sparkr-vignettes.html Status: 200 Message: OK {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted
[ https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259415#comment-17259415 ] Apache Spark commented on SPARK-33029: -- User 'baohe-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/31057 > Standalone mode blacklist executors page UI marks driver as blacklisted > --- > > Key: SPARK-33029 > URL: https://issues.apache.org/jira/browse/SPARK-33029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot > 2020-09-29 at 1.53.37 PM.png > > > I am running a spark shell on a 1 node standalone cluster. I noticed that > the executors page ui was marking the driver as blacklisted for the stage > that is running. Attached a screen shot. > Also, in my case one of the executors died and it doesn't seem like the > scheduler picked up the new one. It doesn't show up on the stages page and > just shows it as active but none of the tasks ran there. > > You can reproduce this by starting a master and slave on a single node, then > launch a shell like the following, where you will get multiple executors (in this case I got > 3) > $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 > --conf spark.blacklist.enabled=true > > From shell run: > {code:java} > import org.apache.spark.TaskContext > val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it => > val context = TaskContext.get() > if (context.attemptNumber() < 2) { > throw new Exception("test attempt num") > } > it > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-33635: - Priority: Blocker (was: Major) > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5 node system. A simple data row of csv data in > kafka, evenly distributed between the partitions. > Open JDK 1.8.0.252 > Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each locked to > a distinct NUMA group) > kafka (v 2.3.1) cluster - 5 nodes (1 broker per node). > Centos 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and > they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Blocker > > I have observed a slowdown in the reading of data from kafka on all of our > systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible, > which just reads all data from a kafka topic (see > [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, > I get a stable read rate of 1,120,000 (1.12 mill) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, > I get a stable read rate of 632,000 (0.632 mil) rows per second. > This represents a *44% loss in performance*. Which is a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for > Spark 3 have been ongoing for over a year and it's difficult to pinpoint an > exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am > unfamiliar with the spark-sql-kafka-0-10 project.
> > A sample of the data my test reads (note: its not parsing csv - this is just > test data) > > 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine > Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) > [C],2017/12/15 00:00:00, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
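As a quick sanity check on the numbers reported above, the drop from 1,120,000 to 632,000 rows per second works out to roughly the 44% loss claimed:

```python
# Throughput figures quoted in the report.
rate_spark_245 = 1_120_000  # rows/sec on Spark 2.4.5
rate_spark_30x = 632_000    # rows/sec on Spark 3.0.0 / 3.0.1

loss = (rate_spark_245 - rate_spark_30x) / rate_spark_245
print(f"performance loss: {loss:.1%}")  # performance loss: 43.6%
```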
[jira] [Resolved] (SPARK-34004) Change FrameLessOffsetWindowFunction as sealed abstract class
[ https://issues.apache.org/jira/browse/SPARK-34004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34004. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31026 [https://github.com/apache/spark/pull/31026] > Change FrameLessOffsetWindowFunction as sealed abstract class > - > > Key: SPARK-34004 > URL: https://issues.apache.org/jira/browse/SPARK-34004 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.2.0 > > > Change FrameLessOffsetWindowFunction to a sealed abstract class in order to > simplify pattern matching. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34004) Change FrameLessOffsetWindowFunction as sealed abstract class
[ https://issues.apache.org/jira/browse/SPARK-34004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34004: - Assignee: jiaan.geng > Change FrameLessOffsetWindowFunction as sealed abstract class > - > > Key: SPARK-34004 > URL: https://issues.apache.org/jira/browse/SPARK-34004 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > Change FrameLessOffsetWindowFunction to a sealed abstract class in order to > simplify pattern matching. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33635: Assignee: Apache Spark > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5-node system. A simple data row of CSV data in Kafka, evenly distributed between the partitions. > OpenJDK 1.8.0_252 > Spark standalone - 5 nodes, 10 workers (2 workers per node, each locked to a distinct NUMA group) > Kafka (v2.3.1) cluster - 5 nodes (1 broker per node). > CentOS 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and they all exhibit the same performance degradation) >Reporter: David Wyles >Assignee: Apache Spark >Priority: Major > > I have observed a slowdown in reading data from Kafka on all of our systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible; it just reads all data from a Kafka topic (see [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 (1.12 million) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 632,000 (0.632 million) rows per second. > That represents a *44% loss in performance*. Which is, a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for Spark 3 have been ongoing for over a year and it's difficult to pinpoint an exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am unfamiliar with the spark-sql-kafka-0-10 project.
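The quoted 44% figure checks out against the reported rates; a quick arithmetic sanity check on the numbers above:

```python
rows_per_sec_245 = 1_120_000  # Spark 2.4.5, as reported
rows_per_sec_300 = 632_000    # Spark 3.0.0 / 3.0.1, as reported

loss = (rows_per_sec_245 - rows_per_sec_300) / rows_per_sec_245
print(f"{loss:.1%}")  # 43.6%, i.e. roughly the 44% quoted
```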
[jira] [Assigned] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33635: Assignee: (was: Apache Spark) > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5-node system. A simple data row of CSV data in Kafka, evenly distributed between the partitions. > OpenJDK 1.8.0_252 > Spark standalone - 5 nodes, 10 workers (2 workers per node, each locked to a distinct NUMA group) > Kafka (v2.3.1) cluster - 5 nodes (1 broker per node). > CentOS 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Major > > I have observed a slowdown in reading data from Kafka on all of our systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible; it just reads all data from a Kafka topic (see [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 (1.12 million) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 632,000 (0.632 million) rows per second. > That represents a *44% loss in performance*. Which is, a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for Spark 3 have been ongoing for over a year and it's difficult to pinpoint an exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am unfamiliar with the spark-sql-kafka-0-10 project.
[jira] [Commented] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259407#comment-17259407 ] Apache Spark commented on SPARK-33635: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/31056 > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5-node system. A simple data row of CSV data in Kafka, evenly distributed between the partitions. > OpenJDK 1.8.0_252 > Spark standalone - 5 nodes, 10 workers (2 workers per node, each locked to a distinct NUMA group) > Kafka (v2.3.1) cluster - 5 nodes (1 broker per node). > CentOS 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Major > > I have observed a slowdown in reading data from Kafka on all of our systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible; it just reads all data from a Kafka topic (see [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 (1.12 million) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 632,000 (0.632 million) rows per second. > That represents a *44% loss in performance*. Which is, a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for Spark 3 have been ongoing for over a year and it's difficult to pinpoint an exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am unfamiliar with the spark-sql-kafka-0-10 project.
[jira] [Commented] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259406#comment-17259406 ] Apache Spark commented on SPARK-33948: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/31055 > ExpressionEncoderSuite failed in > spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13 > --- > > Key: SPARK-33948 > URL: https://issues.apache.org/jira/browse/SPARK-33948 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.1.0 > Environment: * > >Reporter: Yang Jie >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink > > [ExpressionEncoderSuite|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__String__String___ArrayBuffer__a_b_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (codegen > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__String__String___ArrayBuffer__a_b_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (interpreted > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__String__String___ArrayBuffer__a_b_interpreted_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > 
for Tuple2: (ArrayBuffer[(Int, Int)],ArrayBuffer((1,2))) (codegen > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Int__Int___ArrayBuffer__1_2_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Int, Int)],ArrayBuffer((1,2))) (interpreted > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Int__Int___ArrayBuffer__1_2_interpreted_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Long, Long)],ArrayBuffer((1,2))) (codegen > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Long__Long___ArrayBuffer__1_2_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Long, Long)],ArrayBuffer((1,2))) (interpreted > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Long__Long___ArrayBuffer__1_2_interpreted_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Float, Float)],ArrayBuffer((1.0,2.0))) (codegen > 
path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Float__Float___ArrayBuffer__1_0_2_0_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Float, Float)],ArrayBuffer((1.0,2.0))) (interpreted > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Float__Float___ArrayBuffer__1_0_2_0_interpreted_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Double, Double)],ArrayBuffer((1.0,2.0))) (codegen >
[jira] [Commented] (SPARK-33100) Support parse the sql statements with c-style comments
[ https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259405#comment-17259405 ] Apache Spark commented on SPARK-33100: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/31054 > Support parse the sql statements with c-style comments > -- > > Key: SPARK-33100 > URL: https://issues.apache.org/jira/browse/SPARK-33100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Fix For: 3.1.0, 3.2.0 > > > Currently spark-sql does not support parsing SQL statements with C-style comments. > The SQL statements: > {code:java} > /* SELECT 'test'; */ > SELECT 'test'; > {code} > would be split into two statements: > The first: "/* SELECT 'test'" > The second: "*/ SELECT 'test'" > It would then throw an exception, because the first one is illegal.
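The failure mode above is a statement splitter that treats every `;` as a terminator, including those inside `/* ... */`. A minimal comment-aware splitter sketch (plain Python, not Spark's actual parser, and deliberately ignoring `;` inside string literals) shows the intended behaviour:

```python
def split_statements(sql: str) -> list:
    """Split on ';' while skipping over C-style /* ... */ comments."""
    statements, buf = [], []
    i, in_comment = 0, False
    while i < len(sql):
        pair = sql[i:i + 2]
        if in_comment:
            if pair == "*/":
                in_comment = False
                buf.append(pair)
                i += 2
                continue
            buf.append(sql[i])  # ';' inside a comment is just text
            i += 1
        else:
            if pair == "/*":
                in_comment = True
                buf.append(pair)
                i += 2
                continue
            if sql[i] == ";":
                statements.append("".join(buf).strip())
                buf = []
                i += 1
                continue
            buf.append(sql[i])
            i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements


# The ';' inside the comment no longer splits the statement:
print(split_statements("/* SELECT 'test'; */ SELECT 'test';"))
# ["/* SELECT 'test'; */ SELECT 'test'"]
```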
[jira] [Assigned] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33948: Assignee: Apache Spark > ExpressionEncoderSuite failed in > spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13 > --- > > Key: SPARK-33948 > URL: https://issues.apache.org/jira/browse/SPARK-33948 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.1.0 > Environment: * > >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink
[jira] [Commented] (SPARK-33100) Support parse the sql statements with c-style comments
[ https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259404#comment-17259404 ] Apache Spark commented on SPARK-33100: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/31054 > Support parse the sql statements with c-style comments > -- > > Key: SPARK-33100 > URL: https://issues.apache.org/jira/browse/SPARK-33100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Fix For: 3.1.0, 3.2.0 > > > Currently spark-sql does not support parsing SQL statements with C-style comments. > The SQL statements: > {code:java} > /* SELECT 'test'; */ > SELECT 'test'; > {code} > would be split into two statements: > The first: "/* SELECT 'test'" > The second: "*/ SELECT 'test'" > It would then throw an exception, because the first one is illegal.
[jira] [Assigned] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33948: Assignee: (was: Apache Spark) > ExpressionEncoderSuite failed in > spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13 > --- > > Key: SPARK-33948 > URL: https://issues.apache.org/jira/browse/SPARK-33948 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.1.0 > Environment: * > >Reporter: Yang Jie >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink
[jira] [Commented] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259403#comment-17259403 ] Apache Spark commented on SPARK-33948: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/31055 > ExpressionEncoderSuite failed in > spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13 > --- > > Key: SPARK-33948 > URL: https://issues.apache.org/jira/browse/SPARK-33948 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.1.0 > Environment: * > >Reporter: Yang Jie >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink
[jira] [Updated] (SPARK-34020) IndexOutOfBoundsException on merge of two pyspark frames
[ https://issues.apache.org/jira/browse/SPARK-34020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darshat updated SPARK-34020: Description: We are using Databricks on Azure, with Apache Spark 3.0.0 and Scala 2.12. When two tables are joined - one with 36 million rows, the other with 4k rows - we get an IndexOutOfBoundsException with Arrow on the call stack. The cluster has 72 nodes and 288 cores. Workers have 16 GB memory overall. spark.sql.shuffle.partitions is set to 288. The join key has an uneven distribution; we also tried repartitioning into 1000 partitions of the join key using repartition, but this results in the same error. Any pointers on what could be causing this issue would be very helpful. Thanks, Darshat {{21/01/06 04:05:06 ERROR ArrowPythonRunner: Python worker exited unexpectedly (crashed)}} {{org.apache.spark.api.python.PythonException: Traceback (most recent call last):}} {{ File "/databricks/spark/python/pyspark/worker.py", line 640, in main}} {{ eval_type = read_int(infile)}} {{ File "/databricks/spark/python/pyspark/serializers.py", line 603, in read_int}} {{ raise EOFError}} {{EOFError}}{{at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:585)}} {{ at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99)}} {{ at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)}} {{ at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:538)}} {{ at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)}} {{ at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)}} {{ at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)}} {{ at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage16.processNext(Unknown Source)}} {{ at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)}} {{ at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)}} {{ at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)}} {{ at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)}} {{ at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)}} {{ at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)}} {{ at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)}} {{ at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)}} {{ at org.apache.spark.scheduler.Task.run(Task.scala:117)}} {{ at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)}} {{ at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)}} {{ at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)}} {{ at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}} {{ at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}} {{ at java.lang.Thread.run(Thread.java:748)}} {{Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))}} {{ at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)}} {{ at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)}} {{ at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)}} {{ at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)}} {{ at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)}} {{ at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:278)}} {{ at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:139)}} {{ at 
org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:93)}} {{ at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)}} {{ at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)}} {{ at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)}} {{ at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)}} {{ at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:465)}} {{ at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2124)}} {{ at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:257)}} {{21/01/06 04:05:06 ERROR ArrowPythonRunner: This may have been caused by a prior exception:}} {{java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))}} {{ at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)}} {{ at
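The failing length in the trace above is exactly 1 GiB (1073741824 bytes), which suggests a buffer-growth overflow on a heavily skewed partition rather than bad input data. A minimal sketch of that hypothesis (an assumption for illustration, not a confirmed diagnosis of this ticket):

```python
# Hypothetical failure-mode sketch: Arrow's variable-width vectors grow by
# doubling, and doubling the 1 GiB buffer from the exception overflows the
# signed 32-bit range that 32-bit buffer index arithmetic assumes.
GIB = 1 << 30                # 1073741824, the length reported in the exception
INT32_MAX = (1 << 31) - 1    # 2147483647

doubled = GIB * 2
print(doubled > INT32_MAX)   # True: the doubled buffer no longer fits in int32
```

If that is what happens here, capping the rows per Arrow batch (spark.sql.execution.arrow.maxRecordsPerBatch) or salting the skewed join key so no single task materializes an oversized batch may be worth trying.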
[jira] [Commented] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259401#comment-17259401 ] Apache Spark commented on SPARK-32165: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31053 > SessionState leaks SparkListener with multiple SparkSession > --- > > Key: SPARK-32165 > URL: https://issues.apache.org/jira/browse/SPARK-32165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Priority: Major > > Copied from > [https://github.com/apache/spark/pull/28128#issuecomment-653102770] > > {code:java} > test("SPARK-31354: SparkContext only register one SparkSession > ApplicationEnd listener") { > val conf = new SparkConf() > .setMaster("local") > .setAppName("test-app-SPARK-31354-1") > val context = new SparkContext(conf) > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postFirstCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postSecondCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > assert(postFirstCreation == postSecondCreation) > } > {code} > The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32165: Assignee: Apache Spark > SessionState leaks SparkListener with multiple SparkSession > --- > > Key: SPARK-32165 > URL: https://issues.apache.org/jira/browse/SPARK-32165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Assignee: Apache Spark >Priority: Major > > Copied from > [https://github.com/apache/spark/pull/28128#issuecomment-653102770] > > {code:java} > test("SPARK-31354: SparkContext only register one SparkSession > ApplicationEnd listener") { > val conf = new SparkConf() > .setMaster("local") > .setAppName("test-app-SPARK-31354-1") > val context = new SparkContext(conf) > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postFirstCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postSecondCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > assert(postFirstCreation == postSecondCreation) > } > {code} > The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32165: Assignee: (was: Apache Spark) > SessionState leaks SparkListener with multiple SparkSession > --- > > Key: SPARK-32165 > URL: https://issues.apache.org/jira/browse/SPARK-32165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Priority: Major > > Copied from > [https://github.com/apache/spark/pull/28128#issuecomment-653102770] > > {code:java} > test("SPARK-31354: SparkContext only register one SparkSession > ApplicationEnd listener") { > val conf = new SparkConf() > .setMaster("local") > .setAppName("test-app-SPARK-31354-1") > val context = new SparkContext(conf) > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postFirstCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postSecondCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > assert(postFirstCreation == postSecondCreation) > } > {code} > The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
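The Scala test quoted above boils down to a simple pattern: each new session registers a listener on the shared context's bus and nothing ever removes it. A minimal Python analogy (illustrative only, not Spark code) shows why the final assertion fails:

```python
# Minimal analogy of the SessionState listener leak: every session appends a
# listener to the shared bus on creation and never unregisters it, so the
# listener count keeps growing across sessions.
class ListenerBus:
    def __init__(self):
        self.listeners = []

class Session:
    def __init__(self, bus):
        # Leak: registered on creation, never removed when the session is cleared.
        bus.listeners.append(object())

bus = ListenerBus()
Session(bus)
post_first_creation = len(bus.listeners)
Session(bus)
post_second_creation = len(bus.listeners)
print(post_first_creation == post_second_creation)  # False: one listener leaked
```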
[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259399#comment-17259399 ] Jungtaek Lim edited comment on SPARK-33635 at 1/6/21, 4:18 AM: --- I've spent some time to trace the issue, and noticed SPARK-29054 (+SPARK-30495) caused performance regression (though the patch itself is doing the right thing). {code} private[kafka010] def getOrRetrieveConsumer(): InternalKafkaConsumer = { if (!_consumer.isDefined) { retrieveConsumer() } require(_consumer.isDefined, "Consumer must be defined") if (KafkaTokenUtil.needTokenUpdate(SparkEnv.get.conf, _consumer.get.kafkaParamsWithSecurity, _consumer.get.clusterConfig)) { logDebug("Cached consumer uses an old delegation token, invalidating.") releaseConsumer() consumerPool.invalidateKey(cacheKey) fetchedDataPool.invalidate(cacheKey) retrieveConsumer() } _consumer.get } {code} {code} def needTokenUpdate( sparkConf: SparkConf, params: ju.Map[String, Object], clusterConfig: Option[KafkaTokenClusterConf]): Boolean = { if (HadoopDelegationTokenManager.isServiceEnabled(sparkConf, "kafka") && clusterConfig.isDefined && params.containsKey(SaslConfigs.SASL_JAAS_CONFIG)) { logDebug("Delegation token used by connector, checking if uses the latest token.") val connectorJaasParams = params.get(SaslConfigs.SASL_JAAS_CONFIG).asInstanceOf[String] getTokenJaasParams(clusterConfig.get) != connectorJaasParams } else { false } } {code} {code} def isServiceEnabled(sparkConf: SparkConf, serviceName: String): Boolean = { val key = providerEnabledConfig.format(serviceName) deprecatedProviderEnabledConfigs.foreach { pattern => val deprecatedKey = pattern.format(serviceName) if (sparkConf.contains(deprecatedKey)) { logWarning(s"${deprecatedKey} is deprecated. 
Please use ${key} instead.") } } val isEnabledDeprecated = deprecatedProviderEnabledConfigs.forall { pattern => sparkConf .getOption(pattern.format(serviceName)) .map(_.toBoolean) .getOrElse(true) } sparkConf .getOption(key) .map(_.toBoolean) .getOrElse(isEnabledDeprecated) } {code} With my test data and default config, Spark pulled 500 records per a poll from Kafka, which ended up "10,280,000" calls to get() which always calls getOrRetrieveConsumer(). A single call of KafkaTokenUtil.needTokenUpdate() wouldn't add significant overhead, but 10,000,000 calls make a significant difference. Assuming the case where delegation token is not applied, HadoopDelegationTokenManager.isServiceEnabled is the culprit on such huge overhead. We could probably resolve the issue via short-term solution & long-term solution. * short-term solution: change the order of check in needTokenUpdate, so that the performance hit is only affected when using delegation token. I'll raise a PR shortly. * long-term solution(s): 1) optimize HadoopDelegationTokenManager.isServiceEnabled 2) find a way to reduce the occurrence of checking necessarily of token update. Note that even with short-term solution, a slight performance hit is observed as it still does more things on the code path compared to Spark 2.4. Though I'd ignore it if it affects slightly, like less than 1%, or even slightly higher but the code addition is mandatory. was (Author: kabhwan): I've spent some time to trace the issue, and noticed SPARK-29054 (+SPARK-30495) caused performance regression (though the patch itself is doing the right thing). 
{code} private[kafka010] def getOrRetrieveConsumer(): InternalKafkaConsumer = { if (!_consumer.isDefined) { retrieveConsumer() } require(_consumer.isDefined, "Consumer must be defined") if (KafkaTokenUtil.needTokenUpdate(SparkEnv.get.conf, _consumer.get.kafkaParamsWithSecurity, _consumer.get.clusterConfig)) { logDebug("Cached consumer uses an old delegation token, invalidating.") releaseConsumer() consumerPool.invalidateKey(cacheKey) fetchedDataPool.invalidate(cacheKey) retrieveConsumer() } _consumer.get } {code} {code} def needTokenUpdate( sparkConf: SparkConf, params: ju.Map[String, Object], clusterConfig: Option[KafkaTokenClusterConf]): Boolean = { if (HadoopDelegationTokenManager.isServiceEnabled(sparkConf, "kafka") && clusterConfig.isDefined && params.containsKey(SaslConfigs.SASL_JAAS_CONFIG)) { logDebug("Delegation token used by connector, checking if uses the latest token.") val connectorJaasParams = params.get(SaslConfigs.SASL_JAAS_CONFIG).asInstanceOf[String] getTokenJaasParams(clusterConfig.get) != connectorJaasParams } else { false } } {code} {code} def isServiceEnabled(sparkConf: SparkConf,
[jira] [Updated] (SPARK-34020) IndexOutOfBoundsException on merge of two pyspark frames
[ https://issues.apache.org/jira/browse/SPARK-34020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darshat updated SPARK-34020: Affects Version/s: (was: 3.0.1) 3.0.0 > IndexOutOfBoundsException on merge of two pyspark frames > > > Key: SPARK-34020 > URL: https://issues.apache.org/jira/browse/SPARK-34020 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Darshat >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34020) IndexOutOfBoundsException on merge of two pyspark frames
Darshat created SPARK-34020: --- Summary: IndexOutOfBoundsException on merge of two pyspark frames Key: SPARK-34020 URL: https://issues.apache.org/jira/browse/SPARK-34020 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.1 Reporter: Darshat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259399#comment-17259399 ] Jungtaek Lim commented on SPARK-33635: -- I've spent some time to trace the issue, and noticed SPARK-29054 (+SPARK-30495) caused performance regression (though the patch itself is doing the right thing). {code} private[kafka010] def getOrRetrieveConsumer(): InternalKafkaConsumer = { if (!_consumer.isDefined) { retrieveConsumer() } require(_consumer.isDefined, "Consumer must be defined") if (KafkaTokenUtil.needTokenUpdate(SparkEnv.get.conf, _consumer.get.kafkaParamsWithSecurity, _consumer.get.clusterConfig)) { logDebug("Cached consumer uses an old delegation token, invalidating.") releaseConsumer() consumerPool.invalidateKey(cacheKey) fetchedDataPool.invalidate(cacheKey) retrieveConsumer() } _consumer.get } {code} {code} def needTokenUpdate( sparkConf: SparkConf, params: ju.Map[String, Object], clusterConfig: Option[KafkaTokenClusterConf]): Boolean = { if (HadoopDelegationTokenManager.isServiceEnabled(sparkConf, "kafka") && clusterConfig.isDefined && params.containsKey(SaslConfigs.SASL_JAAS_CONFIG)) { logDebug("Delegation token used by connector, checking if uses the latest token.") val connectorJaasParams = params.get(SaslConfigs.SASL_JAAS_CONFIG).asInstanceOf[String] getTokenJaasParams(clusterConfig.get) != connectorJaasParams } else { false } } {code} {code} def isServiceEnabled(sparkConf: SparkConf, serviceName: String): Boolean = { val key = providerEnabledConfig.format(serviceName) deprecatedProviderEnabledConfigs.foreach { pattern => val deprecatedKey = pattern.format(serviceName) if (sparkConf.contains(deprecatedKey)) { logWarning(s"${deprecatedKey} is deprecated. 
Please use ${key} instead.") } } val isEnabledDeprecated = deprecatedProviderEnabledConfigs.forall { pattern => sparkConf .getOption(pattern.format(serviceName)) .map(_.toBoolean) .getOrElse(true) } sparkConf .getOption(key) .map(_.toBoolean) .getOrElse(isEnabledDeprecated) } {code} With my test data creator, Spark pulled 500 records per a poll from Kafka, which ended up "10,280,000" calls to get() which always calls getOrRetrieveConsumer(). A single call of KafkaTokenUtil.needTokenUpdate() wouldn't add significant overhead, but 10,000,000 calls make a significant difference. Assuming the case where delegation token is not applied, HadoopDelegationTokenManager.isServiceEnabled is the culprit on such huge overhead. We could probably resolve the issue via short-term solution & long-term solution. * short-term solution: change the order of check in needTokenUpdate, so that the performance hit is only affected when using delegation token. I'll raise a PR shortly. * long-term solution(s): 1) optimize HadoopDelegationTokenManager.isServiceEnabled 2) find a way to reduce the occurrence of checking necessarily of token update. Note that even with short-term solution, a slight performance hit is observed as it still does more things on the code path compared to Spark 2.4. Though I'd ignore it if it affects slightly, like less than 1%, or even slightly higher but the code addition is mandatory. > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5 node system. A simple data row of csv data in > kafka, evenly distributed between the partitions. > Open JDK 1.8.0.252 > Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to > a distinct NUMA group) > kafka (v 2.3.1) cluster - 5 nodes (1 broker per node). 
> Centos 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and > they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Major > > I have observed a slowdown in the reading of data from kafka on all of our > systems when migrating from spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible, > with just a read of all data from a kafka topic (see > [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, > I get a stable read rate of 1,120,000 (1.12 mill) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, > I get a stable read rate of 632,000 (0.632 mil) rows per second. > That represents a *44% loss in performance*, which is a lot. > I have been
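The scale argument in Jungtaek Lim's comment above can be made concrete with a rough cost model (the per-call cost below is an assumed figure for illustration; only the call count comes from the comment):

```python
# Back-of-envelope overhead estimate for the per-record token check.
calls = 10_280_000          # get() invocations reported in the test run
cost_per_call_us = 1.0      # assumed cost of one needTokenUpdate() check, in µs

overhead_s = calls * cost_per_call_us / 1_000_000
print(f"~{overhead_s:.2f} s of added work")  # ~10.28 s at 1 µs per call
```

Even a sub-microsecond check becomes visible at ten million calls, which is why the proposed short-term fix reorders the checks so the expensive isServiceEnabled lookup only runs when a delegation token is actually configured.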
[jira] [Assigned] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33541: Assignee: Apache Spark > Group exception messages in catalyst/expressions > > > Key: SPARK-33541 > URL: https://issues.apache.org/jira/browse/SPARK-33541 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions' > || Filename || Count || > | Cast.scala| 18 | > | ExprUtils.scala | 2 | > | Expression.scala | 8 | > | InterpretedUnsafeProjection.scala | 1 | > | ScalaUDF.scala| 2 | > | SelectedField.scala | 3 | > | SubExprEvaluationRuntime.scala| 1 | > | arithmetic.scala | 8 | > | collectionOperations.scala| 4 | > | complexTypeExtractors.scala | 3 | > | csvExpressions.scala | 3 | > | datetimeExpressions.scala | 4 | > | decimalExpressions.scala | 2 | > | generators.scala | 2 | > | higherOrderFunctions.scala| 6 | > | jsonExpressions.scala | 2 | > | literals.scala| 3 | > | misc.scala| 2 | > | namedExpressions.scala| 1 | > | ordering.scala| 1 | > | package.scala | 1 | > | regexpExpressions.scala | 1 | > | stringExpressions.scala | 1 | > | windowExpressions.scala | 5 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate' > || Filename|| Count || > | ApproximatePercentile.scala | 2 | > | HyperLogLogPlusPlus.scala | 1 | > | Percentile.scala| 1 | > | interfaces.scala| 2 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen' > || Filename|| Count || > | CodeGenerator.scala | 5 | > | javaCode.scala | 1 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects' > || Filename || Count || > | objects.scala | 12 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33541: Assignee: (was: Apache Spark) > Group exception messages in catalyst/expressions > > > Key: SPARK-33541 > URL: https://issues.apache.org/jira/browse/SPARK-33541 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions' > || Filename || Count || > | Cast.scala| 18 | > | ExprUtils.scala | 2 | > | Expression.scala | 8 | > | InterpretedUnsafeProjection.scala | 1 | > | ScalaUDF.scala| 2 | > | SelectedField.scala | 3 | > | SubExprEvaluationRuntime.scala| 1 | > | arithmetic.scala | 8 | > | collectionOperations.scala| 4 | > | complexTypeExtractors.scala | 3 | > | csvExpressions.scala | 3 | > | datetimeExpressions.scala | 4 | > | decimalExpressions.scala | 2 | > | generators.scala | 2 | > | higherOrderFunctions.scala| 6 | > | jsonExpressions.scala | 2 | > | literals.scala| 3 | > | misc.scala| 2 | > | namedExpressions.scala| 1 | > | ordering.scala| 1 | > | package.scala | 1 | > | regexpExpressions.scala | 1 | > | stringExpressions.scala | 1 | > | windowExpressions.scala | 5 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate' > || Filename|| Count || > | ApproximatePercentile.scala | 2 | > | HyperLogLogPlusPlus.scala | 1 | > | Percentile.scala| 1 | > | interfaces.scala| 2 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen' > || Filename|| Count || > | CodeGenerator.scala | 5 | > | javaCode.scala | 1 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects' > || Filename || Count || > | objects.scala | 12 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259398#comment-17259398 ] Apache Spark commented on SPARK-33541: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/31052 > Group exception messages in catalyst/expressions > > > Key: SPARK-33541 > URL: https://issues.apache.org/jira/browse/SPARK-33541 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions' > || Filename || Count || > | Cast.scala| 18 | > | ExprUtils.scala | 2 | > | Expression.scala | 8 | > | InterpretedUnsafeProjection.scala | 1 | > | ScalaUDF.scala| 2 | > | SelectedField.scala | 3 | > | SubExprEvaluationRuntime.scala| 1 | > | arithmetic.scala | 8 | > | collectionOperations.scala| 4 | > | complexTypeExtractors.scala | 3 | > | csvExpressions.scala | 3 | > | datetimeExpressions.scala | 4 | > | decimalExpressions.scala | 2 | > | generators.scala | 2 | > | higherOrderFunctions.scala| 6 | > | jsonExpressions.scala | 2 | > | literals.scala| 3 | > | misc.scala| 2 | > | namedExpressions.scala| 1 | > | ordering.scala| 1 | > | package.scala | 1 | > | regexpExpressions.scala | 1 | > | stringExpressions.scala | 1 | > | windowExpressions.scala | 5 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate' > || Filename|| Count || > | ApproximatePercentile.scala | 2 | > | HyperLogLogPlusPlus.scala | 1 | > | Percentile.scala| 1 | > | interfaces.scala| 2 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen' > || Filename|| Count || > | CodeGenerator.scala | 5 | > | javaCode.scala | 1 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects' > || Filename || Count || > | objects.scala | 12 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32685) Script transform hive serde default field.delimit is '\t'
[ https://issues.apache.org/jira/browse/SPARK-32685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259395#comment-17259395 ] Apache Spark commented on SPARK-32685: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31051 > Script transform hive serde default field.delimit is '\t' > - > > Key: SPARK-32685 > URL: https://issues.apache.org/jira/browse/SPARK-32685 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > > {code:java} > select split(value, "\t") from ( > SELECT TRANSFORM(a, b, c, null) > USING 'cat' > FROM (select 1 as a, 2 as b, 3 as c) t > ) temp; > result is : > _c0 > ["2","3","\\N"]{code} > > {code:java} > select split(value, "\t") from ( > SELECT TRANSFORM(a, b, c, null) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > USING 'cat' > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > WITH SERDEPROPERTIES ( >'serialization.last.column.takes.rest' = 'true' > ) > FROM (select 1 as a, 2 as b, 3 as c) t > ) temp; > result is : > _c0 > ["2","3","\\N"]{code} > > > > {code:java} > select split(value, "\t") from ( > SELECT TRANSFORM(a, b, c, null) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > USING 'cat' > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > FROM (select 1 as a, 2 as b, 3 as c) t > ) temp; > result is : > _c0 > ["2"] > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33982) Sparksql does not support when the inserted table is a read table
[ https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33982. -- Resolution: Duplicate > Sparksql does not support when the inserted table is a read table > - > > Key: SPARK-33982 > URL: https://issues.apache.org/jira/browse/SPARK-33982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: hao >Priority: Major > > When the inserted table is a read table, sparksql will throw an error - > > org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is > also being read from.; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33982) Sparksql does not support when the inserted table is a read table
[ https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259391#comment-17259391 ] hao commented on SPARK-33982: - [~hyukjin.kwon] Oh, so you can read Chinese! Yes, this is the same issue >.< > Sparksql does not support when the inserted table is a read table > - > > Key: SPARK-33982 > URL: https://issues.apache.org/jira/browse/SPARK-33982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: hao >Priority: Major > > When the inserted table is a read table, sparksql will throw an error - > > org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is > also being read from.; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] maple edited comment on SPARK-33981 at 1/6/21, 3:20 AM: I tried it in Spark 2.4.7; the issue still exists. !image-2021-01-06-11-20-49-804.png! !screenshot-1.png!!image-2021-01-06-11-19-09-849.png! was (Author: 995582386): I try in Spark 2.4.7,it still exists. !screenshot-1.png!!image-2021-01-06-11-19-09-849.png! > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png, > image-2021-01-06-11-19-09-849.png, image-2021-01-06-11-20-49-804.png, > screenshot-1.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] maple commented on SPARK-33981: --- I tried it in Spark 2.4.7; the issue still exists. !screenshot-1.png!!image-2021-01-06-11-19-09-849.png! > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png, > image-2021-01-06-11-19-09-849.png, screenshot-1.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] maple updated SPARK-33981: -- Attachment: image-2021-01-06-11-19-09-849.png
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] maple updated SPARK-33981: -- Attachment: screenshot-1.png
[jira] [Resolved] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted
[ https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33029. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30954 [https://github.com/apache/spark/pull/30954] > Standalone mode blacklist executors page UI marks driver as blacklisted > --- > > Key: SPARK-33029 > URL: https://issues.apache.org/jira/browse/SPARK-33029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot > 2020-09-29 at 1.53.37 PM.png > > > I am running a Spark shell on a one-node standalone cluster. I noticed that > the Executors page UI was marking the driver as blacklisted for the running > stage. A screenshot is attached. > Also, in my case one of the executors died and it doesn't seem like the > scheduler picked up the new one. It doesn't show up on the Stages page; it > just shows as active, but none of the tasks ran there. > > You can reproduce this by starting a master and slave on a single node, then > launching a shell where you will get multiple executors (in this case I got > 3): > $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 > --conf spark.blacklist.enabled=true > > From the shell, run: > {code:java} > import org.apache.spark.TaskContext > val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it => > val context = TaskContext.get() > if (context.attemptNumber() < 2) { > throw new Exception("test attempt num") > } > it > }{code}
[jira] [Assigned] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted
[ https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33029: - Assignee: Baohe Zhang
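The Scala repro above forces the first two attempts of every task to fail so the blacklist tracker kicks in. For illustration, the same attempt-counting pattern can be sketched in plain Python (a toy retry loop with made-up names, not Spark's scheduler):

```python
# Toy model of Spark's per-task attempt counter: a task body that
# raises until its attempt number reaches 2, plus a retry loop.
# Illustration only; Spark reschedules failed attempts across executors.

def flaky_task(attempt_number: int, partition):
    # Mirrors `if (context.attemptNumber() < 2) throw ...` in the repro.
    if attempt_number < 2:
        raise RuntimeError("test attempt num")
    return list(partition)

def run_with_retries(partition, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            return flaky_task(attempt, partition), attempt
        except RuntimeError:
            continue  # the scheduler would retry, possibly elsewhere
    raise RuntimeError("task failed permanently")

result, attempts_used = run_with_retries(range(1, 6))
# Succeeds on attempt number 2, i.e. the third try.
```

With blacklisting enabled, Spark additionally tracks which executors the failed attempts ran on, which is where the driver was being mislabeled in this bug.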
[jira] [Resolved] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33986. -- Resolution: Invalid > Spark handle always return LOST status in standalone cluster mode with Spark > launcher > - > > Key: SPARK-33986 > URL: https://issues.apache.org/jira/browse/SPARK-33986 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 2.4.4 > Environment: apache hadoop 2.6.5 > apache spark 2.4.4 >Reporter: ZhongyuWang >Priority: Major > > I can use the launcher to submit a Spark app successfully in standalone > client / YARN client / YARN cluster mode and get the correct app status, but > when I submit a Spark app in standalone cluster mode, the Spark handle always > returns LOST status (once) while the app runs stably until FINISHED (the > handle never receives any state change information). I noticed that when I > submitted the app from code, after a while the SparkSubmit process suddenly > stopped. I checked the SparkSubmit log (launcher redirect log); it doesn't > have any useful information. > This is my pseudo code: > {code:java} > SparkAppHandle handle = launcher.startApplication(new > SparkAppHandle.Listener() { > @Override > public void stateChanged(SparkAppHandle handle) { > stateChangedHandle(handle.getAppId(), jobId, code, execId, > handle.getState(), driverInfo, request, infoLog, errorLog); > } > @Override > public void infoChanged(SparkAppHandle handle) { > stateChangedHandle(handle.getAppId(), jobId, code, execId, > handle.getState(), driverInfo, request, infoLog, errorLog); > } > });{code} > Any idea? Thanks.
[jira] [Commented] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259370#comment-17259370 ] Hyukjin Kwon commented on SPARK-33986: -- Then it sounds like an environment problem. For a question, I would encourage you to ask it on the mailing list (https://spark.apache.org/community.html) first before filing it as an issue.
[jira] [Commented] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259369#comment-17259369 ] Apache Spark commented on SPARK-34012: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31050 > Keep behavior consistent when conf > `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration > guide > - > > Key: SPARK-34012 > URL: https://issues.apache.org/jira/browse/SPARK-34012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0, 3.2.0 > > > According to > [https://github.com/apache/spark/pull/29087#issuecomment-754389257], > fix the bug and add a UT.
[jira] [Commented] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259368#comment-17259368 ] Apache Spark commented on SPARK-34012: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31050
[jira] [Resolved] (SPARK-34015) SparkR partition timing summary reports input time correctly
[ https://issues.apache.org/jira/browse/SPARK-34015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34015. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/31021 > SparkR partition timing summary reports input time correctly > > > Key: SPARK-34015 > URL: https://issues.apache.org/jira/browse/SPARK-34015 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.2, 3.0.1 > Environment: Observed on CentOS-7 running Spark 2.3.1 and on my Mac > running master >Reporter: Tom Howland >Assignee: Tom Howland >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Original Estimate: 0h > Remaining Estimate: 0h > > When SparkR is run at log level INFO, a summary of how the worker spent its > time processing the partition is printed. There is a logic error where it > over-reports the time spent inputting rows. > In detail: the variable inputElap is used in a wider context to mark the > beginning of reading rows, but in the part changed here it was also used as a > local variable for measuring compute time. Thus, the error is not observable > if there is only one group per partition, which is what you get in unit > tests. > For our application, here's what a log entry looked like before these changes > were applied: > {{20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s, > broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, > write-output = 0.020 s, total = 1021.546 s}} > This indicates that we're spending more time reading rows than operating on > them. > After these changes, it looks like this: > {{20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s, > broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, > write-output = 0.045 s, total = 1812.553 s}}
[jira] [Assigned] (SPARK-34015) SparkR partition timing summary reports input time correctly
[ https://issues.apache.org/jira/browse/SPARK-34015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34015: Assignee: Tom Howland
[jira] [Updated] (SPARK-34015) SparkR partition timing summary reports input time correctly
[ https://issues.apache.org/jira/browse/SPARK-34015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34015: - Fix Version/s: 3.1.1 3.2.0
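The SPARK-34015 description above says the bug came from reusing the input-start marker as a per-group compute timer, and that it is invisible with a single group per partition. That logic error can be reconstructed with a toy fake-clock accumulator in Python (illustrative names mirroring inputElap; this is not SparkR's actual worker code):

```python
# Toy reconstruction of the reported timing bug: input_start marks the
# beginning of reading rows, but the buggy variant reuses the compute
# timer's start as the next group's input start, mis-attributing each
# group's compute time (after the first) to read-input.

def process(groups, buggy):
    clock = [0.0]                        # fake clock, advanced manually
    now = lambda: clock[0]

    input_time = 0.0
    compute_time = 0.0
    input_start = now()
    for read_cost, compute_cost in groups:
        clock[0] += read_cost            # reading this group's rows
        input_time += now() - input_start
        compute_start = now()
        clock[0] += compute_cost         # computing on the group
        compute_time += now() - compute_start
        if buggy:
            input_start = compute_start  # wrong: compute time leaks into input
        else:
            input_start = now()          # correct: next read starts after compute
    return input_time, compute_time

two_groups = [(1.0, 10.0), (1.0, 10.0)]
one_group = [(1.0, 10.0)]
# With two groups the buggy variant reports read-input = 12.0 instead
# of 2.0; with one group both variants agree, matching the report that
# the error is not observable in unit tests with one group.
```

This mirrors the before/after log entries quoted in the issue, where read-input dropped from 529 s to 120 s once the fix landed.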
[jira] [Updated] (SPARK-34011) ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34011: - Fix Version/s: 3.2.0 > ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache > > > Key: SPARK-34011 > URL: https://issues.apache.org/jira/browse/SPARK-34011 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > Fix For: 3.2.0 > > > Here is an example to reproduce the issue: > {code:sql} > spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED > BY (part0); > spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0; > spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1; > spark-sql> CACHE TABLE tbl1; > spark-sql> SELECT * FROM tbl1; > 0 0 > 1 1 > spark-sql> ALTER TABLE tbl1 PARTITION (part0 = 0) RENAME TO PARTITION (part0 > = 2); > spark-sql> SELECT * FROM tbl1; > 0 0 > 1 1 > {code}
[jira] [Resolved] (SPARK-34011) ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34011. -- Resolution: Fixed
[jira] [Updated] (SPARK-34011) ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34011: - Fix Version/s: (was: 3.2.0) (was: 3.0.2) (was: 3.1.0)
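The SPARK-34011 repro shows a cached table still answering queries from pre-rename data. The failure pattern, and why the rename must invalidate the cached entry, can be sketched with a toy partitioned table and naive query cache in Python (hypothetical classes; this is not Spark's cache manager):

```python
# Toy partitioned "table" with a naive full-scan cache. Renaming a
# partition without invalidating the cache reproduces the stale read
# from the repro; invalidating on rename is what the fix amounts to.

class ToyTable:
    def __init__(self):
        self.partitions = {}   # part0 value -> list of col0 rows
        self._cache = None     # cached full-scan result

    def insert(self, part0, rows):
        self.partitions[part0] = rows

    def scan(self):
        # SELECT * FROM tbl: (col0, part0) pairs, served from cache.
        if self._cache is None:
            self._cache = [(row, p)
                           for p, rows in sorted(self.partitions.items())
                           for row in rows]
        return self._cache

    def rename_partition(self, old, new, refresh_cache):
        self.partitions[new] = self.partitions.pop(old)
        if refresh_cache:
            self._cache = None  # the behavior the fix adds

def demo(refresh_cache):
    t = ToyTable()
    t.insert(0, [0])
    t.insert(1, [1])
    t.scan()  # CACHE TABLE: populate the cache
    t.rename_partition(0, 2, refresh_cache)
    return t.scan()

stale = demo(refresh_cache=False)   # still reports part0 = 0
fresh = demo(refresh_cache=True)    # reflects the rename to part0 = 2
```

In the buggy variant the scan after the rename is identical to the one before it, exactly as in the `SELECT * FROM tbl1` output quoted in the issue.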
[jira] [Commented] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259355#comment-17259355 ] Apache Spark commented on SPARK-34012: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31049
[jira] [Updated] (SPARK-34002) Broken UDF Encoding
[ https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34002: - Component/s: (was: Spark Core) SQL > Broken UDF Encoding > --- > > Key: SPARK-34002 > URL: https://issues.apache.org/jira/browse/SPARK-34002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 > Environment: Windows 10 Surface book 3 > Local Spark > IntelliJ Idea > >Reporter: Mark Hamilton >Priority: Major > > UDFs can behave differently depending on whether a dataframe is cached, > despite the dataframe being identical > > Repro: > > {code:java} > import org.apache.spark.sql.expressions.UserDefinedFunction > import org.apache.spark.sql.functions.{col, udf} > case class Bar(a: Int) > > import spark.implicits._ > def f1(bar: Bar): Option[Bar] = { > None > } > def f2(bar: Bar): Option[Bar] = { > Option(bar) > } > val udf1: UserDefinedFunction = udf(f1 _) > val udf2: UserDefinedFunction = udf(f2 _) > // Commenting in the cache will make this example work > val df = (1 to 10).map(i => Tuple1(Bar(1))).toDF("c0")//.cache() > val newDf = df > .withColumn("c1", udf1(col("c0"))) > .withColumn("c2", udf2(col("c1"))) > newDf.show() > {code} > > Error: > (truncated IntelliJ test-runner launch log; the java command line and > classpath listing carry no diagnostic information)
[jira] [Issue Comment Deleted] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhongyuWang updated SPARK-33986: Comment: was deleted (was: Yes, other applications have the same scenario (in cluster mode, after a while, the SparkSubmit process suddenly stopped). When I try to submit the SparkPi example with the spark-submit shell, the SparkSubmit process also stops after submitting the job to the Spark cluster (the job is not supervised).)
[jira] [Commented] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259348#comment-17259348 ] ZhongyuWang commented on SPARK-33986: - Yes, other applications have the same scenario (in cluster mode, after a while, the SparkSubmit process suddenly stopped). When I try to submit the SparkPi example with the spark-submit shell, the SparkSubmit process also stops after submitting the job to the Spark cluster (the job is not supervised).
[jira] [Commented] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259346#comment-17259346 ] ZhongyuWang commented on SPARK-33986: - Yes, other applications have the same scenario (in cluster mode, after a while, the SparkSubmit process suddenly stopped). When I try to submit the SparkPi example with the spark-submit shell, the SparkSubmit process also stops after submitting the job to the Spark cluster (the job is not supervised).
[jira] [Commented] (SPARK-34019) Keep same quantiles of UI and restful API
[ https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259344#comment-17259344 ] Apache Spark commented on SPARK-34019: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31048 > Keep same quantiles of UI and restful API > - > > Key: SPARK-34019 > URL: https://issues.apache.org/jira/browse/SPARK-34019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34019) Keep same quantiles of UI and restful API
[ https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34019: Assignee: (was: Apache Spark) > Keep same quantiles of UI and restful API > - > > Key: SPARK-34019 > URL: https://issues.apache.org/jira/browse/SPARK-34019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34019) Keep same quantiles of UI and restful API
[ https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259343#comment-17259343 ] Apache Spark commented on SPARK-34019: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31048 > Keep same quantiles of UI and restful API > - > > Key: SPARK-34019 > URL: https://issues.apache.org/jira/browse/SPARK-34019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34019) Keep same quantiles of UI and restful API
[ https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34019: Assignee: Apache Spark > Keep same quantiles of UI and restful API > - > > Key: SPARK-34019 > URL: https://issues.apache.org/jira/browse/SPARK-34019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34019) Keep same quantiles of UI and restful API
angerszhu created SPARK-34019: - Summary: Keep same quantiles of UI and restful API Key: SPARK-34019 URL: https://issues.apache.org/jira/browse/SPARK-34019 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259338#comment-17259338 ] Hyukjin Kwon commented on SPARK-33986: -- Do other applications work fine in cluster modes? > Spark handle always return LOST status in standalone cluster mode with Spark > launcher > - > > Key: SPARK-33986 > URL: https://issues.apache.org/jira/browse/SPARK-33986 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 2.4.4 > Environment: apache hadoop 2.6.5 > apache spark 2.4.4 >Reporter: ZhongyuWang >Priority: Major > > I can submit a Spark app successfully in standalone client / YARN client / YARN cluster mode and get the correct app status, but when I submit a Spark app in standalone cluster mode, the Spark handle always returns LOST status (once) and the app runs stably until FINISHED (the handle never receives any state-change information). I noticed that when I submitted the app from code, after a while the SparkSubmit process suddenly stopped. I checked the SparkSubmit log (launcher redirect log); it doesn't contain any useful information. > This is my pseudocode: > {code:java} > SparkAppHandle handle = launcher.startApplication(new > SparkAppHandle.Listener() { > @Override > public void stateChanged(SparkAppHandle handle) { > stateChangedHandle(handle.getAppId(), jobId, code, execId, > handle.getState(), driverInfo, request, infoLog, errorLog); > } > @Override > public void infoChanged(SparkAppHandle handle) { > stateChangedHandle(handle.getAppId(), jobId, code, execId, > handle.getState(), driverInfo, request, infoLog, errorLog); > } > });{code} > Any ideas? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34002) Broken UDF Encoding
[ https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34002: - Priority: Major (was: Critical) > Broken UDF Encoding > --- > > Key: SPARK-34002 > URL: https://issues.apache.org/jira/browse/SPARK-34002 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 > Environment: Windows 10 Surface book 3 > Local Spark > IntelliJ Idea > >Reporter: Mark Hamilton >Priority: Major > > UDFs can behave differently depending on whether a dataframe is cached, despite the dataframe being identical > > Repro: > > {code:java} > import org.apache.spark.sql.expressions.UserDefinedFunction > import org.apache.spark.sql.functions.{col, udf} > case class Bar(a: Int) > > import spark.implicits._ > def f1(bar: Bar): Option[Bar] = { > None > } > def f2(bar: Bar): Option[Bar] = { > Option(bar) > } > val udf1: UserDefinedFunction = udf(f1 _) > val udf2: UserDefinedFunction = udf(f2 _) > // Commenting in the cache will make this example work > val df = (1 to 10).map(i => Tuple1(Bar(1))).toDF("c0")//.cache() > val newDf = df > .withColumn("c1", udf1(col("c0"))) > .withColumn("c2", udf2(col("c1"))) > newDf.show() > {code} > > Error: > Testing started at 12:58 AM ... > (The rest of the attached output is an IntelliJ IDEA java command line and classpath listing; the exception text itself was truncated from the report.)
[jira] [Commented] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259336#comment-17259336 ] Hyukjin Kwon commented on SPARK-33981: -- [~995582386], Spark 2.3.3 is EOL. Can you see if the issue still exists in higher Spark versions? > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33982) Sparksql does not support when the inserted table is a read table
[ https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259335#comment-17259335 ] Hyukjin Kwon commented on SPARK-33982: -- [~hao.duan] how is it different from SPARK-34006? > Sparksql does not support when the inserted table is a read table > - > > Key: SPARK-33982 > URL: https://issues.apache.org/jira/browse/SPARK-33982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: hao >Priority: Major > > When the table being inserted into is also read in the same query, Spark SQL throws an error: > > org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is > also being read from. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34016) SparkR not on CRAN (again)
[ https://issues.apache.org/jira/browse/SPARK-34016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259334#comment-17259334 ] Hyukjin Kwon commented on SPARK-34016: -- cc [~dongjoon], [~felixcheung], [~shivaram] FYI > SparkR not on CRAN (again) > -- > > Key: SPARK-34016 > URL: https://issues.apache.org/jira/browse/SPARK-34016 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 2.4.7 >Reporter: Paul Johnson >Priority: Major > > SparkR is removed from CRAN > [https://cran.r-project.org/web/packages/SparkR/index.html#:~:text=Package%20'SparkR'%20was%20removed%20from,be%20obtained%20from%20the%20archive.=A%20summary%20of%20the%20most,to%20link%20to%20this%20page.] > Is there anything we can do to help? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34018) finishedExecutorWithRunningSidecar in test utils depends on nulls matching
Holden Karau created SPARK-34018: Summary: finishedExecutorWithRunningSidecar in test utils depends on nulls matching Key: SPARK-34018 URL: https://issues.apache.org/jira/browse/SPARK-34018 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.0, 3.2.0, 3.1.1 Reporter: Holden Karau Assignee: Holden Karau Currently the test passes only because we match on a null container-state string. To fix this, label both statuses and ensure the ExecutorPodSnapshot starts with the default config so that they match. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
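The hazard SPARK-34018 describes can be sketched without Kubernetes or Spark. This is a minimal illustration, not Spark's actual test code; the class and method names here are hypothetical. `Objects.equals(null, null)` is true, so a comparison between an expected state that was never set and a status that was never labeled "matches" vacuously:

```java
import java.util.Objects;

public class NullMatch {
    // Hypothetical stand-in for comparing a pod's container-state label
    // against the state a test expects.
    static boolean statusMatches(String expected, String actual) {
        // Objects.equals(null, null) returns true, so two *unset* states
        // match even though neither was ever labeled.
        return Objects.equals(expected, actual);
    }

    public static void main(String[] args) {
        String expectedState = null; // the test never set an expected state
        String actualState = null;   // the status object was never labeled either
        System.out.println(statusMatches(expectedState, actualState)); // true: passes vacuously
        System.out.println(statusMatches("Running", actualState));     // false once one side is labeled
    }
}
```

Labeling both sides explicitly (as the ticket proposes) turns the vacuous pass into a real assertion, because the comparison then checks two concrete strings.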
[jira] [Updated] (SPARK-33935) Fix CBOs cost function
[ https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33935: - Fix Version/s: 2.4.8 > Fix CBOs cost function > --- > > Key: SPARK-33935 > URL: https://issues.apache.org/jira/browse/SPARK-33935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.1, 3.1.0, 3.2.0 >Reporter: Tanel Kiis >Assignee: Tanel Kiis >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0, 3.2.0 > > > The parameter spark.sql.cbo.joinReorder.card.weight is documented as: > {code:title=spark.sql.cbo.joinReorder.card.weight} > The weight of cardinality (number of rows) for plan cost comparison in join > reorder: rows * weight + size * (1 - weight). > {code} > But in the implementation the formula is a bit different: > {code:title=Current implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > if (other.planCost.card == 0 || other.planCost.size == 0) { > false > } else { > val relativeRows = BigDecimal(this.planCost.card) / > BigDecimal(other.planCost.card) > val relativeSize = BigDecimal(this.planCost.size) / > BigDecimal(other.planCost.size) > relativeRows * conf.joinReorderCardWeight + > relativeSize * (1 - conf.joinReorderCardWeight) < 1 > } > } > {code} > This change has an unfortunate consequence: > given two plans A and B, A betterThan B and B betterThan A can both give > the same result (false). This happens when one plan has many rows with a small > size and the other has few rows with a large size. > Example values that exhibit this phenomenon with the default weight value (0.7): > A.card = 500, B.card = 300 > A.size = 30, B.size = 80 > Both A betterThan B and B betterThan A would have a score above 1 and would > return false. 
> A new implementation is proposed, that matches the documentation: > {code:title=Proposed implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > val oldCost = BigDecimal(this.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight) > val newCost = BigDecimal(other.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight) > newCost < oldCost > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
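The arithmetic in the ticket can be checked standalone. The sketch below replicates both comparison formulas in plain Java with `BigDecimal` (it is not Spark's actual `JoinPlan` code; the method names are illustrative). With the ticket's values (A.card = 500, A.size = 30, B.card = 300, B.size = 80, weight 0.7), the relative formula scores A-vs-B at about 1.28 and B-vs-A at about 1.22, so both comparisons return false, while the absolute formula gives cost(B) = 234 < cost(A) = 359 and correctly prefers B:

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class CboCost {
    // Current (relative) comparison: rows1/rows2 * w + size1/size2 * (1 - w) < 1.
    static boolean betterThanRelative(long card1, long size1, long card2, long size2, double w) {
        BigDecimal weight = BigDecimal.valueOf(w);
        BigDecimal relRows = BigDecimal.valueOf(card1)
                .divide(BigDecimal.valueOf(card2), MathContext.DECIMAL64);
        BigDecimal relSize = BigDecimal.valueOf(size1)
                .divide(BigDecimal.valueOf(size2), MathContext.DECIMAL64);
        BigDecimal score = relRows.multiply(weight)
                .add(relSize.multiply(BigDecimal.ONE.subtract(weight)));
        return score.compareTo(BigDecimal.ONE) < 0;
    }

    // Proposed (absolute) comparison matching the documented formula:
    // rows * w + size * (1 - w), lower cost wins.
    static boolean betterThanAbsolute(long card1, long size1, long card2, long size2, double w) {
        BigDecimal weight = BigDecimal.valueOf(w);
        BigDecimal cost1 = BigDecimal.valueOf(card1).multiply(weight)
                .add(BigDecimal.valueOf(size1).multiply(BigDecimal.ONE.subtract(weight)));
        BigDecimal cost2 = BigDecimal.valueOf(card2).multiply(weight)
                .add(BigDecimal.valueOf(size2).multiply(BigDecimal.ONE.subtract(weight)));
        return cost1.compareTo(cost2) < 0;
    }

    public static void main(String[] args) {
        // A: card=500, size=30; B: card=300, size=80; weight=0.7 (values from the ticket)
        System.out.println(betterThanRelative(500, 30, 300, 80, 0.7)); // false: score ~1.28 > 1
        System.out.println(betterThanRelative(300, 80, 500, 30, 0.7)); // false: score ~1.22 > 1
        System.out.println(betterThanAbsolute(300, 80, 500, 30, 0.7)); // true: 234 < 359
    }
}
```

The relative formula is not antisymmetric (neither plan can "win" when the ratios pull in opposite directions), whereas the absolute formula imposes a total order on costs, which is what join reordering needs.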
[jira] [Resolved] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-34012. -- Fix Version/s: 3.2.0 3.1.0 Assignee: angerszhu Resolution: Fixed Resolved by https://github.com/apache/spark/pull/31039 > Keep behavior consistent when conf > `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration > guide > - > > Key: SPARK-34012 > URL: https://issues.apache.org/jira/browse/SPARK-34012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0, 3.2.0 > > > According to > [https://github.com/apache/spark/pull/29087#issuecomment-754389257] > fix bug and add UT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34012: - Affects Version/s: 3.1.0 3.0.2 2.4.8 > Keep behavior consistent when conf > `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration > guide > - > > Key: SPARK-34012 > URL: https://issues.apache.org/jira/browse/SPARK-34012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: angerszhu >Priority: Major > > According to > [https://github.com/apache/spark/pull/29087#issuecomment-754389257] > fix bug and add UT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33833) Allow Spark Structured Streaming report Kafka Lag through Burrow
[ https://issues.apache.org/jira/browse/SPARK-33833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259280#comment-17259280 ] L. C. Hsieh commented on SPARK-33833: - Thanks [~kabhwan]. > Allow Spark Structured Streaming report Kafka Lag through Burrow > > > Key: SPARK-33833 > URL: https://issues.apache.org/jira/browse/SPARK-33833 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Sam Davarnia >Priority: Major > > Because structured streaming tracks Kafka offset consumption by itself, > It is not possible to track total Kafka lag using Burrow similar to DStreams > We have used Stream hooks as mentioned > [here|https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37] > > It would be great if Spark supports this feature out of the box. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33833) Allow Spark Structured Streaming report Kafka Lag through Burrow
[ https://issues.apache.org/jira/browse/SPARK-33833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259263#comment-17259263 ] Jungtaek Lim commented on SPARK-33833: -- (Just to be clear, I was talking about setting the group ID on purpose, not for SPARK-27549. If I were objecting to this, I wouldn't have created my own project. If you think this is good to have, please raise a discussion thread to gather consensus, so that we can make progress without the risk of another soft rejection.) > Allow Spark Structured Streaming report Kafka Lag through Burrow > > > Key: SPARK-33833 > URL: https://issues.apache.org/jira/browse/SPARK-33833 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Sam Davarnia >Priority: Major > > Because structured streaming tracks Kafka offset consumption by itself, > It is not possible to track total Kafka lag using Burrow similar to DStreams > We have used Stream hooks as mentioned > [here|https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37] > > It would be great if Spark supports this feature out of the box. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org