[jira] [Created] (SPARK-34024) datasourceV1 VS dataSourceV2
Zhenglin luo created SPARK-34024:

Summary: datasourceV1 VS dataSourceV2
Key: SPARK-34024
URL: https://issues.apache.org/jira/browse/SPARK-34024
Project: Spark
Issue Type: Question
Components: Input/Output
Affects Versions: 3.0.0
Reporter: Zhenglin luo

I found that DataSourceV2 has gone through many revisions, so why is DataSourceV2 still not used by default in the latest version? I would like to know whether there is a big difference in execution efficiency between V1 and V2, or whether there are other reasons. Thanks a lot.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
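For context on the question above: Spark 3.0 routes its built-in file sources through the V1 path via a configuration allow-list, `spark.sql.sources.useV1SourceList`. The sketch below shows how that list is typically set; treat the exact default value as an assumption to verify against the documentation for your Spark version.

```
# spark-defaults.conf -- illustrative sketch; verify the default for your version.
# Sources named here keep the V1 read/write path; removing a name opts that
# source into the V2 path where a V2 implementation exists.
spark.sql.sources.useV1SourceList  avro,csv,json,kafka,orc,parquet,text
```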
[jira] [Assigned] (SPARK-34023) Updated to Kryo 5.0.3
[ https://issues.apache.org/jira/browse/SPARK-34023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34023:

Assignee: (was: Apache Spark)

> Updated to Kryo 5.0.3
> Key: SPARK-34023
> URL: https://issues.apache.org/jira/browse/SPARK-34023
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Nick Nezis
> Priority: Major
[jira] [Assigned] (SPARK-34023) Updated to Kryo 5.0.3
[ https://issues.apache.org/jira/browse/SPARK-34023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34023:

Assignee: Apache Spark

> Updated to Kryo 5.0.3
> Key: SPARK-34023
> URL: https://issues.apache.org/jira/browse/SPARK-34023
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Nick Nezis
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-34023) Updated to Kryo 5.0.3
[ https://issues.apache.org/jira/browse/SPARK-34023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259469#comment-17259469 ]

Apache Spark commented on SPARK-34023:

User 'nicknezis' has created a pull request for this issue:
https://github.com/apache/spark/pull/31059

> Updated to Kryo 5.0.3
> Key: SPARK-34023
> URL: https://issues.apache.org/jira/browse/SPARK-34023
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Nick Nezis
> Priority: Major
[jira] [Commented] (SPARK-34011) ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259468#comment-17259468 ]

Apache Spark commented on SPARK-34011:

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31060

> ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
> Key: SPARK-34011
> URL: https://issues.apache.org/jira/browse/SPARK-34011
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1, 3.1.0, 3.2.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Labels: correctness
> Fix For: 3.2.0
>
> Here is an example that reproduces the issue. The second SELECT should reflect the renamed partition (row 0 2), but the stale cached result is returned instead:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
> spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
> spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
> spark-sql> CACHE TABLE tbl1;
> spark-sql> SELECT * FROM tbl1;
> 0	0
> 1	1
> spark-sql> ALTER TABLE tbl1 PARTITION (part0 = 0) RENAME TO PARTITION (part0 = 2);
> spark-sql> SELECT * FROM tbl1;
> 0	0
> 1	1
> {code}
[jira] [Updated] (SPARK-34023) Updated to Kryo 5.0.3
[ https://issues.apache.org/jira/browse/SPARK-34023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-34023:

Reporter: Nick Nezis (was: Dongjoon Hyun)

> Updated to Kryo 5.0.3
> Key: SPARK-34023
> URL: https://issues.apache.org/jira/browse/SPARK-34023
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Nick Nezis
> Priority: Major
[jira] [Created] (SPARK-34023) Updated to Kryo 5.0.3
Dongjoon Hyun created SPARK-34023:

Summary: Updated to Kryo 5.0.3
Key: SPARK-34023
URL: https://issues.apache.org/jira/browse/SPARK-34023
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-34022) Support latest mkdocs in SQL functions doc
[ https://issues.apache.org/jira/browse/SPARK-34022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-34022:

Description: The list on the sidebar does not show the list of functions.

> Support latest mkdocs in SQL functions doc
> Key: SPARK-34022
> URL: https://issues.apache.org/jira/browse/SPARK-34022
> Project: Spark
> Issue Type: Task
> Components: Documentation, SQL
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Attachments: Screen Shot 2021-01-06 at 4.22.50 PM.png
>
> The list on the sidebar does not show the list of functions.
[jira] [Updated] (SPARK-34022) Support latest mkdocs in SQL functions doc
[ https://issues.apache.org/jira/browse/SPARK-34022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-34022:

Attachment: Screen Shot 2021-01-06 at 4.22.50 PM.png

> Support latest mkdocs in SQL functions doc
> Key: SPARK-34022
> URL: https://issues.apache.org/jira/browse/SPARK-34022
> Project: Spark
> Issue Type: Task
> Components: Documentation, SQL
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Attachments: Screen Shot 2021-01-06 at 4.22.50 PM.png
[jira] [Created] (SPARK-34022) Support latest mkdocs in SQL functions doc
Hyukjin Kwon created SPARK-34022:

Summary: Support latest mkdocs in SQL functions doc
Key: SPARK-34022
URL: https://issues.apache.org/jira/browse/SPARK-34022
Project: Spark
Issue Type: Task
Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon
Attachments: Screen Shot 2021-01-06 at 4.22.50 PM.png
[jira] [Resolved] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-33948.

Fix Version/s: 3.1.0
Resolution: Fixed

Issue resolved by pull request 31055
[https://github.com/apache/spark/pull/31055]

> ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
> Key: SPARK-33948
> URL: https://issues.apache.org/jira/browse/SPARK-33948
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 3.1.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Fix For: 3.1.0
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink
>
> Failing tests, all in org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite
> (https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/):
> * encode/decode for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (codegen path)
> * encode/decode for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (interpreted path)
> * encode/decode for Tuple2: (ArrayBuffer[(Int, Int)],ArrayBuffer((1,2))) (codegen path)
> * encode/decode for Tuple2: (ArrayBuffer[(Int, Int)],ArrayBuffer((1,2))) (interpreted path)
> * encode/decode for Tuple2: (ArrayBuffer[(Long, Long)],ArrayBuffer((1,2))) (codegen path)
> * encode/decode for Tuple2: (ArrayBuffer[(Long, Long)],ArrayBuffer((1,2))) (interpreted path)
> * encode/decode for Tuple2: (ArrayBuffer[(Float, Float)],ArrayBuffer((1.0,2.0))) (codegen path)
> * encode/decode for Tuple2: (ArrayBuffer[(Float, Float)],ArrayBuffer((1.0,2.0))) (interpreted path)
> * encode/decode for Tuple2: (ArrayBuffer[(Double,
[jira] [Assigned] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-33948:

Assignee: Yang Jie

> ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
> Key: SPARK-33948
> URL: https://issues.apache.org/jira/browse/SPARK-33948
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 3.1.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
[jira] [Commented] (SPARK-33966) Two-tier encryption key management
[ https://issues.apache.org/jira/browse/SPARK-33966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259454#comment-17259454 ]

DB Tsai commented on SPARK-33966:

cc [~dongjoon] [~chaosun] and [~viirya]

> Two-tier encryption key management
> Key: SPARK-33966
> URL: https://issues.apache.org/jira/browse/SPARK-33966
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Gidon Gershinsky
> Priority: Major
>
> Columnar data formats (Parquet and ORC) have recently added a column encryption capability. The data protection follows the practice of envelope encryption, where the Data Encryption Key (DEK) is freshly generated for each file/column and is encrypted with a master key (or with an intermediate key that is in turn encrypted with a master key). The master keys are kept in a centralized Key Management Service (KMS), meaning that each Spark worker needs to interact with a (typically slow) KMS server.
>
> This Jira (and its sub-tasks) introduces an alternative approach that, on one hand, preserves the best practice of generating fresh encryption keys for each data file/column and, on the other hand, allows Spark clusters to interact with a KMS server scalably by delegating that interaction to the application driver. This is done via two-tier management of the keys: a random Key Encryption Key (KEK) is generated by the driver, encrypted with the master key in the KMS, and distributed by the driver to the workers, so they can use it to encrypt the DEKs generated there by the Parquet or ORC libraries. In the workers, the KEKs are distributed to the executors/threads in the write path. In the read path, the encrypted KEKs are fetched by the workers from file metadata, decrypted via interaction with the driver, and shared among the executors/threads.
>
> The KEK layer further improves the scalability of key management, because neither the driver nor the workers need to interact with the KMS for each file/column.
>
> Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks (e.g., Presto, pandas) must be able to read/decrypt the files written/encrypted by this Spark-driven key-management mechanism, and vice versa (of course, only if both sides have proper authorisation for using the master keys in the KMS).
>
> A link to a discussion/design doc is attached.
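The two-tier scheme described above (master key in a KMS, driver-generated KEKs, per-file DEKs) is an instance of envelope encryption. Below is a minimal toy sketch of the key hierarchy in Python. The XOR "cipher" is a deterministic stand-in for real AES and KMS calls, and every name here is hypothetical; this illustrates only the wrap/unwrap flow, not anything usable for actual encryption.

```python
import hashlib
import secrets

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Stand-in for a real cipher: XOR data with a SHA-256-derived keystream."""
    stream = hashlib.sha256(key).digest()
    # Extend the keystream to cover the data (toy construction only).
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

toy_decrypt = toy_encrypt  # XOR with the same keystream is its own inverse

# Tier 0: master key, held only by the KMS; only the driver talks to the KMS.
master_key = secrets.token_bytes(32)

# Tier 1: the driver generates a random KEK and has the KMS wrap it once.
kek = secrets.token_bytes(32)
wrapped_kek = toy_encrypt(master_key, kek)   # stored in file metadata

# Tier 2: workers generate fresh DEKs per file/column and wrap them with the
# KEK, so no per-file round trip to the KMS is needed.
dek = secrets.token_bytes(32)
wrapped_dek = toy_encrypt(kek, dek)

# Read path: unwrap the KEK via the driver/KMS, then unwrap the DEK locally.
recovered_dek = toy_decrypt(toy_decrypt(master_key, wrapped_kek), wrapped_dek)
assert recovered_dek == dek
```

The point of the middle tier is visible in the structure: the expensive `master_key` operation happens once per KEK, while DEK wrapping and unwrapping stays local to the workers.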
[jira] [Commented] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259446#comment-17259446 ]

Jungtaek Lim commented on SPARK-33635:

One more point: although the root cause is actually in changes on the "Kafka" side, you'll find a huge difference in log-file size between Spark 2.4 and 3.0. Some "debug" log messages in Kafka 2.0 (which Spark 2.4 uses) were relabeled as "info" log messages in later versions of Kafka (at least including Kafka 2.4, which Spark 3.0 uses). I found such changes in KafkaConsumer.seek(), and there could be more. If you feel these messages are flooding the log and likely affecting performance (though that would be unlikely), you can change your log4j configuration to suppress them:

log4j.logger.org.apache.kafka.clients.consumer.KafkaConsumer=WARN

> Performance regression in Kafka read
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5-node system. A simple data row of CSV data in Kafka, evenly distributed between the partitions.
> OpenJDK 1.8.0.252
> Spark in stand-alone mode: 5 nodes, 10 workers (2 workers per node, each locked to a distinct NUMA group)
> Kafka (v2.3.1) cluster: 5 nodes (1 broker per node)
> CentOS 7.7.1908
> 1 topic, 10 partitions, 1-hour queue life
> (This is just one of the clusters we have; I have tested on all of them and they all exhibit the same performance degradation.)
> Reporter: David Wyles
> Assignee: Jungtaek Lim
> Priority: Blocker
> Fix For: 3.0.2, 3.1.0
>
> I have observed a slowdown in the reading of data from Kafka on all of our systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
>
> I have created a sample project to isolate the problem as much as possible, which just reads all data from a Kafka topic (see [https://github.com/codegorillauk/spark-kafka-read]).
>
> With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 632,000 (0.632 million) rows per second.
> That represents a *44% loss in performance*, which is a lot.
>
> I have been working through the spark-sql-kafka-0-10 code base, but changes for Spark 3 have been ongoing for over a year and it is difficult to pinpoint an exact change or reason for the degradation.
>
> I am happy to help fix this problem, but will need some assistance as I am unfamiliar with the spark-sql-kafka-0-10 project.
>
> A sample of the data my test reads (note: it is not parsing CSV; this is just test data):
>
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) [C],2017/12/15 00:00:00,
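The one-line log4j override suggested in the comment above goes into the log4j.properties file that Spark loads; the `conf/log4j.properties` path below is the usual location for a stand-alone deployment and is an assumption for your setup.

```
# conf/log4j.properties (excerpt) -- suppresses the KafkaConsumer messages
# that moved from DEBUG to INFO in newer Kafka clients.
log4j.logger.org.apache.kafka.clients.consumer.KafkaConsumer=WARN
```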
[jira] [Updated] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33635:

Target Version/s: 3.0.2, 3.1.0

> Performance regression in Kafka read
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: David Wyles
> Assignee: Jungtaek Lim
> Priority: Blocker
> Fix For: 3.0.2, 3.1.0
[jira] [Resolved] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-33635.

Fix Version/s: 3.0.2, 3.1.0
Resolution: Fixed

Issue resolved by pull request 31056
[https://github.com/apache/spark/pull/31056]

> Performance regression in Kafka read
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: David Wyles
> Assignee: Jungtaek Lim
> Priority: Blocker
> Fix For: 3.1.0, 3.0.2
[jira] [Assigned] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-33635:

Assignee: Jungtaek Lim

> Performance regression in Kafka read
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: David Wyles
> Assignee: Jungtaek Lim
> Priority: Blocker
[jira] [Commented] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259432#comment-17259432 ]

Apache Spark commented on SPARK-34021:

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31058

> Fix hyper links in SparkR documentation for CRAN submission
> Key: SPARK-34021
> URL: https://issues.apache.org/jira/browse/SPARK-34021
> Project: Spark
> Issue Type: Task
> Components: SparkR
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Blocker
>
> CRAN submission fails due to:
> {code}
> Found the following (possibly) invalid URLs:
>   URL: http://jsonlines.org/ (moved to https://jsonlines.org/)
>     From: man/read.json.Rd
>           man/write.json.Rd
>     Status: 200
>     Message: OK
>   URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to https://dl.acm.org/doi/10.1109/MC.2009.263)
>     From: inst/doc/sparkr-vignettes.html
>     Status: 200
>     Message: OK
> {code}
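The CRAN check above flags URLs that now redirect; the fix is to update them at their source. A hedged sketch of a bulk replacement follows. The paths and file names are illustrative (the real SparkR `man/*.Rd` files are generated from roxygen comments, so the actual fix belongs in the R sources), and only the two URL pairs come from the check output above.

```python
# Rewrite the moved URLs flagged by the CRAN check across a tree of .R files.
import pathlib

REPLACEMENTS = {
    "http://jsonlines.org/": "https://jsonlines.org/",
    "https://dl.acm.org/citation.cfm?id=1608614":
        "https://dl.acm.org/doi/10.1109/MC.2009.263",
}

def fix_urls(root: pathlib.Path) -> int:
    """Apply REPLACEMENTS in every .R file under root; return files changed."""
    changed = 0
    for path in root.rglob("*.R"):
        text = original = path.read_text()
        for old, new in REPLACEMENTS.items():
            text = text.replace(old, new)
        if text != original:
            path.write_text(text)
            changed += 1
    return changed

# Demo on a scratch directory with a hypothetical file name.
demo = pathlib.Path("demo_r_pkg")
demo.mkdir(exist_ok=True)
(demo / "read.json.R").write_text("See http://jsonlines.org/ for the format.\n")
fix_urls(demo)
```

Running the function a second time changes nothing, which makes it safe to re-run after regenerating the docs.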
[jira] [Assigned] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34021: Assignee: (was: Apache Spark) > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34021: Assignee: Apache Spark > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259430#comment-17259430 ] Apache Spark commented on SPARK-34021: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/31058 > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34008) Upgrade derby to 10.14.2.0
[ https://issues.apache.org/jira/browse/SPARK-34008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34008. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31032 [https://github.com/apache/spark/pull/31032] > Upgrade derby to 10.14.2.0 > -- > > Key: SPARK-34008 > URL: https://issues.apache.org/jira/browse/SPARK-34008 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > Derby 10.14.2.0 seems to be the final release that supports JDK 8 as the > minimum required version, so let's upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] Maple edited comment on SPARK-33981 at 1/6/21, 5:48 AM: I know what the problem is. On the Spark history server UI (default port 18080), the storage page is empty even though the application is running. On the YARN UI (default port 8088), [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty while the application is running; once the application completes, the storage page becomes empty. was (Author: 995582386): I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty,even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application completes,storage page becomes empty. > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] Maple edited comment on SPARK-33981 at 1/6/21, 5:33 AM: I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty,even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application completes,storage page becomes empty. was (Author: 995582386): I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty.even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application completes,storage page becomes empty. > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34017) Pass json column information via pruneColumns()
[ https://issues.apache.org/jira/browse/SPARK-34017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259423#comment-17259423 ] Ted Yu commented on SPARK-34017: For PushDownUtils#pruneColumns, I am experimenting with the following:
{code}
case r: SupportsPushDownRequiredColumns if SQLConf.get.nestedSchemaPruningEnabled =>
  val JSONCapture = "get_json_object\\((.*), *(.*)\\)".r
  var jsonRootFields: ArrayBuffer[RootField] = ArrayBuffer()
  projects.map { _.map { f => f.toString match {
    case JSONCapture(column, field) =>
      jsonRootFields += RootField(StructField(column, f.dataType, f.nullable),
        derivedFromAtt = false, prunedIfAnyChildAccessed = true)
    case _ => logDebug("else " + f)
  }}}
  val rootFields = SchemaPruning.identifyRootFields(projects, filters) ++ jsonRootFields
{code}
> Pass json column information via pruneColumns() > --- > > Key: SPARK-34017 > URL: https://issues.apache.org/jira/browse/SPARK-34017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushDownUtils#pruneColumns only passes root fields to > SupportsPushDownRequiredColumns implementation(s). > {code} > 2021-01-05 19:36:07,437 (Time-limited test) [DEBUG - > org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] nested schema > projection List(id#33, address#34, phone#36, get_json_object(phone#36, > $.code) AS get_json_object(phone, $.code)#37) > 2021-01-05 19:36:07,438 (Time-limited test) [DEBUG - > org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] nested schema > StructType(StructField(id,IntegerType,false), > StructField(address,StringType,true), StructField(phone,StringType,true)) > {code} > The first line shows projections and the second line shows the pruned schema. > We can see that get_json_object(phone#36, $.code) is filtered. This > expression retrieves field 'code' from phone json column. > We should allow json column information to be passed via pruneColumns().
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
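Ted Yu's snippet above recovers the column reference and JSON path from a projection's string form, e.g. get_json_object(phone#36, $.code), with a regular expression. The same capture step, sketched standalone in Python (the regex mirrors the Scala one; the RootField bookkeeping is omitted and the function name is illustrative):

```python
import re

# Mirror of the Scala regex, applied to the string form of a projection
# such as "get_json_object(phone#36, $.code)".
JSON_CAPTURE = re.compile(r"get_json_object\((.*), *(.*)\)")

def capture_json_field(expr: str):
    """Return (column, json_path) for a get_json_object expression, else None."""
    m = JSON_CAPTURE.fullmatch(expr)  # Scala's Regex extractor also requires a full match
    return m.groups() if m else None

print(capture_json_field("get_json_object(phone#36, $.code)"))  # ('phone#36', '$.code')
print(capture_json_field("address#34"))                         # None
```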
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maple updated SPARK-33981: -- Attachment: (was: image-2021-01-06-11-20-49-804.png) > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maple updated SPARK-33981: -- Attachment: (was: screenshot-1.png) > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maple updated SPARK-33981: -- Attachment: (was: image-2021-01-06-11-19-09-849.png) > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] Maple edited comment on SPARK-33981 at 1/6/21, 5:31 AM: I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty.even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application completes,storage page becomes empty. was (Author: 995582386): I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty.even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application complete,storage page becomes empty. > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png, > image-2021-01-06-11-19-09-849.png, image-2021-01-06-11-20-49-804.png, > screenshot-1.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] Maple edited comment on SPARK-33981 at 1/6/21, 5:30 AM: I know what's the problem. On spark history server UI,default 18080 port, the storage page is empty.even though the application is running. And on yarn ui, default 8088, [http://localhost:8088/proxy/application_1609225827199_0004/storage/] the storage page is not empty when application is running,if the application complete,storage page becomes empty. was (Author: 995582386): I try in Spark 2.4.7,it still exists. !image-2021-01-06-11-20-49-804.png! !screenshot-1.png!!image-2021-01-06-11-19-09-849.png! > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: Maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png, > image-2021-01-06-11-19-09-849.png, image-2021-01-06-11-20-49-804.png, > screenshot-1.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34021: - Target Version/s: 3.1.0 > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
[ https://issues.apache.org/jira/browse/SPARK-34021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34021: - Priority: Blocker (was: Critical) > Fix hyper links in SparkR documentation for CRAN submission > --- > > Key: SPARK-34021 > URL: https://issues.apache.org/jira/browse/SPARK-34021 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > CRAN submission fails due to: > {code} >Found the following (possibly) invalid URLs: > URL: http://jsonlines.org/ (moved to https://jsonlines.org/) >From: man/read.json.Rd > man/write.json.Rd >Status: 200 >Message: OK > URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to > https://dl.acm.org/doi/10.1109/MC.2009.263) >From: inst/doc/sparkr-vignettes.html >Status: 200 >Message: OK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34020) IndexOutOfBoundsException on merge of two pyspark frames
[ https://issues.apache.org/jira/browse/SPARK-34020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darshat resolved SPARK-34020. - Resolution: Invalid The call stack pointed to an issue with apply, not the join. I think this may be the udf limit issue again. > IndexOutOfBoundsException on merge of two pyspark frames > > > Key: SPARK-34020 > URL: https://issues.apache.org/jira/browse/SPARK-34020 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Darshat >Priority: Major > > We are using Databricks on Azure, with Apache Spark 3.0.0 and Scala 2.12. > When two tables are joined - one with 36 million rows, the other with 4k rows - we > get an IndexOutOfBoundsException with Arrow on the call stack. > The cluster has 72 nodes and 288 cores. Workers have 16 GB memory overall. The > spark.sql.shuffle.partitions is set to 288. > Since the join key has an uneven distribution, we also tried repartitioning it into > 1000 partitions of the join key, but this results in the same error. > Any pointers on what can be causing this issue would be very helpful.
Thanks, > Darshat > {{21/01/06 04:05:06 ERROR ArrowPythonRunner: Python worker exited > unexpectedly (crashed)}} > {{org.apache.spark.api.python.PythonException: Traceback (most recent call > last):}} > {{ File "/databricks/spark/python/pyspark/worker.py", line 640, in main}} > {{ eval_type = read_int(infile)}} > {{ File "/databricks/spark/python/pyspark/serializers.py", line 603, in > read_int}} > {{ raise EOFError}} > {{EOFError}}{{at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:585)}} > {{ at > org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99)}} > {{ at > org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)}} > {{ at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:538)}} > {{ at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)}} > {{ at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)}} > {{ at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)}} > {{ at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage16.processNext(Unknown > Source)}} > {{ at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)}} > {{ at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)}} > {{ at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)}} > {{ at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)}} > {{ at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)}} > {{ at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)}} > {{ at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)}} > {{ at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)}} > {{ at 
org.apache.spark.scheduler.Task.run(Task.scala:117)}} > {{ at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)}} > {{ at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)}} > {{ at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)}} > {{ at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}} > {{ at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}} > {{ at java.lang.Thread.run(Thread.java:748)}} > {{Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: > 1073741824 (expected: range(0, 0))}} > {{ at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)}} > {{ at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)}} > {{ at > org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)}} > {{ at > org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)}} > {{ at > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)}} > {{ at > org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:278)}} > {{ at > org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:139)}} > {{ at > org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:93)}} > {{ at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)}} > {{ at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)}} > {{ at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)}} > {{ at >
[jira] [Created] (SPARK-34021) Fix hyper links in SparkR documentation for CRAN submission
Hyukjin Kwon created SPARK-34021: Summary: Fix hyper links in SparkR documentation for CRAN submission Key: SPARK-34021 URL: https://issues.apache.org/jira/browse/SPARK-34021 Project: Spark Issue Type: Task Components: SparkR Affects Versions: 3.1.0 Reporter: Hyukjin Kwon CRAN submission fails due to: {code} Found the following (possibly) invalid URLs: URL: http://jsonlines.org/ (moved to https://jsonlines.org/) From: man/read.json.Rd man/write.json.Rd Status: 200 Message: OK URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to https://dl.acm.org/doi/10.1109/MC.2009.263) From: inst/doc/sparkr-vignettes.html Status: 200 Message: OK {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted
[ https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259415#comment-17259415 ] Apache Spark commented on SPARK-33029: -- User 'baohe-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/31057 > Standalone mode blacklist executors page UI marks driver as blacklisted > --- > > Key: SPARK-33029 > URL: https://issues.apache.org/jira/browse/SPARK-33029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot > 2020-09-29 at 1.53.37 PM.png > > > I am running a spark shell on a 1 node standalone cluster. I noticed that > the executors page ui was marking the driver as blacklisted for the stage > that is running. Attached a screen shot. > Also, in my case one of the executors died and it doesn't seem like the > scheduler picked up the new one. It doesn't show up on the stages page and > just shows it as active but none of the tasks ran there. > > You can reproduce this by starting a master and slave on a single node, then > launch a shell like the following, where you will get multiple executors (in this case I got > 3) > $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 > --conf spark.blacklist.enabled=true > > From shell run: > {code:java} > import org.apache.spark.TaskContext > val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it => > val context = TaskContext.get() > if (context.attemptNumber() < 2) { > throw new Exception("test attempt num") > } > it > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-33635: - Priority: Blocker (was: Major) > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5 node system. A simple data row of csv data in > kafka, evenly distributed between the partitions. > Open JDK 1.8.0.252 > Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each locked to > a distinct NUMA group) > kafka (v 2.3.1) cluster - 5 nodes (1 broker per node). > Centos 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and > they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Blocker > > I have observed a slowdown in the reading of data from kafka on all of our > systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible, > which just reads all data from a kafka topic (see > [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, > I get a stable read rate of 1,120,000 (1.12 mill) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, > I get a stable read rate of 632,000 (0.632 mil) rows per second. > This represents a *44% loss in performance*. Which is a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for > Spark 3 have been ongoing for over a year and it's difficult to pinpoint an > exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am > unfamiliar with the spark-sql-kafka-0-10 project.
> > A sample of the data my test reads (note: its not parsing csv - this is just > test data) > > 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine > Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) > [C],2017/12/15 00:00:00, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
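As a quick sanity check on the numbers reported above, the drop from 1,120,000 to 632,000 rows per second works out to roughly the 44% loss claimed:

```python
# Throughput figures quoted in the report.
rate_spark_245 = 1_120_000  # rows/sec on Spark 2.4.5
rate_spark_30x = 632_000    # rows/sec on Spark 3.0.0 / 3.0.1

loss = (rate_spark_245 - rate_spark_30x) / rate_spark_245
print(f"performance loss: {loss:.1%}")  # performance loss: 43.6%
```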
[jira] [Resolved] (SPARK-34004) Change FrameLessOffsetWindowFunction as sealed abstract class
[ https://issues.apache.org/jira/browse/SPARK-34004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34004. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31026 [https://github.com/apache/spark/pull/31026] > Change FrameLessOffsetWindowFunction as sealed abstract class > - > > Key: SPARK-34004 > URL: https://issues.apache.org/jira/browse/SPARK-34004 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.2.0 > > > Change FrameLessOffsetWindowFunction to a sealed abstract class in order to > simplify pattern matching. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34004) Change FrameLessOffsetWindowFunction as sealed abstract class
[ https://issues.apache.org/jira/browse/SPARK-34004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34004: - Assignee: jiaan.geng > Change FrameLessOffsetWindowFunction as sealed abstract class > - > > Key: SPARK-34004 > URL: https://issues.apache.org/jira/browse/SPARK-34004 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > Change FrameLessOffsetWindowFunction to a sealed abstract class in order to > simplify pattern matching. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33635: Assignee: Apache Spark > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5-node system. A simple data row of CSV data in Kafka, evenly distributed between the partitions. > OpenJDK 1.8.0_252 > Spark standalone - 5 nodes, 10 workers (2 workers per node, each locked to a distinct NUMA group) > Kafka (v2.3.1) cluster - 5 nodes (1 broker per node). > CentOS 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and they all exhibit the same performance degradation) >Reporter: David Wyles >Assignee: Apache Spark >Priority: Major > > I have observed a slowdown in reading data from Kafka on all of our systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible; it just reads all data from a Kafka topic (see [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 (1.12 million) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 632,000 (0.632 million) rows per second. > That represents a *44% loss in performance*. Which is, a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for Spark 3 have been ongoing for over a year and it's difficult to pinpoint an exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am unfamiliar with the spark-sql-kafka-0-10 project.
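The quoted 44% figure checks out against the reported rates; a quick arithmetic sanity check on the numbers above:

```python
rows_per_sec_245 = 1_120_000  # Spark 2.4.5, as reported
rows_per_sec_300 = 632_000    # Spark 3.0.0 / 3.0.1, as reported

loss = (rows_per_sec_245 - rows_per_sec_300) / rows_per_sec_245
print(f"{loss:.1%}")  # 43.6%, i.e. roughly the 44% quoted
```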
[jira] [Assigned] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33635: Assignee: (was: Apache Spark) > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5-node system. A simple data row of CSV data in Kafka, evenly distributed between the partitions. > OpenJDK 1.8.0_252 > Spark standalone - 5 nodes, 10 workers (2 workers per node, each locked to a distinct NUMA group) > Kafka (v2.3.1) cluster - 5 nodes (1 broker per node). > CentOS 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Major > > I have observed a slowdown in reading data from Kafka on all of our systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible; it just reads all data from a Kafka topic (see [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 (1.12 million) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 632,000 (0.632 million) rows per second. > That represents a *44% loss in performance*. Which is, a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for Spark 3 have been ongoing for over a year and it's difficult to pinpoint an exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am unfamiliar with the spark-sql-kafka-0-10 project.
[jira] [Commented] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259407#comment-17259407 ] Apache Spark commented on SPARK-33635: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/31056 > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5-node system. A simple data row of CSV data in Kafka, evenly distributed between the partitions. > OpenJDK 1.8.0_252 > Spark standalone - 5 nodes, 10 workers (2 workers per node, each locked to a distinct NUMA group) > Kafka (v2.3.1) cluster - 5 nodes (1 broker per node). > CentOS 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Major > > I have observed a slowdown in reading data from Kafka on all of our systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible; it just reads all data from a Kafka topic (see [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 (1.12 million) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 632,000 (0.632 million) rows per second. > That represents a *44% loss in performance*. Which is, a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for Spark 3 have been ongoing for over a year and it's difficult to pinpoint an exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am unfamiliar with the spark-sql-kafka-0-10 project.
[jira] [Commented] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259406#comment-17259406 ] Apache Spark commented on SPARK-33948: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/31055 > ExpressionEncoderSuite failed in > spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13 > --- > > Key: SPARK-33948 > URL: https://issues.apache.org/jira/browse/SPARK-33948 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.1.0 > Environment: * > >Reporter: Yang Jie >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink > > [ExpressionEncoderSuite|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__String__String___ArrayBuffer__a_b_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (codegen > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__String__String___ArrayBuffer__a_b_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (interpreted > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__String__String___ArrayBuffer__a_b_interpreted_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > 
for Tuple2: (ArrayBuffer[(Int, Int)],ArrayBuffer((1,2))) (codegen > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Int__Int___ArrayBuffer__1_2_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Int, Int)],ArrayBuffer((1,2))) (interpreted > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Int__Int___ArrayBuffer__1_2_interpreted_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Long, Long)],ArrayBuffer((1,2))) (codegen > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Long__Long___ArrayBuffer__1_2_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Long, Long)],ArrayBuffer((1,2))) (interpreted > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Long__Long___ArrayBuffer__1_2_interpreted_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Float, Float)],ArrayBuffer((1.0,2.0))) (codegen > 
path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Float__Float___ArrayBuffer__1_0_2_0_codegen_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Float, Float)],ArrayBuffer((1.0,2.0))) (interpreted > path)|https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-scala-2.13/101/testReport/junit/org.apache.spark.sql.catalyst.encoders/ExpressionEncoderSuite/encode_decode_for_Tuple2___ArrayBuffer__Float__Float___ArrayBuffer__1_0_2_0_interpreted_path_/] > * > [org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite.encode/decode > for Tuple2: (ArrayBuffer[(Double, Double)],ArrayBuffer((1.0,2.0))) (codegen >
[jira] [Commented] (SPARK-33100) Support parse the sql statements with c-style comments
[ https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259405#comment-17259405 ] Apache Spark commented on SPARK-33100: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/31054 > Support parse the sql statements with c-style comments > -- > > Key: SPARK-33100 > URL: https://issues.apache.org/jira/browse/SPARK-33100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Fix For: 3.1.0, 3.2.0 > > > Currently spark-sql does not support parsing SQL statements with C-style comments. > The SQL statements: > {code:java} > /* SELECT 'test'; */ > SELECT 'test'; > {code} > would be split into two statements: > The first: "/* SELECT 'test'" > The second: "*/ SELECT 'test'" > It would then throw an exception, because the first one is illegal.
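The failure mode above is a statement splitter that treats every `;` as a terminator, including those inside `/* ... */`. A minimal comment-aware splitter sketch (plain Python, not Spark's actual parser, and deliberately ignoring `;` inside string literals) shows the intended behaviour:

```python
def split_statements(sql: str) -> list:
    """Split on ';' while skipping over C-style /* ... */ comments."""
    statements, buf = [], []
    i, in_comment = 0, False
    while i < len(sql):
        pair = sql[i:i + 2]
        if in_comment:
            if pair == "*/":
                in_comment = False
                buf.append(pair)
                i += 2
                continue
            buf.append(sql[i])  # ';' inside a comment is just text
            i += 1
        else:
            if pair == "/*":
                in_comment = True
                buf.append(pair)
                i += 2
                continue
            if sql[i] == ";":
                statements.append("".join(buf).strip())
                buf = []
                i += 1
                continue
            buf.append(sql[i])
            i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements


# The ';' inside the comment no longer splits the statement:
print(split_statements("/* SELECT 'test'; */ SELECT 'test';"))
# ["/* SELECT 'test'; */ SELECT 'test'"]
```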
[jira] [Assigned] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33948: Assignee: Apache Spark > ExpressionEncoderSuite failed in > spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13 > --- > > Key: SPARK-33948 > URL: https://issues.apache.org/jira/browse/SPARK-33948 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.1.0 > Environment: * > >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink
[jira] [Commented] (SPARK-33100) Support parse the sql statements with c-style comments
[ https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259404#comment-17259404 ] Apache Spark commented on SPARK-33100: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/31054 > Support parse the sql statements with c-style comments > -- > > Key: SPARK-33100 > URL: https://issues.apache.org/jira/browse/SPARK-33100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Fix For: 3.1.0, 3.2.0 > > > Currently spark-sql does not support parsing SQL statements with C-style comments. > The SQL statements: > {code:java} > /* SELECT 'test'; */ > SELECT 'test'; > {code} > would be split into two statements: > The first: "/* SELECT 'test'" > The second: "*/ SELECT 'test'" > It would then throw an exception, because the first one is illegal.
[jira] [Assigned] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33948: Assignee: (was: Apache Spark) > ExpressionEncoderSuite failed in > spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13 > --- > > Key: SPARK-33948 > URL: https://issues.apache.org/jira/browse/SPARK-33948 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.1.0 > Environment: * > >Reporter: Yang Jie >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink
[jira] [Commented] (SPARK-33948) ExpressionEncoderSuite failed in spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13
[ https://issues.apache.org/jira/browse/SPARK-33948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259403#comment-17259403 ] Apache Spark commented on SPARK-33948: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/31055 > ExpressionEncoderSuite failed in > spark-branch-3.1-test-maven-hadoop-*-jdk-*-scala-2.13 > --- > > Key: SPARK-33948 > URL: https://issues.apache.org/jira/browse/SPARK-33948 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.1.0 > Environment: * > >Reporter: Yang Jie >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-3.2-jdk-11-scala-2.13/118/#showFailuresLink
[jira] [Updated] (SPARK-34020) IndexOutOfBoundsException on merge of two pyspark frames
[ https://issues.apache.org/jira/browse/SPARK-34020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darshat updated SPARK-34020: Description: We are using Databricks on Azure, with Apache Spark 3.0.0 and Scala 2.12. When two tables are joined - one with 36 million rows, the other with 4k rows - we get an IndexOutOfBoundsException with Arrow on the call stack. The cluster has 72 nodes and 288 cores. Workers have 16 GB memory overall. spark.sql.shuffle.partitions is set to 288. The join key has an uneven distribution; we also tried repartitioning into 1000 partitions of the join key using repartition, but this results in the same error. Any pointers on what could be causing this issue would be very helpful. Thanks, Darshat {{21/01/06 04:05:06 ERROR ArrowPythonRunner: Python worker exited unexpectedly (crashed)}} {{org.apache.spark.api.python.PythonException: Traceback (most recent call last):}} {{ File "/databricks/spark/python/pyspark/worker.py", line 640, in main}} {{ eval_type = read_int(infile)}} {{ File "/databricks/spark/python/pyspark/serializers.py", line 603, in read_int}} {{ raise EOFError}} {{EOFError}}{{at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:585)}} {{ at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99)}} {{ at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)}} {{ at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:538)}} {{ at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)}} {{ at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)}} {{ at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)}} {{ at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage16.processNext(Unknown Source)}} {{ at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)}} {{ at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)}} {{ at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)}} {{ at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)}} {{ at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)}} {{ at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)}} {{ at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)}} {{ at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)}} {{ at org.apache.spark.scheduler.Task.run(Task.scala:117)}} {{ at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)}} {{ at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)}} {{ at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)}} {{ at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}} {{ at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}} {{ at java.lang.Thread.run(Thread.java:748)}} {{Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))}} {{ at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)}} {{ at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)}} {{ at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)}} {{ at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)}} {{ at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)}} {{ at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:278)}} {{ at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:139)}} {{ at 
org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:93)}} {{ at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)}} {{ at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)}} {{ at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)}} {{ at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)}} {{ at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:465)}} {{ at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2124)}} {{ at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:257)}} {{21/01/06 04:05:06 ERROR ArrowPythonRunner: This may have been caused by a prior exception:}} {{java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))}} {{ at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)}} {{ at
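The failing length in the trace above is exactly 1 GiB (1073741824 bytes), which suggests a buffer-growth overflow on a heavily skewed partition rather than bad input data. A minimal sketch of that hypothesis (an assumption for illustration, not a confirmed diagnosis of this ticket):

```python
# Hypothetical failure-mode sketch: Arrow's variable-width vectors grow by
# doubling, and doubling the 1 GiB buffer from the exception overflows the
# signed 32-bit range that 32-bit buffer index arithmetic assumes.
GIB = 1 << 30                # 1073741824, the length reported in the exception
INT32_MAX = (1 << 31) - 1    # 2147483647

doubled = GIB * 2
print(doubled > INT32_MAX)   # True: the doubled buffer no longer fits in int32
```

If that is what happens here, capping the rows per Arrow batch (spark.sql.execution.arrow.maxRecordsPerBatch) or salting the skewed join key so no single task materializes an oversized batch may be worth trying.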
[jira] [Commented] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259401#comment-17259401 ] Apache Spark commented on SPARK-32165: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31053 > SessionState leaks SparkListener with multiple SparkSession > --- > > Key: SPARK-32165 > URL: https://issues.apache.org/jira/browse/SPARK-32165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Priority: Major > > Copied from > [https://github.com/apache/spark/pull/28128#issuecomment-653102770] > > {code:java} > test("SPARK-31354: SparkContext only register one SparkSession > ApplicationEnd listener") { > val conf = new SparkConf() > .setMaster("local") > .setAppName("test-app-SPARK-31354-1") > val context = new SparkContext(conf) > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postFirstCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postSecondCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > assert(postFirstCreation == postSecondCreation) > } > {code} > The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32165: Assignee: Apache Spark > SessionState leaks SparkListener with multiple SparkSession > --- > > Key: SPARK-32165 > URL: https://issues.apache.org/jira/browse/SPARK-32165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Assignee: Apache Spark >Priority: Major > > Copied from > [https://github.com/apache/spark/pull/28128#issuecomment-653102770] > > {code:java} > test("SPARK-31354: SparkContext only register one SparkSession > ApplicationEnd listener") { > val conf = new SparkConf() > .setMaster("local") > .setAppName("test-app-SPARK-31354-1") > val context = new SparkContext(conf) > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postFirstCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postSecondCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > assert(postFirstCreation == postSecondCreation) > } > {code} > The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32165: Assignee: (was: Apache Spark) > SessionState leaks SparkListener with multiple SparkSession > --- > > Key: SPARK-32165 > URL: https://issues.apache.org/jira/browse/SPARK-32165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Priority: Major > > Copied from > [https://github.com/apache/spark/pull/28128#issuecomment-653102770] > > {code:java} > test("SPARK-31354: SparkContext only register one SparkSession > ApplicationEnd listener") { > val conf = new SparkConf() > .setMaster("local") > .setAppName("test-app-SPARK-31354-1") > val context = new SparkContext(conf) > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postFirstCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postSecondCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > assert(postFirstCreation == postSecondCreation) > } > {code} > The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
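The Scala test quoted above boils down to a simple pattern: each new session registers a listener on the shared context's bus and nothing ever removes it. A minimal Python analogy (illustrative only, not Spark code) shows why the final assertion fails:

```python
# Minimal analogy of the SessionState listener leak: every session appends a
# listener to the shared bus on creation and never unregisters it, so the
# listener count keeps growing across sessions.
class ListenerBus:
    def __init__(self):
        self.listeners = []

class Session:
    def __init__(self, bus):
        # Leak: registered on creation, never removed when the session is cleared.
        bus.listeners.append(object())

bus = ListenerBus()
Session(bus)
post_first_creation = len(bus.listeners)
Session(bus)
post_second_creation = len(bus.listeners)
print(post_first_creation == post_second_creation)  # False: one listener leaked
```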
[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259399#comment-17259399 ] Jungtaek Lim edited comment on SPARK-33635 at 1/6/21, 4:18 AM: --- I've spent some time to trace the issue, and noticed SPARK-29054 (+SPARK-30495) caused performance regression (though the patch itself is doing the right thing). {code} private[kafka010] def getOrRetrieveConsumer(): InternalKafkaConsumer = { if (!_consumer.isDefined) { retrieveConsumer() } require(_consumer.isDefined, "Consumer must be defined") if (KafkaTokenUtil.needTokenUpdate(SparkEnv.get.conf, _consumer.get.kafkaParamsWithSecurity, _consumer.get.clusterConfig)) { logDebug("Cached consumer uses an old delegation token, invalidating.") releaseConsumer() consumerPool.invalidateKey(cacheKey) fetchedDataPool.invalidate(cacheKey) retrieveConsumer() } _consumer.get } {code} {code} def needTokenUpdate( sparkConf: SparkConf, params: ju.Map[String, Object], clusterConfig: Option[KafkaTokenClusterConf]): Boolean = { if (HadoopDelegationTokenManager.isServiceEnabled(sparkConf, "kafka") && clusterConfig.isDefined && params.containsKey(SaslConfigs.SASL_JAAS_CONFIG)) { logDebug("Delegation token used by connector, checking if uses the latest token.") val connectorJaasParams = params.get(SaslConfigs.SASL_JAAS_CONFIG).asInstanceOf[String] getTokenJaasParams(clusterConfig.get) != connectorJaasParams } else { false } } {code} {code} def isServiceEnabled(sparkConf: SparkConf, serviceName: String): Boolean = { val key = providerEnabledConfig.format(serviceName) deprecatedProviderEnabledConfigs.foreach { pattern => val deprecatedKey = pattern.format(serviceName) if (sparkConf.contains(deprecatedKey)) { logWarning(s"${deprecatedKey} is deprecated. 
Please use ${key} instead.") } } val isEnabledDeprecated = deprecatedProviderEnabledConfigs.forall { pattern => sparkConf .getOption(pattern.format(serviceName)) .map(_.toBoolean) .getOrElse(true) } sparkConf .getOption(key) .map(_.toBoolean) .getOrElse(isEnabledDeprecated) } {code} With my test data and default config, Spark pulled 500 records per a poll from Kafka, which ended up "10,280,000" calls to get() which always calls getOrRetrieveConsumer(). A single call of KafkaTokenUtil.needTokenUpdate() wouldn't add significant overhead, but 10,000,000 calls make a significant difference. Assuming the case where delegation token is not applied, HadoopDelegationTokenManager.isServiceEnabled is the culprit on such huge overhead. We could probably resolve the issue via short-term solution & long-term solution. * short-term solution: change the order of check in needTokenUpdate, so that the performance hit is only affected when using delegation token. I'll raise a PR shortly. * long-term solution(s): 1) optimize HadoopDelegationTokenManager.isServiceEnabled 2) find a way to reduce the occurrence of checking necessarily of token update. Note that even with short-term solution, a slight performance hit is observed as it still does more things on the code path compared to Spark 2.4. Though I'd ignore it if it affects slightly, like less than 1%, or even slightly higher but the code addition is mandatory. was (Author: kabhwan): I've spent some time to trace the issue, and noticed SPARK-29054 (+SPARK-30495) caused performance regression (though the patch itself is doing the right thing). 
{code} private[kafka010] def getOrRetrieveConsumer(): InternalKafkaConsumer = { if (!_consumer.isDefined) { retrieveConsumer() } require(_consumer.isDefined, "Consumer must be defined") if (KafkaTokenUtil.needTokenUpdate(SparkEnv.get.conf, _consumer.get.kafkaParamsWithSecurity, _consumer.get.clusterConfig)) { logDebug("Cached consumer uses an old delegation token, invalidating.") releaseConsumer() consumerPool.invalidateKey(cacheKey) fetchedDataPool.invalidate(cacheKey) retrieveConsumer() } _consumer.get } {code} {code} def needTokenUpdate( sparkConf: SparkConf, params: ju.Map[String, Object], clusterConfig: Option[KafkaTokenClusterConf]): Boolean = { if (HadoopDelegationTokenManager.isServiceEnabled(sparkConf, "kafka") && clusterConfig.isDefined && params.containsKey(SaslConfigs.SASL_JAAS_CONFIG)) { logDebug("Delegation token used by connector, checking if uses the latest token.") val connectorJaasParams = params.get(SaslConfigs.SASL_JAAS_CONFIG).asInstanceOf[String] getTokenJaasParams(clusterConfig.get) != connectorJaasParams } else { false } } {code} {code} def isServiceEnabled(sparkConf: SparkConf,
[jira] [Updated] (SPARK-34020) IndexOutOfBoundsException on merge of two pyspark frames
[ https://issues.apache.org/jira/browse/SPARK-34020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darshat updated SPARK-34020: Affects Version/s: (was: 3.0.1) 3.0.0 > IndexOutOfBoundsException on merge of two pyspark frames > > > Key: SPARK-34020 > URL: https://issues.apache.org/jira/browse/SPARK-34020 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Darshat >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34020) IndexOutOfBoundsException on merge of two pyspark frames
Darshat created SPARK-34020: --- Summary: IndexOutOfBoundsException on merge of two pyspark frames Key: SPARK-34020 URL: https://issues.apache.org/jira/browse/SPARK-34020 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.1 Reporter: Darshat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259399#comment-17259399 ] Jungtaek Lim commented on SPARK-33635: -- I've spent some time to trace the issue, and noticed SPARK-29054 (+SPARK-30495) caused performance regression (though the patch itself is doing the right thing). {code} private[kafka010] def getOrRetrieveConsumer(): InternalKafkaConsumer = { if (!_consumer.isDefined) { retrieveConsumer() } require(_consumer.isDefined, "Consumer must be defined") if (KafkaTokenUtil.needTokenUpdate(SparkEnv.get.conf, _consumer.get.kafkaParamsWithSecurity, _consumer.get.clusterConfig)) { logDebug("Cached consumer uses an old delegation token, invalidating.") releaseConsumer() consumerPool.invalidateKey(cacheKey) fetchedDataPool.invalidate(cacheKey) retrieveConsumer() } _consumer.get } {code} {code} def needTokenUpdate( sparkConf: SparkConf, params: ju.Map[String, Object], clusterConfig: Option[KafkaTokenClusterConf]): Boolean = { if (HadoopDelegationTokenManager.isServiceEnabled(sparkConf, "kafka") && clusterConfig.isDefined && params.containsKey(SaslConfigs.SASL_JAAS_CONFIG)) { logDebug("Delegation token used by connector, checking if uses the latest token.") val connectorJaasParams = params.get(SaslConfigs.SASL_JAAS_CONFIG).asInstanceOf[String] getTokenJaasParams(clusterConfig.get) != connectorJaasParams } else { false } } {code} {code} def isServiceEnabled(sparkConf: SparkConf, serviceName: String): Boolean = { val key = providerEnabledConfig.format(serviceName) deprecatedProviderEnabledConfigs.foreach { pattern => val deprecatedKey = pattern.format(serviceName) if (sparkConf.contains(deprecatedKey)) { logWarning(s"${deprecatedKey} is deprecated. 
Please use ${key} instead.") } } val isEnabledDeprecated = deprecatedProviderEnabledConfigs.forall { pattern => sparkConf .getOption(pattern.format(serviceName)) .map(_.toBoolean) .getOrElse(true) } sparkConf .getOption(key) .map(_.toBoolean) .getOrElse(isEnabledDeprecated) } {code} With my test data creator, Spark pulled 500 records per a poll from Kafka, which ended up "10,280,000" calls to get() which always calls getOrRetrieveConsumer(). A single call of KafkaTokenUtil.needTokenUpdate() wouldn't add significant overhead, but 10,000,000 calls make a significant difference. Assuming the case where delegation token is not applied, HadoopDelegationTokenManager.isServiceEnabled is the culprit on such huge overhead. We could probably resolve the issue via short-term solution & long-term solution. * short-term solution: change the order of check in needTokenUpdate, so that the performance hit is only affected when using delegation token. I'll raise a PR shortly. * long-term solution(s): 1) optimize HadoopDelegationTokenManager.isServiceEnabled 2) find a way to reduce the occurrence of checking necessarily of token update. Note that even with short-term solution, a slight performance hit is observed as it still does more things on the code path compared to Spark 2.4. Though I'd ignore it if it affects slightly, like less than 1%, or even slightly higher but the code addition is mandatory. > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5 node system. A simple data row of csv data in > kafka, evenly distributed between the partitions. > Open JDK 1.8.0.252 > Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to > a distinct NUMA group) > kafka (v 2.3.1) cluster - 5 nodes (1 broker per node). 
> Centos 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have; I have tested on all of them and > they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Major > > I have observed a slowdown in the reading of data from kafka on all of our > systems when migrating from spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1). > I have created a sample project to isolate the problem as much as possible, > with just a read of all data from a kafka topic (see > [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, > I get a stable read rate of 1,120,000 (1.12 mill) rows per second. > With 3.0.0 or 3.0.1, across multiple runs, > I get a stable read rate of 632,000 (0.632 mil) rows per second. > That represents a *44% loss in performance*, which is a lot. > I have been
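The scale argument in Jungtaek Lim's comment above can be made concrete with a rough cost model (the per-call cost below is an assumed figure for illustration; only the call count comes from the comment):

```python
# Back-of-envelope overhead estimate for the per-record token check.
calls = 10_280_000          # get() invocations reported in the test run
cost_per_call_us = 1.0      # assumed cost of one needTokenUpdate() check, in µs

overhead_s = calls * cost_per_call_us / 1_000_000
print(f"~{overhead_s:.2f} s of added work")  # ~10.28 s at 1 µs per call
```

Even a sub-microsecond check becomes visible at ten million calls, which is why the proposed short-term fix reorders the checks so the expensive isServiceEnabled lookup only runs when a delegation token is actually configured.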
[jira] [Assigned] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33541: Assignee: Apache Spark > Group exception messages in catalyst/expressions > > > Key: SPARK-33541 > URL: https://issues.apache.org/jira/browse/SPARK-33541 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions' > || Filename || Count || > | Cast.scala| 18 | > | ExprUtils.scala | 2 | > | Expression.scala | 8 | > | InterpretedUnsafeProjection.scala | 1 | > | ScalaUDF.scala| 2 | > | SelectedField.scala | 3 | > | SubExprEvaluationRuntime.scala| 1 | > | arithmetic.scala | 8 | > | collectionOperations.scala| 4 | > | complexTypeExtractors.scala | 3 | > | csvExpressions.scala | 3 | > | datetimeExpressions.scala | 4 | > | decimalExpressions.scala | 2 | > | generators.scala | 2 | > | higherOrderFunctions.scala| 6 | > | jsonExpressions.scala | 2 | > | literals.scala| 3 | > | misc.scala| 2 | > | namedExpressions.scala| 1 | > | ordering.scala| 1 | > | package.scala | 1 | > | regexpExpressions.scala | 1 | > | stringExpressions.scala | 1 | > | windowExpressions.scala | 5 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate' > || Filename|| Count || > | ApproximatePercentile.scala | 2 | > | HyperLogLogPlusPlus.scala | 1 | > | Percentile.scala| 1 | > | interfaces.scala| 2 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen' > || Filename|| Count || > | CodeGenerator.scala | 5 | > | javaCode.scala | 1 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects' > || Filename || Count || > | objects.scala | 12 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33541: Assignee: (was: Apache Spark) > Group exception messages in catalyst/expressions > > > Key: SPARK-33541 > URL: https://issues.apache.org/jira/browse/SPARK-33541 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions' > || Filename || Count || > | Cast.scala| 18 | > | ExprUtils.scala | 2 | > | Expression.scala | 8 | > | InterpretedUnsafeProjection.scala | 1 | > | ScalaUDF.scala| 2 | > | SelectedField.scala | 3 | > | SubExprEvaluationRuntime.scala| 1 | > | arithmetic.scala | 8 | > | collectionOperations.scala| 4 | > | complexTypeExtractors.scala | 3 | > | csvExpressions.scala | 3 | > | datetimeExpressions.scala | 4 | > | decimalExpressions.scala | 2 | > | generators.scala | 2 | > | higherOrderFunctions.scala| 6 | > | jsonExpressions.scala | 2 | > | literals.scala| 3 | > | misc.scala| 2 | > | namedExpressions.scala| 1 | > | ordering.scala| 1 | > | package.scala | 1 | > | regexpExpressions.scala | 1 | > | stringExpressions.scala | 1 | > | windowExpressions.scala | 5 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate' > || Filename|| Count || > | ApproximatePercentile.scala | 2 | > | HyperLogLogPlusPlus.scala | 1 | > | Percentile.scala| 1 | > | interfaces.scala| 2 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen' > || Filename|| Count || > | CodeGenerator.scala | 5 | > | javaCode.scala | 1 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects' > || Filename || Count || > | objects.scala | 12 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259398#comment-17259398 ] Apache Spark commented on SPARK-33541: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/31052 > Group exception messages in catalyst/expressions > > > Key: SPARK-33541 > URL: https://issues.apache.org/jira/browse/SPARK-33541 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions' > || Filename || Count || > | Cast.scala| 18 | > | ExprUtils.scala | 2 | > | Expression.scala | 8 | > | InterpretedUnsafeProjection.scala | 1 | > | ScalaUDF.scala| 2 | > | SelectedField.scala | 3 | > | SubExprEvaluationRuntime.scala| 1 | > | arithmetic.scala | 8 | > | collectionOperations.scala| 4 | > | complexTypeExtractors.scala | 3 | > | csvExpressions.scala | 3 | > | datetimeExpressions.scala | 4 | > | decimalExpressions.scala | 2 | > | generators.scala | 2 | > | higherOrderFunctions.scala| 6 | > | jsonExpressions.scala | 2 | > | literals.scala| 3 | > | misc.scala| 2 | > | namedExpressions.scala| 1 | > | ordering.scala| 1 | > | package.scala | 1 | > | regexpExpressions.scala | 1 | > | stringExpressions.scala | 1 | > | windowExpressions.scala | 5 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate' > || Filename|| Count || > | ApproximatePercentile.scala | 2 | > | HyperLogLogPlusPlus.scala | 1 | > | Percentile.scala| 1 | > | interfaces.scala| 2 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen' > || Filename|| Count || > | CodeGenerator.scala | 5 | > | javaCode.scala | 1 | > '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects' > || Filename || Count || > | objects.scala | 12 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32685) Script transform hive serde default field.delimit is '\t'
[ https://issues.apache.org/jira/browse/SPARK-32685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259395#comment-17259395 ] Apache Spark commented on SPARK-32685: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31051 > Script transform hive serde default field.delimit is '\t' > - > > Key: SPARK-32685 > URL: https://issues.apache.org/jira/browse/SPARK-32685 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > > {code:java} > select split(value, "\t") from ( > SELECT TRANSFORM(a, b, c, null) > USING 'cat' > FROM (select 1 as a, 2 as b, 3 as c) t > ) temp; > result is : > _c0 > ["2","3","\\N"]{code} > > {code:java} > select split(value, "\t") from ( > SELECT TRANSFORM(a, b, c, null) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > USING 'cat' > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > WITH SERDEPROPERTIES ( >'serialization.last.column.takes.rest' = 'true' > ) > FROM (select 1 as a, 2 as b, 3 as c) t > ) temp; > result is : > _c0 > ["2","3","\\N"]{code} > > > > {code:java} > select split(value, "\t") from ( > SELECT TRANSFORM(a, b, c, null) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > USING 'cat' > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > FROM (select 1 as a, 2 as b, 3 as c) t > ) temp; > result is : > _c0 > ["2"] > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33982) Sparksql does not support when the inserted table is a read table
[ https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33982. -- Resolution: Duplicate > Sparksql does not support when the inserted table is a read table > - > > Key: SPARK-33982 > URL: https://issues.apache.org/jira/browse/SPARK-33982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: hao >Priority: Major > > When the inserted table is a read table, sparksql will throw an error - > > org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is > also being read from.; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33982) Sparksql does not support when the inserted table is a read table
[ https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259391#comment-17259391 ] hao commented on SPARK-33982: - [~hyukjin.kwon] Oh, so you can read Chinese! Yes, this is the same issue >.< > Sparksql does not support when the inserted table is a read table > - > > Key: SPARK-33982 > URL: https://issues.apache.org/jira/browse/SPARK-33982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: hao >Priority: Major > > When the inserted table is a read table, sparksql will throw an error - > > org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is > also being read from.; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] maple edited comment on SPARK-33981 at 1/6/21, 3:20 AM: I tried it in Spark 2.4.7; the issue still exists. !image-2021-01-06-11-20-49-804.png! !screenshot-1.png!!image-2021-01-06-11-19-09-849.png! was (Author: 995582386): I try in Spark 2.4.7,it still exists. !screenshot-1.png!!image-2021-01-06-11-19-09-849.png! > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png, > image-2021-01-06-11-19-09-849.png, image-2021-01-06-11-20-49-804.png, > screenshot-1.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259377#comment-17259377 ] maple commented on SPARK-33981: --- I tried it in Spark 2.4.7; the issue still exists. !screenshot-1.png!!image-2021-01-06-11-19-09-849.png! > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png, > image-2021-01-06-11-19-09-849.png, screenshot-1.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] maple updated SPARK-33981: -- Attachment: image-2021-01-06-11-19-09-849.png
[jira] [Updated] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] maple updated SPARK-33981: -- Attachment: screenshot-1.png
[jira] [Resolved] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted
[ https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33029. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30954 [https://github.com/apache/spark/pull/30954] > Standalone mode blacklist executors page UI marks driver as blacklisted > --- > > Key: SPARK-33029 > URL: https://issues.apache.org/jira/browse/SPARK-33029 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot > 2020-09-29 at 1.53.37 PM.png > > > I am running a Spark shell on a one-node standalone cluster. I noticed that > the Executors page UI was marking the driver as blacklisted for the running > stage. A screenshot is attached. > Also, in my case one of the executors died and it doesn't seem like the > scheduler picked up the new one. It doesn't show up on the Stages page; it > just shows as active, but none of the tasks ran there. > > You can reproduce this by starting a master and slave on a single node, then > launching a shell where you will get multiple executors (in this case I got > 3): > $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 > --conf spark.blacklist.enabled=true > > From the shell, run: > {code:java} > import org.apache.spark.TaskContext > val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it => > val context = TaskContext.get() > if (context.attemptNumber() < 2) { > throw new Exception("test attempt num") > } > it > }{code}
[jira] [Assigned] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted
[ https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33029: - Assignee: Baohe Zhang
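The Scala repro above forces the first two attempts of every task to fail so the blacklist tracker kicks in. For illustration, the same attempt-counting pattern can be sketched in plain Python (a toy retry loop with made-up names, not Spark's scheduler):

```python
# Toy model of Spark's per-task attempt counter: a task body that
# raises until its attempt number reaches 2, plus a retry loop.
# Illustration only; Spark reschedules failed attempts across executors.

def flaky_task(attempt_number: int, partition):
    # Mirrors `if (context.attemptNumber() < 2) throw ...` in the repro.
    if attempt_number < 2:
        raise RuntimeError("test attempt num")
    return list(partition)

def run_with_retries(partition, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            return flaky_task(attempt, partition), attempt
        except RuntimeError:
            continue  # the scheduler would retry, possibly elsewhere
    raise RuntimeError("task failed permanently")

result, attempts_used = run_with_retries(range(1, 6))
# Succeeds on attempt number 2, i.e. the third try.
```

With blacklisting enabled, Spark additionally tracks which executors the failed attempts ran on, which is where the driver was being mislabeled in this bug.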
[jira] [Resolved] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33986. -- Resolution: Invalid > Spark handle always return LOST status in standalone cluster mode with Spark > launcher > - > > Key: SPARK-33986 > URL: https://issues.apache.org/jira/browse/SPARK-33986 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 2.4.4 > Environment: apache hadoop 2.6.5 > apache spark 2.4.4 >Reporter: ZhongyuWang >Priority: Major > > I can use the launcher to submit a Spark app successfully in standalone > client / YARN client / YARN cluster mode and get the correct app status, but > when I submit a Spark app in standalone cluster mode, the Spark handle always > returns LOST status (once) while the app runs stably until FINISHED (the > handle never receives any state change information). I noticed that when I > submitted the app from code, after a while the SparkSubmit process suddenly > stopped. I checked the SparkSubmit log (launcher redirect log); it doesn't > have any useful information. > This is my pseudo code: > {code:java} > SparkAppHandle handle = launcher.startApplication(new > SparkAppHandle.Listener() { > @Override > public void stateChanged(SparkAppHandle handle) { > stateChangedHandle(handle.getAppId(), jobId, code, execId, > handle.getState(), driverInfo, request, infoLog, errorLog); > } > @Override > public void infoChanged(SparkAppHandle handle) { > stateChangedHandle(handle.getAppId(), jobId, code, execId, > handle.getState(), driverInfo, request, infoLog, errorLog); > } > });{code} > Any idea? Thanks.
[jira] [Commented] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259370#comment-17259370 ] Hyukjin Kwon commented on SPARK-33986: -- Then it sounds like an environment problem. For a question, I would encourage you to ask it on the mailing list (https://spark.apache.org/community.html) first before filing it as an issue.
[jira] [Commented] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259369#comment-17259369 ] Apache Spark commented on SPARK-34012: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31050 > Keep behavior consistent when conf > `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration > guide > - > > Key: SPARK-34012 > URL: https://issues.apache.org/jira/browse/SPARK-34012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0, 3.2.0 > > > According to > [https://github.com/apache/spark/pull/29087#issuecomment-754389257], > fix the bug and add a UT.
[jira] [Commented] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259368#comment-17259368 ] Apache Spark commented on SPARK-34012: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31050
[jira] [Resolved] (SPARK-34015) SparkR partition timing summary reports input time correctly
[ https://issues.apache.org/jira/browse/SPARK-34015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34015. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/31021 > SparkR partition timing summary reports input time correctly > > > Key: SPARK-34015 > URL: https://issues.apache.org/jira/browse/SPARK-34015 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.2, 3.0.1 > Environment: Observed on CentOS-7 running Spark 2.3.1 and on my Mac > running master >Reporter: Tom Howland >Assignee: Tom Howland >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Original Estimate: 0h > Remaining Estimate: 0h > > When SparkR is run at log level INFO, a summary of how the worker spent its > time processing the partition is printed. There is a logic error where it > over-reports the time spent inputting rows. > In detail: the variable inputElap is used in a wider context to mark the > beginning of reading rows, but in the part changed here it was also used as a > local variable for measuring compute time. Thus, the error is not observable > if there is only one group per partition, which is what you get in unit > tests. > For our application, here's what a log entry looked like before these changes > were applied: > {{20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s, > broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, > write-output = 0.020 s, total = 1021.546 s}} > This indicates that we're spending more time reading rows than operating on > them. > After these changes, it looks like this: > {{20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s, > broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, > write-output = 0.045 s, total = 1812.553 s}}
[jira] [Assigned] (SPARK-34015) SparkR partition timing summary reports input time correctly
[ https://issues.apache.org/jira/browse/SPARK-34015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34015: Assignee: Tom Howland
[jira] [Updated] (SPARK-34015) SparkR partition timing summary reports input time correctly
[ https://issues.apache.org/jira/browse/SPARK-34015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34015: - Fix Version/s: 3.1.1 3.2.0
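The SPARK-34015 description above says the bug came from reusing the input-start marker as a per-group compute timer, and that it is invisible with a single group per partition. That logic error can be reconstructed with a toy fake-clock accumulator in Python (illustrative names mirroring inputElap; this is not SparkR's actual worker code):

```python
# Toy reconstruction of the reported timing bug: input_start marks the
# beginning of reading rows, but the buggy variant reuses the compute
# timer's start as the next group's input start, mis-attributing each
# group's compute time (after the first) to read-input.

def process(groups, buggy):
    clock = [0.0]                        # fake clock, advanced manually
    now = lambda: clock[0]

    input_time = 0.0
    compute_time = 0.0
    input_start = now()
    for read_cost, compute_cost in groups:
        clock[0] += read_cost            # reading this group's rows
        input_time += now() - input_start
        compute_start = now()
        clock[0] += compute_cost         # computing on the group
        compute_time += now() - compute_start
        if buggy:
            input_start = compute_start  # wrong: compute time leaks into input
        else:
            input_start = now()          # correct: next read starts after compute
    return input_time, compute_time

two_groups = [(1.0, 10.0), (1.0, 10.0)]
one_group = [(1.0, 10.0)]
# With two groups the buggy variant reports read-input = 12.0 instead
# of 2.0; with one group both variants agree, matching the report that
# the error is not observable in unit tests with one group.
```

This mirrors the before/after log entries quoted in the issue, where read-input dropped from 529 s to 120 s once the fix landed.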
[jira] [Updated] (SPARK-34011) ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34011: - Fix Version/s: 3.2.0 > ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache > > > Key: SPARK-34011 > URL: https://issues.apache.org/jira/browse/SPARK-34011 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > Fix For: 3.2.0 > > > Here is an example to reproduce the issue: > {code:sql} > spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED > BY (part0); > spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0; > spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1; > spark-sql> CACHE TABLE tbl1; > spark-sql> SELECT * FROM tbl1; > 0 0 > 1 1 > spark-sql> ALTER TABLE tbl1 PARTITION (part0 = 0) RENAME TO PARTITION (part0 > = 2); > spark-sql> SELECT * FROM tbl1; > 0 0 > 1 1 > {code}
[jira] [Resolved] (SPARK-34011) ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34011. -- Resolution: Fixed
[jira] [Updated] (SPARK-34011) ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34011: - Fix Version/s: (was: 3.2.0) (was: 3.0.2) (was: 3.1.0)
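The SPARK-34011 repro shows a cached table still answering queries from pre-rename data. The failure pattern, and why the rename must invalidate the cached entry, can be sketched with a toy partitioned table and naive query cache in Python (hypothetical classes; this is not Spark's cache manager):

```python
# Toy partitioned "table" with a naive full-scan cache. Renaming a
# partition without invalidating the cache reproduces the stale read
# from the repro; invalidating on rename is what the fix amounts to.

class ToyTable:
    def __init__(self):
        self.partitions = {}   # part0 value -> list of col0 rows
        self._cache = None     # cached full-scan result

    def insert(self, part0, rows):
        self.partitions[part0] = rows

    def scan(self):
        # SELECT * FROM tbl: (col0, part0) pairs, served from cache.
        if self._cache is None:
            self._cache = [(row, p)
                           for p, rows in sorted(self.partitions.items())
                           for row in rows]
        return self._cache

    def rename_partition(self, old, new, refresh_cache):
        self.partitions[new] = self.partitions.pop(old)
        if refresh_cache:
            self._cache = None  # the behavior the fix adds

def demo(refresh_cache):
    t = ToyTable()
    t.insert(0, [0])
    t.insert(1, [1])
    t.scan()  # CACHE TABLE: populate the cache
    t.rename_partition(0, 2, refresh_cache)
    return t.scan()

stale = demo(refresh_cache=False)   # still reports part0 = 0
fresh = demo(refresh_cache=True)    # reflects the rename to part0 = 2
```

In the buggy variant the scan after the rename is identical to the one before it, exactly as in the `SELECT * FROM tbl1` output quoted in the issue.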
[jira] [Commented] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259355#comment-17259355 ] Apache Spark commented on SPARK-34012: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31049
[jira] [Updated] (SPARK-34002) Broken UDF Encoding
[ https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34002: - Component/s: (was: Spark Core) SQL > Broken UDF Encoding > --- > > Key: SPARK-34002 > URL: https://issues.apache.org/jira/browse/SPARK-34002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 > Environment: Windows 10 Surface book 3 > Local Spark > IntelliJ Idea > >Reporter: Mark Hamilton >Priority: Major > > UDFs can behave differently depending on whether a dataframe is cached, > despite the dataframe being identical > > Repro: > > {code:java} > import org.apache.spark.sql.expressions.UserDefinedFunction > import org.apache.spark.sql.functions.{col, udf} > case class Bar(a: Int) > > import spark.implicits._ > def f1(bar: Bar): Option[Bar] = { > None > } > def f2(bar: Bar): Option[Bar] = { > Option(bar) > } > val udf1: UserDefinedFunction = udf(f1 _) > val udf2: UserDefinedFunction = udf(f2 _) > // Commenting in the cache will make this example work > val df = (1 to 10).map(i => Tuple1(Bar(1))).toDF("c0")//.cache() > val newDf = df > .withColumn("c1", udf1(col("c0"))) > .withColumn("c2", udf2(col("c1"))) > newDf.show() > {code} > > Error: > (truncated IntelliJ test-runner launch log; the java command line and > classpath listing carry no diagnostic information)
[jira] [Issue Comment Deleted] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhongyuWang updated SPARK-33986: Comment: was deleted (was: Yes, other applications have the same scenario (in cluster mode, after a while, the SparkSubmit process suddenly stopped). When I try to submit the SparkPi example with the spark-submit shell, the SparkSubmit process also stops after submitting the job to the Spark cluster (the job is not supervised).)
[jira] [Commented] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259348#comment-17259348 ] ZhongyuWang commented on SPARK-33986: - Yes, other applications have the same scenario (in cluster mode, after a while, the SparkSubmit process suddenly stopped). When I try to submit the SparkPi example with the spark-submit shell, the SparkSubmit process also stops after submitting the job to the Spark cluster (the job is not supervised).
[jira] [Commented] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259346#comment-17259346 ] ZhongyuWang commented on SPARK-33986: - Yes, other applications have the same scenario (in cluster mode, after a while, the SparkSubmit process suddenly stopped). When I try to submit the SparkPi example with the spark-submit shell, the SparkSubmit process also stops after submitting the job to the Spark cluster (the job is not supervised).
[jira] [Commented] (SPARK-34019) Keep same quantiles of UI and restful API
[ https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259344#comment-17259344 ] Apache Spark commented on SPARK-34019: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31048 > Keep same quantiles of UI and restful API > - > > Key: SPARK-34019 > URL: https://issues.apache.org/jira/browse/SPARK-34019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34019) Keep same quantiles of UI and restful API
[ https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34019: Assignee: (was: Apache Spark) > Keep same quantiles of UI and restful API > - > > Key: SPARK-34019 > URL: https://issues.apache.org/jira/browse/SPARK-34019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34019) Keep same quantiles of UI and restful API
[ https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259343#comment-17259343 ] Apache Spark commented on SPARK-34019: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31048 > Keep same quantiles of UI and restful API > - > > Key: SPARK-34019 > URL: https://issues.apache.org/jira/browse/SPARK-34019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34019) Keep same quantiles of UI and restful API
[ https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34019: Assignee: Apache Spark > Keep same quantiles of UI and restful API > - > > Key: SPARK-34019 > URL: https://issues.apache.org/jira/browse/SPARK-34019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34019) Keep same quantiles of UI and restful API
angerszhu created SPARK-34019: - Summary: Keep same quantiles of UI and restful API Key: SPARK-34019 URL: https://issues.apache.org/jira/browse/SPARK-34019 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu Keep same quantiles of UI and restful API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33986) Spark handle always return LOST status in standalone cluster mode with Spark launcher
[ https://issues.apache.org/jira/browse/SPARK-33986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259338#comment-17259338 ] Hyukjin Kwon commented on SPARK-33986: -- Do other applications work fine in cluster modes? > Spark handle always return LOST status in standalone cluster mode with Spark > launcher > - > > Key: SPARK-33986 > URL: https://issues.apache.org/jira/browse/SPARK-33986 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 2.4.4 > Environment: apache hadoop 2.6.5 > apache spark 2.4.4 >Reporter: ZhongyuWang >Priority: Major > > I can submit a Spark app successfully in standalone client / YARN client / YARN cluster mode and get the correct app status, but when I submit a Spark app in standalone cluster mode, the Spark handle always returns LOST status (once) and the app runs stably until FINISHED (the handle never receives any state-change information). I noticed that when I submitted the app from code, after a while the SparkSubmit process suddenly stopped. I checked the SparkSubmit log (launcher redirect log); it doesn't contain any useful information. > This is my pseudocode: > {code:java} > SparkAppHandle handle = launcher.startApplication(new > SparkAppHandle.Listener() { > @Override > public void stateChanged(SparkAppHandle handle) { > stateChangedHandle(handle.getAppId(), jobId, code, execId, > handle.getState(), driverInfo, request, infoLog, errorLog); > } > @Override > public void infoChanged(SparkAppHandle handle) { > stateChangedHandle(handle.getAppId(), jobId, code, execId, > handle.getState(), driverInfo, request, infoLog, errorLog); > } > });{code} > Any ideas? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34002) Broken UDF Encoding
[ https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34002: - Priority: Major (was: Critical) > Broken UDF Encoding > --- > > Key: SPARK-34002 > URL: https://issues.apache.org/jira/browse/SPARK-34002 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 > Environment: Windows 10 Surface book 3 > Local Spark > IntelliJ Idea > >Reporter: Mark Hamilton >Priority: Major > > UDFs can behave differently depending on whether a dataframe is cached, despite the dataframe being identical > > Repro: > > {code:java} > import org.apache.spark.sql.expressions.UserDefinedFunction > import org.apache.spark.sql.functions.{col, udf} > case class Bar(a: Int) > > import spark.implicits._ > def f1(bar: Bar): Option[Bar] = { > None > } > def f2(bar: Bar): Option[Bar] = { > Option(bar) > } > val udf1: UserDefinedFunction = udf(f1 _) > val udf2: UserDefinedFunction = udf(f2 _) > // Commenting in the cache will make this example work > val df = (1 to 10).map(i => Tuple1(Bar(1))).toDF("c0")//.cache() > val newDf = df > .withColumn("c1", udf1(col("c0"))) > .withColumn("c2", udf2(col("c1"))) > newDf.show() > {code} > > Error: > Testing started at 12:58 AM ... > (The rest of the attached output is an IntelliJ IDEA java command line and classpath listing; the exception text itself was truncated from the report.)
[jira] [Commented] (SPARK-33981) SparkUI: Storage page is empty even if cached
[ https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259336#comment-17259336 ] Hyukjin Kwon commented on SPARK-33981: -- [~995582386], Spark 2.3.3 is EOL. Can you see if the issue still exists in higher Spark versions? > SparkUI: Storage page is empty even if cached > - > > Key: SPARK-33981 > URL: https://issues.apache.org/jira/browse/SPARK-33981 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 2.3.3 > Environment: spark 2.3.3 >Reporter: maple >Priority: Major > Attachments: ba5a987152c6270f34b968bd89ca36a.png > > > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val rdd = sc.parallelize(1 to 10, > 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize > at :25 > scala> rdd.count > res0: Long = 10 > but SparkUI storage page is empty -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33982) Sparksql does not support when the inserted table is a read table
[ https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259335#comment-17259335 ] Hyukjin Kwon commented on SPARK-33982: -- [~hao.duan] how is it different from SPARK-34006? > Sparksql does not support when the inserted table is a read table > - > > Key: SPARK-33982 > URL: https://issues.apache.org/jira/browse/SPARK-33982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: hao >Priority: Major > > When the table being inserted into is also read in the same query, Spark SQL throws an error: > > org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is > also being read from. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34016) SparkR not on CRAN (again)
[ https://issues.apache.org/jira/browse/SPARK-34016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259334#comment-17259334 ] Hyukjin Kwon commented on SPARK-34016: -- cc [~dongjoon], [~felixcheung], [~shivaram] FYI > SparkR not on CRAN (again) > -- > > Key: SPARK-34016 > URL: https://issues.apache.org/jira/browse/SPARK-34016 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 2.4.7 >Reporter: Paul Johnson >Priority: Major > > SparkR is removed from CRAN > [https://cran.r-project.org/web/packages/SparkR/index.html#:~:text=Package%20'SparkR'%20was%20removed%20from,be%20obtained%20from%20the%20archive.=A%20summary%20of%20the%20most,to%20link%20to%20this%20page.] > Is there anything we can do to help? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34018) finishedExecutorWithRunningSidecar in test utils depends on nulls matching
Holden Karau created SPARK-34018: Summary: finishedExecutorWithRunningSidecar in test utils depends on nulls matching Key: SPARK-34018 URL: https://issues.apache.org/jira/browse/SPARK-34018 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.0, 3.2.0, 3.1.1 Reporter: Holden Karau Assignee: Holden Karau Currently the test passes only because we match on a null container-state string. To fix this, label both statuses and ensure the ExecutorPodSnapshot starts with the default config so that they match. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
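The hazard SPARK-34018 describes can be sketched without Kubernetes or Spark. This is a minimal illustration, not Spark's actual test code; the class and method names here are hypothetical. `Objects.equals(null, null)` is true, so a comparison between an expected state that was never set and a status that was never labeled "matches" vacuously:

```java
import java.util.Objects;

public class NullMatch {
    // Hypothetical stand-in for comparing a pod's container-state label
    // against the state a test expects.
    static boolean statusMatches(String expected, String actual) {
        // Objects.equals(null, null) returns true, so two *unset* states
        // match even though neither was ever labeled.
        return Objects.equals(expected, actual);
    }

    public static void main(String[] args) {
        String expectedState = null; // the test never set an expected state
        String actualState = null;   // the status object was never labeled either
        System.out.println(statusMatches(expectedState, actualState)); // true: passes vacuously
        System.out.println(statusMatches("Running", actualState));     // false once one side is labeled
    }
}
```

Labeling both sides explicitly (as the ticket proposes) turns the vacuous pass into a real assertion, because the comparison then checks two concrete strings.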
[jira] [Updated] (SPARK-33935) Fix CBOs cost function
[ https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33935: - Fix Version/s: 2.4.8 > Fix CBOs cost function > --- > > Key: SPARK-33935 > URL: https://issues.apache.org/jira/browse/SPARK-33935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.1, 3.1.0, 3.2.0 >Reporter: Tanel Kiis >Assignee: Tanel Kiis >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0, 3.2.0 > > > The parameter spark.sql.cbo.joinReorder.card.weight is documented as: > {code:title=spark.sql.cbo.joinReorder.card.weight} > The weight of cardinality (number of rows) for plan cost comparison in join > reorder: rows * weight + size * (1 - weight). > {code} > But in the implementation the formula is a bit different: > {code:title=Current implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > if (other.planCost.card == 0 || other.planCost.size == 0) { > false > } else { > val relativeRows = BigDecimal(this.planCost.card) / > BigDecimal(other.planCost.card) > val relativeSize = BigDecimal(this.planCost.size) / > BigDecimal(other.planCost.size) > relativeRows * conf.joinReorderCardWeight + > relativeSize * (1 - conf.joinReorderCardWeight) < 1 > } > } > {code} > This change has an unfortunate consequence: > given two plans A and B, A betterThan B and B betterThan A can both give > the same result (false). This happens when one plan has many rows with a small > size and the other has few rows with a large size. > Example values that exhibit this phenomenon with the default weight value (0.7): > A.card = 500, B.card = 300 > A.size = 30, B.size = 80 > Both A betterThan B and B betterThan A would have a score above 1 and would > return false. 
> A new implementation is proposed, that matches the documentation: > {code:title=Proposed implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > val oldCost = BigDecimal(this.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight) > val newCost = BigDecimal(other.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight) > newCost < oldCost > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
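The arithmetic in the ticket can be checked standalone. The sketch below replicates both comparison formulas in plain Java with `BigDecimal` (it is not Spark's actual `JoinPlan` code; the method names are illustrative). With the ticket's values (A.card = 500, A.size = 30, B.card = 300, B.size = 80, weight 0.7), the relative formula scores A-vs-B at about 1.28 and B-vs-A at about 1.22, so both comparisons return false, while the absolute formula gives cost(B) = 234 < cost(A) = 359 and correctly prefers B:

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class CboCost {
    // Current (relative) comparison: rows1/rows2 * w + size1/size2 * (1 - w) < 1.
    static boolean betterThanRelative(long card1, long size1, long card2, long size2, double w) {
        BigDecimal weight = BigDecimal.valueOf(w);
        BigDecimal relRows = BigDecimal.valueOf(card1)
                .divide(BigDecimal.valueOf(card2), MathContext.DECIMAL64);
        BigDecimal relSize = BigDecimal.valueOf(size1)
                .divide(BigDecimal.valueOf(size2), MathContext.DECIMAL64);
        BigDecimal score = relRows.multiply(weight)
                .add(relSize.multiply(BigDecimal.ONE.subtract(weight)));
        return score.compareTo(BigDecimal.ONE) < 0;
    }

    // Proposed (absolute) comparison matching the documented formula:
    // rows * w + size * (1 - w), lower cost wins.
    static boolean betterThanAbsolute(long card1, long size1, long card2, long size2, double w) {
        BigDecimal weight = BigDecimal.valueOf(w);
        BigDecimal cost1 = BigDecimal.valueOf(card1).multiply(weight)
                .add(BigDecimal.valueOf(size1).multiply(BigDecimal.ONE.subtract(weight)));
        BigDecimal cost2 = BigDecimal.valueOf(card2).multiply(weight)
                .add(BigDecimal.valueOf(size2).multiply(BigDecimal.ONE.subtract(weight)));
        return cost1.compareTo(cost2) < 0;
    }

    public static void main(String[] args) {
        // A: card=500, size=30; B: card=300, size=80; weight=0.7 (values from the ticket)
        System.out.println(betterThanRelative(500, 30, 300, 80, 0.7)); // false: score ~1.28 > 1
        System.out.println(betterThanRelative(300, 80, 500, 30, 0.7)); // false: score ~1.22 > 1
        System.out.println(betterThanAbsolute(300, 80, 500, 30, 0.7)); // true: 234 < 359
    }
}
```

The relative formula is not antisymmetric (neither plan can "win" when the ratios pull in opposite directions), whereas the absolute formula imposes a total order on costs, which is what join reordering needs.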
[jira] [Resolved] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-34012. -- Fix Version/s: 3.2.0 3.1.0 Assignee: angerszhu Resolution: Fixed Resolved by https://github.com/apache/spark/pull/31039 > Keep behavior consistent when conf > `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration > guide > - > > Key: SPARK-34012 > URL: https://issues.apache.org/jira/browse/SPARK-34012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0, 3.2.0 > > > According to > [https://github.com/apache/spark/pull/29087#issuecomment-754389257] > fix bug and add UT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34012) Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
[ https://issues.apache.org/jira/browse/SPARK-34012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34012: - Affects Version/s: 3.1.0 3.0.2 2.4.8 > Keep behavior consistent when conf > `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration > guide > - > > Key: SPARK-34012 > URL: https://issues.apache.org/jira/browse/SPARK-34012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: angerszhu >Priority: Major > > According to > [https://github.com/apache/spark/pull/29087#issuecomment-754389257] > fix bug and add UT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33833) Allow Spark Structured Streaming report Kafka Lag through Burrow
[ https://issues.apache.org/jira/browse/SPARK-33833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259280#comment-17259280 ] L. C. Hsieh commented on SPARK-33833: - Thanks [~kabhwan]. > Allow Spark Structured Streaming report Kafka Lag through Burrow > > > Key: SPARK-33833 > URL: https://issues.apache.org/jira/browse/SPARK-33833 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Sam Davarnia >Priority: Major > > Because structured streaming tracks Kafka offset consumption by itself, > It is not possible to track total Kafka lag using Burrow similar to DStreams > We have used Stream hooks as mentioned > [here|https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37] > > It would be great if Spark supports this feature out of the box. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33833) Allow Spark Structured Streaming report Kafka Lag through Burrow
[ https://issues.apache.org/jira/browse/SPARK-33833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259263#comment-17259263 ] Jungtaek Lim commented on SPARK-33833: -- (Just to be clear, I was talking about setting the group ID on purpose, not for SPARK-27549. If I were objecting to this, I wouldn't have created my own project. If you think this is good to have, please raise a discussion thread to gather consensus, so that we can make progress without the risk of another soft rejection.) > Allow Spark Structured Streaming report Kafka Lag through Burrow > > > Key: SPARK-33833 > URL: https://issues.apache.org/jira/browse/SPARK-33833 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Sam Davarnia >Priority: Major > > Because structured streaming tracks Kafka offset consumption by itself, > It is not possible to track total Kafka lag using Burrow similar to DStreams > We have used Stream hooks as mentioned > [here|https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37] > > It would be great if Spark supports this feature out of the box. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org