[jira] [Commented] (SPARK-26059) Spark standalone mode, does not correctly record a failed Spark Job.
[ https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702809#comment-16702809 ] Prashant Sharma commented on SPARK-26059:
-
This will be a Won't Fix, because fixing it would require:
1) Making a REST call on failure of a job when in client mode. Since the outcome of the REST call itself is not guaranteed, it can never be assured that the correct state of a failed application will always be reported.
2) A lot of other changes. For example, the standalone master currently does not track the progress of a job submitted in client mode; a new REST call would have to be introduced that allows marking the status of the job in client mode, etc.

> Spark standalone mode, does not correctly record a failed Spark Job.
>
> Key: SPARK-26059
> URL: https://issues.apache.org/jira/browse/SPARK-26059
> Project: Spark
> Issue Type: Bug
> Components: Deploy, Spark Core
> Affects Versions: 3.0.0
> Reporter: Prashant Sharma
> Priority: Major
>
> To reproduce, submit a failing job to the Spark standalone master. The status for the job is shown as FINISHED, irrespective of whether it failed or succeeded.
> EDIT: This happens only when deploy-mode is client; when deploy-mode is cluster, it works as expected.
> - Reported by: Surbhi Bakhtiyar.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26059) Spark standalone mode, does not correctly record a failed Spark Job.
[ https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702809#comment-16702809 ] Prashant Sharma edited comment on SPARK-26059 at 11/29/18 7:33 AM:
---
This will be a Won't Fix, because fixing it would require:
1) Making a REST call on failure of a job, i.e. on catching the exception. This bug applies only to client mode of the standalone deploy mode. Since the outcome of the REST call itself is not guaranteed, it can never be assured that the correct state of a failed application will always be reported.
2) A lot of other changes. For example, the standalone master currently does not track the progress of a job submitted in client mode; a new REST call would have to be introduced that allows marking the status of the job in client mode, etc.
[jira] [Resolved] (SPARK-26059) Spark standalone mode, does not correctly record a failed Spark Job.
[ https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-26059.
-
Resolution: Won't Fix
[jira] [Assigned] (SPARK-26211) Fix InSet for binary, and struct and array with null.
[ https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26211:
Assignee: (was: Apache Spark)

> Fix InSet for binary, and struct and array with null.
> -
>
> Key: SPARK-26211
> URL: https://issues.apache.org/jira/browse/SPARK-26211
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Takuya Ueshin
> Priority: Major
>
> Currently {{InSet}} doesn't work properly for binary type, or for struct and array types with a null value in the set.
> For binary type, the {{HashSet}} doesn't work properly for {{Array[Byte]}}; for struct and array types with a null value in the set, the {{ordering}} will throw an {{NPE}}.
[jira] [Assigned] (SPARK-26211) Fix InSet for binary, and struct and array with null.
[ https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26211:
Assignee: Apache Spark
[jira] [Commented] (SPARK-26211) Fix InSet for binary, and struct and array with null.
[ https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702801#comment-16702801 ] Apache Spark commented on SPARK-26211:
--
User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23176
[jira] [Created] (SPARK-26211) Fix InSet for binary, and struct and array with null.
Takuya Ueshin created SPARK-26211:
-
Summary: Fix InSet for binary, and struct and array with null.
Key: SPARK-26211
URL: https://issues.apache.org/jira/browse/SPARK-26211
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: Takuya Ueshin

Currently {{InSet}} doesn't work properly for binary type, or for struct and array types with a null value in the set.
For binary type, the {{HashSet}} doesn't work properly for {{Array[Byte]}}; for struct and array types with a null value in the set, the {{ordering}} will throw an {{NPE}}.
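The {{HashSet}}/{{Array[Byte]}} part of this bug is ordinary JVM behavior: arrays inherit identity-based {{hashCode}}/{{equals}} from {{Object}}, so a hash set of byte arrays can never find a structurally equal but distinct array. A minimal sketch outside Spark (plain Java, not the actual {{InSet}} code) illustrates it:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ByteArraySetDemo {
    public static void main(String[] args) {
        // Arrays use identity semantics for hashCode/equals, so a HashSet
        // of byte[] cannot locate a structurally equal array.
        Set<byte[]> set = new HashSet<>();
        set.add(new byte[]{1, 2, 3});
        System.out.println(set.contains(new byte[]{1, 2, 3})); // false: identity, not content

        // Content-based comparison has to be explicit, e.g. via Arrays.equals:
        boolean found = set.stream().anyMatch(a -> Arrays.equals(a, new byte[]{1, 2, 3}));
        System.out.println(found); // true
    }
}
```

This is why a set-membership implementation backed by a plain hash set needs either a wrapper with value semantics or an explicit content comparison for binary values.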
[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode
[ https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] indraneel r updated SPARK-26206:
Description:
Spark structured streaming with kafka integration fails in update mode with a compilation exception in code generation.
Here's the code that was executed:
{code:java}
override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()

  val kafkaParams = Map[String, String](
    "kafka.bootstrap.servers" -> "localhost:9092",
    "startingOffsets" -> "earliest",
    "subscribe" -> "test_events")

  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
      println(s"batch : ${batchId}")
      batch.show(false)
    }
    .outputMode("update")
    .start()

  query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, Column 18: A method named "putLong" is not declared in any enclosing class nor any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, Column 18: A method named "putLong" is not declared in any enclosing class nor any supertype, nor through a static import
  at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
  at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
  at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
  at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
  at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
  at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
  at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
  at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
  at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
  at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
  at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
  at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
  at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
  at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
  at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
  at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
  at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
  at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
  at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
  at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
  at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
  at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
  at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
  at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
  at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:432)
  at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:215)
  at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:411)
  at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:406)
  at
[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode
[ https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702757#comment-16702757 ] indraneel r commented on SPARK-26206:
-
[~kabhwan] Have added the details in the description.

> Spark structured streaming with kafka integration fails in update mode
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
> spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
> kafka version 1.1.1
> scala version : 2.12.7
> Reporter: indraneel r
> Priority: Blocker
[jira] [Created] (SPARK-26210) Streaming of syslog
Manoj created SPARK-26210:
-
Summary: Streaming of syslog
Key: SPARK-26210
URL: https://issues.apache.org/jira/browse/SPARK-26210
Project: Spark
Issue Type: Question
Components: Spark Submit
Affects Versions: 2.0.2
Reporter: Manoj

Hi Team,
We are able to consume the data from local hosts using the Spark consumer. We are planning to capture syslog from different hosts. Is there any mechanism by which a Spark process can listen for incoming syslog TCP messages? Also, please let us know whether Flume can be replaced by Spark code.
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702714#comment-16702714 ] xuqianjin commented on SPARK-23410:
---
[~maxgekk] Thank you very much. I'll get started on this as soon as possible.

> Unable to read jsons in charset different from UTF-8
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Maxim Gekk
> Priority: Major
> Attachments: utf16WithBOM.json
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such behavior breaks backward compatibility with Spark 2.2.1 and previous versions, which could read JSON files in UTF-16, UTF-32 and other encodings thanks to the auto-detection mechanism of the jackson library. We need to give users back the ability to read JSON files in a specified charset and/or to detect the charset automatically, as before.
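The incompatibility comes down to basic charset behavior on the JVM: UTF-16 bytes are simply not valid UTF-8. A minimal sketch outside Spark (plain Java with a tiny inline JSON string, rather than the attached utf16WithBOM.json) shows why a parser hard-wired to UTF-8 cannot read such files:

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // The same JSON document, encoded in UTF-16 (the encoder prepends a BOM).
        String json = "{\"a\":1}";
        byte[] utf16 = json.getBytes(StandardCharsets.UTF_16);

        // Decoding with the matching charset round-trips cleanly...
        System.out.println(new String(utf16, StandardCharsets.UTF_16).equals(json)); // true

        // ...but forcing UTF-8 onto UTF-16 bytes yields garbage (BOM bytes and
        // interleaved NULs), so the document no longer parses as JSON text.
        System.out.println(new String(utf16, StandardCharsets.UTF_8).equals(json)); // false
    }
}
```

Auto-detection (as jackson does via the BOM and the byte pattern of the first characters) or an explicit user-supplied charset are the two ways out, which is what the issue asks to restore.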
[jira] [Commented] (SPARK-26142) Implement shuffle read metrics in SQL
[ https://issues.apache.org/jira/browse/SPARK-26142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702687#comment-16702687 ] Apache Spark commented on SPARK-26142:
--
User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/23175

> Implement shuffle read metrics in SQL
> -
>
> Key: SPARK-26142
> URL: https://issues.apache.org/jira/browse/SPARK-26142
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Reynold Xin
> Assignee: Yuanjian Li
> Priority: Major
> Fix For: 3.0.0
[jira] [Assigned] (SPARK-26133) Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26133:
-
Assignee: Liang-Chi Hsieh

> Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder
> --
>
> Key: SPARK-26133
> URL: https://issues.apache.org/jira/browse/SPARK-26133
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: Liang-Chi Hsieh
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 3.0.0
>
> We deprecated OneHotEncoder in Spark 2.3.0 and introduced OneHotEncoderEstimator. In 3.0.0, we remove the deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder.
[jira] [Resolved] (SPARK-26133) Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-26133.
-
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23100
[https://github.com/apache/spark/pull/23100]
[jira] [Updated] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch
[ https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26207: -- Priority: Minor (was: Major) > add PowerIterationClustering (PIC) doc in 2.4 branch > - > > Key: SPARK-26207 > URL: https://issues.apache.org/jira/browse/SPARK-26207 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Minor > > add PIC documentation in docs/ml-clustering.md
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702515#comment-16702515 ] Reynold Xin commented on SPARK-24498: - Why don't we close the ticket? I heard we would get mixed performance, and it doesn't seem worth it to add 500 lines of code for something that has unclear value. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
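For reference, the JDK compiler discussed in this ticket is available at runtime through javax.tools. The sketch below only illustrates that mechanism (compiling a generated source string with the system Java compiler); it is not Spark's integration, and the class and method names are invented for the example.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.nio.file.Files;
import java.nio.file.Path;

public class JdkCodegenSketch {
    // Compiles a generated source string with the JDK compiler (javax.tools).
    // Returns the compiler's exit code (0 on success), or -1 on an I/O error.
    static int compileGenerated(String className, String source) {
        try {
            Path dir = Files.createTempDirectory("codegen");
            Path src = dir.resolve(className + ".java");
            Files.write(src, source.getBytes("UTF-8"));
            // Requires a JDK; returns null on a bare JRE without the compiler.
            JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
            return javac.run(null, null, null, "-d", dir.toString(), src.toString());
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String source = "public class Generated { public int answer() { return 42; } }";
        System.out.println(compileGenerated("Generated", source));  // 0 on a JDK
    }
}
```

Janino avoids the temp-file and process-level overhead by compiling in memory, which is part of the trade-off the ticket weighs.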
[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26188: Target Version/s: 2.4.1 > Spark 2.4.0 Partitioning behavior breaks backwards compatibility > > > Key: SPARK-26188 > URL: https://issues.apache.org/jira/browse/SPARK-26188 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Damien Doucet-Girard >Priority: Critical > > My team uses spark to partition and output parquet files to amazon S3. We > typically use 256 partitions, from 00 to ff. > We've observed that in spark 2.3.2 and prior, it reads the partitions as > strings by default. However, in spark 2.4.0 and later, the type of each > partition is inferred by default, and partitions such as 00 become 0 and 4d > become 4.0. > Here is a log sample of this behavior from one of our jobs: > 2.4.0: > {code:java} > 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, > range: 0-662, partition values: [0] > 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, > range: 0-662, partition values: [ef] > 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, > range: 0-662, partition values: [4a] > 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, > range: 0-662, partition values: [74] > 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, > range: 0-662, partition values: [f5] > 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, > range: 0-662, partition values: [50] > 18/11/27 14:02:34 INFO FileScanRDD: 
Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, > range: 0-662, partition values: [70] > 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, > range: 0-662, partition values: [b9] > 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, > range: 0-662, partition values: [d2] > 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, > range: 0-662, partition values: [51] > 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, > range: 0-662, partition values: [84] > 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, > range: 0-662, partition values: [b5] > 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, > range: 0-662, partition values: [88] > 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, > range: 0-662, partition values: [4.0] > 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, > range: 0-662, partition values: [ac] > 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, > range: 0-662, partition values: [24] > 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, > range: 0-662, partition values: [fd] > 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: > 
s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, > range: 0-662, partition values: [52] > 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, > range: 0-662, partition values: [ab] > 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, > range: 0-662, partition values: [f8] > 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, > range: 0-662, partition values: [7a] > 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, > range: 0-662,
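The [4.0] value for suffix=4d in the log above has a plausible mechanism: partition type inference tries numeric parses before falling back to string, and Java's Double.parseDouble accepts a trailing 'd' suffix, so "4d" parses as the double 4.0 while "ac" fails every numeric parse and stays a string. The sketch below is illustrative only, not Spark's actual inference code:

```java
public class PartitionInferenceSketch {
    // Mimics "try int, then double, else keep the raw string" type inference.
    static Object infer(String raw) {
        try { return Integer.parseInt(raw); } catch (NumberFormatException ignored) { }
        // Double.parseDouble accepts "4d": 'd' is a valid floating-point suffix.
        try { return Double.parseDouble(raw); } catch (NumberFormatException ignored) { }
        return raw;
    }

    public static void main(String[] args) {
        System.out.println(infer("00"));  // 0   -> leading zeros lost
        System.out.println(infer("4d"));  // 4.0 -> 'd' suffix parsed as double
        System.out.println(infer("ac"));  // ac  -> stays a string
    }
}
```

Supplying an explicit schema when reading, or disabling partition type inference via spark.sql.sources.partitionColumnTypeInference.enabled (if available in your version), are the usual ways to keep partitions as strings.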
[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26188: Component/s: (was: Spark Core) SQL
[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26188: Priority: Critical (was: Minor)
[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode
[ https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702465#comment-16702465 ] Jungtaek Lim commented on SPARK-26206: -- [~indraneelrr] Could you also provide how UserEvent is constructed, and the generated code if available? You can turn on debug logging for "org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator" and rerun to get the full source code. I think if an exception occurs it should log the generated code at INFO level. > Spark structured streaming with kafka integration fails in update mode > --- > > Key: SPARK-26206 > URL: https://issues.apache.org/jira/browse/SPARK-26206 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Operating system : MacOS Mojave > spark version : 2.4.0 > spark-sql-kafka-0-10 : 2.4.0 > kafka version 1.1.1 > scala version : 2.12.7 >Reporter: indraneel r >Priority: Blocker > > Spark structured streaming with kafka integration fails in update mode with > compilation exception in code generation.
> Here's the code that was executed: > {code:java} > // code placeholder > override def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder > .master("local[*]") > .appName("SparkStreamingTest") > .getOrCreate() > > val kafkaParams = Map[String, String]( > "kafka.bootstrap.servers" -> "localhost:9092", > "startingOffsets" -> "earliest", > "subscribe" -> "test_events") > > val schema = Encoders.product[UserEvent].schema > val query = spark.readStream.format("kafka") > .options(kafkaParams) > .load() > .selectExpr("CAST(value AS STRING) as message") > .select(from_json(col("message"), schema).as("json")) > .select("json.*") > .groupBy(window(col("event_time"), "10 minutes")) > .count() > .writeStream > .foreachBatch { (batch: Dataset[Row], batchId: Long) => > println(s"batch : ${batchId}") > batch.show(false) > } > .outputMode("update") > .start() > query.awaitTermination() > }{code} > It succeeds for batch 0 but fails for batch 1 with the following exception when > more data arrives in the stream. 
> {code:java} > 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 25, Column 18: A method named "putLong" is not declared in any enclosing > class nor any supertype, nor through a static import > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 25, Column 18: A method named "putLong" is not declared in any enclosing > class nor any supertype, nor through a static import > at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060) > at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421) > at > org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781) > at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760) > at > org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360) > at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487) > at 
org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357) > at >
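The debug logging suggested in the comment above maps to a log4j configuration entry along these lines (assuming a log4j.properties setup; adjust for your logging backend):

```properties
# conf/log4j.properties: dump the generated source from the codegen compiler
log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator=DEBUG
```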
[jira] [Created] (SPARK-26209) Allow for dataframe bucketization without Hive
Walt Elder created SPARK-26209: -- Summary: Allow for dataframe bucketization without Hive Key: SPARK-26209 URL: https://issues.apache.org/jira/browse/SPARK-26209 Project: Spark Issue Type: Improvement Components: Input/Output, Java API, SQL Affects Versions: 2.4.0 Reporter: Walt Elder As a DataFrame author, I can elect to bucketize my output without involving Hive or HMS, so that my hive-less environment can benefit from this query-optimization technique. https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397 identifies this as a shortcoming of the umbrella feature provided via SPARK-19256. In short, relying on Hive to store metadata *precludes* environments which don't have/use hive from making use of bucketization features.
[jira] [Commented] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend
[ https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702397#comment-16702397 ] Apache Spark commented on SPARK-26194: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/23174 > Support automatic spark.authenticate secret in Kubernetes backend > - > > Key: SPARK-26194 > URL: https://issues.apache.org/jira/browse/SPARK-26194 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Major > > Currently k8s inherits the default behavior for {{spark.authenticate}}, which > is that the user must provide an auth secret. > k8s doesn't have that requirement and could instead generate its own unique > per-app secret, and propagate it to executors.
[jira] [Assigned] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend
[ https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26194: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend
[ https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702395#comment-16702395 ] Apache Spark commented on SPARK-26194: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/23174
[jira] [Assigned] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend
[ https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26194: Assignee: Apache Spark
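The auto-generated per-app secret this ticket proposes could be produced along these lines. This is a hedged sketch of the idea only, not the actual Kubernetes backend change in the PR; the class and method names are invented here.

```java
import java.security.SecureRandom;

public class AppSecretSketch {
    // Generates a random hex secret, e.g. for propagating spark.authenticate
    // to executors without requiring the user to supply one.
    static String generateSecret(int numBytes) {
        byte[] raw = new byte[numBytes];
        new SecureRandom().nextBytes(raw);
        StringBuilder hex = new StringBuilder(numBytes * 2);
        for (byte b : raw) hex.append(String.format("%02x", b & 0xff));
        return hex.toString();
    }

    public static void main(String[] args) {
        // 32 random bytes -> 64 hex characters, unique per application.
        System.out.println(generateSecret(32).length());  // 64
    }
}
```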
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702363#comment-16702363 ] Dave DeCaprio commented on SPARK-23904: --- I've created a pull request that will address this. It limits the size of these debug strings. https://github.com/apache/spark/pull/23169 > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I created a question on > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark creates the text representation of the query in any case, even if I don't > need it. > That causes many garbage objects and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702362#comment-16702362 ] Dave DeCaprio commented on SPARK-25380: --- I've created a PR that should address this. It limits the size of text plans that are created. [https://github.com/apache/spark/pull/23169] > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during a long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyway. > Attached are screenshots of the OOM heap dump opened in JVisualVM.
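Both this PR and the one mentioned on SPARK-23904 take the same approach: cap the length of the generated plan description instead of materializing it in full. A minimal illustration of such a cap follows; the helper name and truncation marker are invented for this sketch and are not the PR's actual code.

```java
public class PlanStringSketch {
    // Truncates a (potentially huge) plan description to at most maxLength
    // characters, appending a marker so readers know output was dropped.
    static String truncatedPlanString(String plan, int maxLength) {
        if (plan.length() <= maxLength) return plan;
        return plan.substring(0, maxLength) + "... [truncated]";
    }

    public static void main(String[] args) {
        // A plan fragment repeated many times stands in for a huge query plan.
        String big = "Project [a, b]\n+- Filter (a > 1)\n".repeat(10_000);
        System.out.println(truncatedPlanString(big, 100).length());  // 115
    }
}
```

The point of the cap is that the driver never retains the multi-megabyte string at all, rather than building it and discarding it afterwards.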
[jira] [Assigned] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26208: Assignee: (was: Apache Spark) > Empty dataframe does not roundtrip for csv with header > -- > > Key: SPARK-26208 > URL: https://issues.apache.org/jira/browse/SPARK-26208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: master branch, > commit 034ae305c33b1990b3c1a284044002874c343b4d, > date: Sun Nov 18 16:02:15 2018 +0800 >Reporter: koert kuipers >Priority: Minor > > when we write an empty part file for csv and header=true we fail to write the > header. the result cannot be read back in. > when header=true a part file with zero rows should still have a header
[jira] [Commented] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702345#comment-16702345 ] Apache Spark commented on SPARK-26208: -- User 'koertkuipers' has created a pull request for this issue: https://github.com/apache/spark/pull/23173
[jira] [Assigned] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26208: Assignee: Apache Spark
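The behavior this ticket asks for is simple: when header=true, emit the header line even when a part file has zero data rows, so reading the file back recovers the (empty) schema. A self-contained sketch of that round-trip, independent of Spark's CSV writer:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvHeaderSketch {
    // Writes a CSV part file: the header line always, then data rows (possibly none).
    static void writePart(Path file, String[] columns, List<String[]> rows) {
        StringBuilder sb = new StringBuilder(String.join(",", columns)).append('\n');
        for (String[] row : rows) sb.append(String.join(",", row)).append('\n');
        try {
            Files.write(file, sb.toString().getBytes("UTF-8"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Round-trips a zero-row file and returns the recovered header line.
    static String roundTripHeader(String[] columns) {
        try {
            Path file = Files.createTempFile("part-00000", ".csv");
            writePart(file, columns, List.of());  // zero rows, header still written
            return Files.readAllLines(file).get(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripHeader(new String[] {"id", "name"}));  // id,name
    }
}
```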
[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support
[ https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702313#comment-16702313 ] Apache Spark commented on SPARK-25957: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/23172 > Skip building spark-r docker image if spark distribution does not have R > support > > > Key: SPARK-25957 > URL: https://issues.apache.org/jira/browse/SPARK-25957 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Nagaram Prasad Addepally >Priority: Major > Fix For: 3.0.0 > > > [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] > script by default tries to build the spark-r image. We may not always build the > Spark distribution with R support. It would be good to skip building and > publishing spark-r images if R support is not available in the Spark > distribution.
[jira] [Commented] (SPARK-26205) Optimize In expression for bytes, shorts, ints
[ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702295#comment-16702295 ] Apache Spark commented on SPARK-26205: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/23171 > Optimize In expression for bytes, shorts, ints > -- > > Key: SPARK-26205 > URL: https://issues.apache.org/jira/browse/SPARK-26205 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > Currently, {{In}} expressions are compiled into a sequence of if-else > statements, which results in O\(n\) time complexity. {{InSet}} is an > optimized version of {{In}}, which is supposed to improve the performance if > the number of elements is big enough. However, {{InSet}} actually degrades > the performance in many cases due to various reasons (benchmarks will be > available in SPARK-26203 and solutions are discussed in SPARK-26204). > The main idea of this JIRA is to make use of {{tableswitch}} and > {{lookupswitch}} bytecode instructions. In short, we can improve our time > complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java > {{switch}} statements. We will have O\(1\) time complexity if our case values > are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will > give us O\(log n\). > An important benefit of the proposed approach is that we do not have to pay > an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we > can substantially outperform {{InSet}} even on 250+ elements. > See > [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] > and > [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] > for more information. 
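The {{tableswitch}}/{{lookupswitch}} idea described above can be sketched in plain Java: a membership test written as a {{switch}} compiles to a {{tableswitch}} when the case values are compact and to a {{lookupswitch}} when they are sparse. The class and method names below are illustrative only, not Spark code:

```java
public class InSwitchDemo {
    // Compact, consecutive case values: javac emits a tableswitch,
    // giving O(1) dispatch via a jump table.
    static boolean inCompact(int v) {
        switch (v) {
            case 1: case 2: case 3: case 4: case 5:
                return true;
            default:
                return false;
        }
    }

    // Sparse case values: javac emits a lookupswitch, which performs
    // a binary search over the sorted keys, i.e. O(log n).
    static boolean inSparse(int v) {
        switch (v) {
            case 10: case 1000: case 100000:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inCompact(3));   // true
        System.out.println(inCompact(9));   // false
        System.out.println(inSparse(1000)); // true
    }
}
```

Either way the operands stay primitive ints, which is the autoboxing saving over a hash-set-based {{InSet}} that the comment mentions.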
[jira] [Assigned] (SPARK-26205) Optimize In expression for bytes, shorts, ints
[ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26205: Assignee: Apache Spark > Optimize In expression for bytes, shorts, ints > -- > > Key: SPARK-26205 > URL: https://issues.apache.org/jira/browse/SPARK-26205 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Assignee: Apache Spark >Priority: Major > > Currently, {{In}} expressions are compiled into a sequence of if-else > statements, which results in O\(n\) time complexity. {{InSet}} is an > optimized version of {{In}}, which is supposed to improve the performance if > the number of elements is big enough. However, {{InSet}} actually degrades > the performance in many cases due to various reasons (benchmarks will be > available in SPARK-26203 and solutions are discussed in SPARK-26204). > The main idea of this JIRA is to make use of {{tableswitch}} and > {{lookupswitch}} bytecode instructions. In short, we can improve our time > complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java > {{switch}} statements. We will have O\(1\) time complexity if our case values > are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will > give us O\(log n\). > An important benefit of the proposed approach is that we do not have to pay > an extra cost for autoboxing as in the case of {{InSet}}. As a consequence, we > can substantially outperform {{InSet}} even on 250+ elements. > See > [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] > and > [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] > for more information.
[jira] [Assigned] (SPARK-26205) Optimize In expression for bytes, shorts, ints
[ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26205: Assignee: (was: Apache Spark) > Optimize In expression for bytes, shorts, ints > -- > > Key: SPARK-26205 > URL: https://issues.apache.org/jira/browse/SPARK-26205 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > Currently, {{In}} expressions are compiled into a sequence of if-else > statements, which results in O\(n\) time complexity. {{InSet}} is an > optimized version of {{In}}, which is supposed to improve the performance if > the number of elements is big enough. However, {{InSet}} actually degrades > the performance in many cases due to various reasons (benchmarks will be > available in SPARK-26203 and solutions are discussed in SPARK-26204). > The main idea of this JIRA is to make use of {{tableswitch}} and > {{lookupswitch}} bytecode instructions. In short, we can improve our time > complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java > {{switch}} statements. We will have O\(1\) time complexity if our case values > are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will > give us O\(log n\). > An important benefit of the proposed approach is that we do not have to pay > an extra cost for autoboxing as in the case of {{InSet}}. As a consequence, we > can substantially outperform {{InSet}} even on 250+ elements. > See > [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] > and > [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] > for more information.
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702225#comment-16702225 ] mayur commented on SPARK-14948: --- I am also facing this issue. Any idea of the ETA would be great to know! > Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh >Priority: Major > > h2. Spark Analyser is throwing the following exception in a specific scenario > : > h2. Exception : > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code : > {code:title=SparkClient.java|borderStyle=solid} > StructField[] fields = new StructField[2]; > fields[0] = new StructField("F1", DataTypes.StringType, true, > Metadata.empty()); > fields[1] = new StructField("F2", DataTypes.StringType, true, > Metadata.empty()); > JavaRDD<Row> rdd = > > sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", > "b"))); > DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new > StructType(fields)); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1"); > DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as > asd, F2 from t1"); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, > "t2"); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3"); > > DataFrame join = aliasedDf.join(df, > aliasedDf.col("F2").equalTo(df.col("F2")), "inner"); > DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1")); > select.collect(); > {code} > h2. Observations : > * This issue is related to the Data Type of Fields of the initial Data > Frame. (If the Data Type is not String, it will work.) > * It works fine if the data frame is registered as a temporary table and an > sql (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written.
[jira] [Created] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
koert kuipers created SPARK-26208: - Summary: Empty dataframe does not roundtrip for csv with header Key: SPARK-26208 URL: https://issues.apache.org/jira/browse/SPARK-26208 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Environment: master branch, commit 034ae305c33b1990b3c1a284044002874c343b4d, date: Sun Nov 18 16:02:15 2018 +0800 Reporter: koert kuipers When we write an empty part file for CSV with header=true, we fail to write the header, so the result cannot be read back in. When header=true, a part file with zero rows should still have a header.
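For reference, the expected behavior — emit the header line even for a zero-row part file, so the schema round-trips — can be sketched without Spark. The writer below is a hypothetical stand-in, not Spark's CSV data source:

```java
import java.util.List;

public class CsvHeaderDemo {
    // Hypothetical CSV part-file writer: the header line is emitted
    // unconditionally, so an empty partition still carries its schema
    // and can be read back.
    static String writePart(List<String> header, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.join(",", header)).append('\n'); // always write the header
        for (List<String> row : rows) {
            sb.append(String.join(",", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A part file with zero rows keeps its header line.
        System.out.print(writePart(List.of("F1", "F2"), List.of()));
    }
}
```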
[jira] [Commented] (SPARK-26103) OutOfMemory error with large query plans
[ https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702213#comment-16702213 ] Apache Spark commented on SPARK-26103: -- User 'DaveDeCaprio' has created a pull request for this issue: https://github.com/apache/spark/pull/23169 > OutOfMemory error with large query plans > > > Key: SPARK-26103 > URL: https://issues.apache.org/jira/browse/SPARK-26103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 > Environment: Amazon EMR 5.19 > 1 c5.4xlarge master instance > 1 c5.4xlarge core instance > 2 c5.4xlarge task instances >Reporter: Dave DeCaprio >Priority: Major > > Large query plans can cause OutOfMemory errors in the Spark driver. > We are creating data frames that are not extremely large but contain lots of > nested joins. These plans execute efficiently because of caching and > partitioning, but the text version of the query plans generated can be > hundreds of megabytes. Running many of these in parallel causes our driver > process to fail. 
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at > java.util.Arrays.copyOfRange(Arrays.java:2694) at > java.lang.String.<init>(String.java:203) at > java.lang.StringBuilder.toString(StringBuilder.java:405) at > scala.StringContext.standardInterpolator(StringContext.scala:125) at > scala.StringContext.s(StringContext.scala:90) at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > > > A similar error is reported in > [https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format] > > Code exists to truncate the string if the number of output columns is larger > than 25, but not if the rest of the query plan is huge.
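The truncation idea mentioned at the end can be sketched in a few lines: bound the rendered plan string at a fixed length instead of materializing hundreds of megabytes in the driver. This is a minimal, non-Spark illustration under the assumption that the plan text is available as a string; the names are hypothetical, not Spark's actual API:

```java
public class PlanTruncationDemo {
    // Cap a rendered plan string at maxLength characters, keeping a
    // bounded prefix plus a marker so readers know text was dropped.
    static String truncate(String plan, int maxLength) {
        if (plan.length() <= maxLength) {
            return plan;
        }
        return plan.substring(0, maxLength) + "... [truncated]";
    }

    public static void main(String[] args) {
        // Simulate a plan string far too large to keep in full.
        String hugePlan = "Join(".repeat(100_000);
        String bounded = truncate(hugePlan, 1000);
        System.out.println(bounded.length()); // bounded: 1000 chars + 15-char marker
    }
}
```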
[jira] [Commented] (SPARK-24423) Add a new option `query` for JDBC sources
[ https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702228#comment-16702228 ] Apache Spark commented on SPARK-24423: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/23170 > Add a new option `query` for JDBC sources > - > > Key: SPARK-24423 > URL: https://issues.apache.org/jira/browse/SPARK-24423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.4.0 > > > Currently, our JDBC connector provides the option `dbtable` for users to > specify the to-be-loaded JDBC source table. > {code} > val jdbcDf = spark.read > .format("jdbc") > .option("*dbtable*", "dbName.tableName") > .options(jdbcCredentials: Map) > .load() > {code} > Normally, users do not fetch the whole JDBC table due to the poor > performance/throughput of JDBC. Thus, they normally just fetch a small set of > tables. For advanced users, they can pass a subquery as the option. > {code} > val query = """ (select * from tableName limit 10) as tmp """ > val jdbcDf = spark.read > .format("jdbc") > .option("*dbtable*", query) > .options(jdbcCredentials: Map) > .load() > {code} > However, this is not straightforward for end users. We should simply allow users > to specify the query via a new option `query`. We will handle the complexity > for them. > {code} > val query = """select * from tableName limit 10""" > val jdbcDf = spark.read > .format("jdbc") > .option("*query*", query) > .options(jdbcCredentials: Map) > .load() > {code} > Users are not allowed to specify query and dbtable at the same time.
[jira] [Commented] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch
[ https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702205#comment-16702205 ] Apache Spark commented on SPARK-26207: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/23168 > add PowerIterationClustering (PIC) doc in 2.4 branch > - > > Key: SPARK-26207 > URL: https://issues.apache.org/jira/browse/SPARK-26207 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > add PIC documentation in docs/ml-clustering.md
[jira] [Assigned] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch
[ https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26207: Assignee: (was: Apache Spark) > add PowerIterationClustering (PIC) doc in 2.4 branch > - > > Key: SPARK-26207 > URL: https://issues.apache.org/jira/browse/SPARK-26207 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > add PIC documentation in docs/ml-clustering.md
[jira] [Assigned] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch
[ https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26207: Assignee: Apache Spark > add PowerIterationClustering (PIC) doc in 2.4 branch > - > > Key: SPARK-26207 > URL: https://issues.apache.org/jira/browse/SPARK-26207 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > add PIC documentation in docs/ml-clustering.md
[jira] [Created] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch
Huaxin Gao created SPARK-26207: -- Summary: add PowerIterationClustering (PIC) doc in 2.4 branch Key: SPARK-26207 URL: https://issues.apache.org/jira/browse/SPARK-26207 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 2.4.0 Reporter: Huaxin Gao add PIC documentation in docs/ml-clustering.md
[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode
[ https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] indraneel r updated SPARK-26206: Environment: Operating system : MacOS Mojave spark version : 2.4.0 spark-sql-kafka-0-10 : 2.4.0 kafka version 1.1.1 scala version : 2.12.7 was: Operating system : MacOS Mojave spark version : 2.4.0 spark-streaming-kafka-0-10 : 2.4.0 kafka version 1.1.1 scala version : 2.12.7 > Spark structured streaming with kafka integration fails in update mode > --- > > Key: SPARK-26206 > URL: https://issues.apache.org/jira/browse/SPARK-26206 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Operating system : MacOS Mojave > spark version : 2.4.0 > spark-sql-kafka-0-10 : 2.4.0 > kafka version 1.1.1 > scala version : 2.12.7 >Reporter: indraneel r >Priority: Blocker > > Spark structured streaming with kafka integration fails in update mode with > compilation exception in code generation. > Here's the code that was executed: > {code:java} > // code placeholder > override def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder > .master("local[*]") > .appName("SparkStreamingTest") > .getOrCreate() > > val kafkaParams = Map[String, String]( > "kafka.bootstrap.servers" -> "localhost:9092", > "startingOffsets" -> "earliest", > "subscribe" -> "test_events") > > val schema = Encoders.product[UserEvent].schema > val query = spark.readStream.format("kafka") > .options(kafkaParams) > .load() > .selectExpr("CAST(value AS STRING) as message") > .select(from_json(col("message"), schema).as("json")) > .select("json.*") > .groupBy(window(col("event_time"), "10 minutes")) > .count() > .writeStream > .foreachBatch { (batch: Dataset[Row], batchId: Long) => > println(s"batch : ${batchId}") > batch.show(false) > } > .outputMode("update") > .start() > query.awaitTermination() > }{code} > It succeeds for batch 0 but fails for batch 1 with following exception when > more 
data arrives in the stream. > {code:java} > 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 25, Column 18: A method named "putLong" is not declared in any enclosing > class nor any supertype, nor through a static import > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 25, Column 18: A method named "putLong" is not declared in any enclosing > class nor any supertype, nor through a static import > at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060) > at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421) > at > org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781) > at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760) > at > org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360) > at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494) > at > 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330) > at
[jira] [Commented] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour
[ https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702146#comment-16702146 ] Apache Spark commented on SPARK-26024: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/23167 > Dataset API: repartitionByRange(...) has inconsistent behaviour > --- > > Key: SPARK-26024 > URL: https://issues.apache.org/jira/browse/SPARK-26024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 > Environment: Spark version 2.3.2 >Reporter: Julien Peloton >Assignee: Julien Peloton >Priority: Major > Labels: dataFrame, partitioning, repartition, spark-sql > Fix For: 3.0.0 > > > Hi, > I recently played with the {{repartitionByRange}} method for DataFrame > introduced in SPARK-22614. For DataFrames larger than the one tested in the > code (which has only 10 elements), the code sends back random results. > As a test for showing the inconsistent behaviour, I start as the unit code > used to test {{repartitionByRange}} > ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]) > but I increase the size of the initial array to 1000, repartition using 3 > partitions, and count the number of element per-partitions: > > {code} > // Shuffle numbers from 0 to 1000, and make a DataFrame > val df = Random.shuffle(0.to(1000)).toDF("val") > // Repartition it using 3 partitions > // Sum up number of elements in each partition, and collect it. > // And do it several times > for (i <- 0 to 9) { > var counts = df.repartitionByRange(3, col("val")) > .mapPartitions{part => Iterator(part.size)} > .collect() > println(counts.toList) > } > // -> the number of elements in each partition varies... > {code} > I do not know whether it is expected (I will dig further in the code), but it > sounds like a bug. > Or I just misinterpret what {{repartitionByRange}} is for? > Any ideas? > Thanks! 
> Julien
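The varying per-partition counts reported above are consistent with range boundaries being estimated from a random sample of the data. The following is a minimal, non-Spark sketch of that mechanism under stated assumptions — the class, the 5% sampling rate, and the boundary-picking scheme are all illustrative, not Spark's actual implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RangePartitionSketch {
    // Estimate range boundaries from a random sample, then count how
    // many elements fall into each range. Different sampling seeds give
    // different boundaries, hence different per-partition counts.
    static int[] partitionCounts(List<Integer> data, int numPartitions, long seed) {
        Random rng = new Random(seed);
        List<Integer> sample = new ArrayList<>();
        for (int v : data) {
            if (rng.nextDouble() < 0.05) sample.add(v); // ~5% sample
        }
        Collections.sort(sample);
        // Pick numPartitions - 1 boundaries from the sorted sample.
        int[] bounds = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            bounds[i - 1] = sample.isEmpty() ? 0 : sample.get(i * sample.size() / numPartitions);
        }
        int[] counts = new int[numPartitions];
        for (int v : data) {
            int p = 0;
            while (p < bounds.length && v >= bounds[p]) p++;
            counts[p]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i <= 1000; i++) data.add(i);
        // Each seed yields different counts, but every element always
        // lands in exactly one partition.
        for (long seed = 0; seed < 3; seed++) {
            int[] c = partitionCounts(data, 3, seed);
            System.out.println(c[0] + " " + c[1] + " " + c[2]);
        }
    }
}
```

Under this reading the behavior is approximate by design rather than a bug: the ranges are balanced only as well as the sample represents the data.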
[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode
[ https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] indraneel r updated SPARK-26206: Environment: Operating system : MacOS Mojave spark version : 2.4.0 spark-streaming-kafka-0-10 : 2.4.0 kafka version 1.1.1 scala version : 2.12.7 was: Operating system : MacOS Mojave spark version : 2.4.0 spark-streaming-kafka-0-10 : 2.4.0 kafka version 1.1.1 > Spark structured streaming with kafka integration fails in update mode > --- > > Key: SPARK-26206 > URL: https://issues.apache.org/jira/browse/SPARK-26206 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Operating system : MacOS Mojave > spark version : 2.4.0 > spark-streaming-kafka-0-10 : 2.4.0 > kafka version 1.1.1 > scala version : 2.12.7 >Reporter: indraneel r >Priority: Blocker > > Spark structured streaming with kafka integration fails in update mode with > compilation exception in code generation. > Here's the code that was executed: > {code:java} > // code placeholder > override def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder > .master("local[*]") > .appName("SparkStreamingTest") > .getOrCreate() > > val kafkaParams = Map[String, String]( > "kafka.bootstrap.servers" -> "localhost:9092", > "startingOffsets" -> "earliest", > "subscribe" -> "test_events") > > val schema = Encoders.product[UserEvent].schema > val query = spark.readStream.format("kafka") > .options(kafkaParams) > .load() > .selectExpr("CAST(value AS STRING) as message") > .select(from_json(col("message"), schema).as("json")) > .select("json.*") > .groupBy(window(col("event_time"), "10 minutes")) > .count() > .writeStream > .foreachBatch { (batch: Dataset[Row], batchId: Long) => > println(s"batch : ${batchId}") > batch.show(false) > } > .outputMode("update") > .start() > query.awaitTermination() > }{code} > It succeeds for batch 0 but fails for batch 1 with the following exception when > more data
arrives in the stream. > {code:java} > 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 25, Column 18: A method named "putLong" is not declared in any enclosing > class nor any supertype, nor through a static import > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 25, Column 18: A method named "putLong" is not declared in any enclosing > class nor any supertype, nor through a static import > at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060) > at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421) > at > org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781) > at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760) > at > org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360) > at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487) 
> at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330) > at
[jira] [Commented] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled
[ https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702118#comment-16702118 ] Apache Spark commented on SPARK-26201: -- User 'redsanket' has created a pull request for this issue: https://github.com/apache/spark/pull/23166 > python broadcast.value on driver fails with disk encryption enabled > --- > > Key: SPARK-26201 > URL: https://issues.apache.org/jira/browse/SPARK-26201 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2 >Reporter: Thomas Graves >Priority: Major > > I was trying python with rpc and disk encryption enabled and when I tried a > python broadcast variable and just read the value back on the driver side the > job failed with: > > Traceback (most recent call last): File "broadcast.py", line 37, in > words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value > File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File > "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of > input > To reproduce use configs: --conf spark.network.crypto.enabled=true --conf > spark.io.encryption.enabled=true > > Code: > words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"]) > words_new.value > print(words_new.value) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26205) Optimize In expression for bytes, shorts, ints
Anton Okolnychyi created SPARK-26205: Summary: Optimize In expression for bytes, shorts, ints Key: SPARK-26205 URL: https://issues.apache.org/jira/browse/SPARK-26205 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Anton Okolnychyi Currently, {{In}} expressions are compiled into a sequence of if-else statements, which results in O\(n\) time complexity. {{InSet}} is an optimized version of {{In}}, which is supposed to improve the performance if the number of elements is big enough. However, {{InSet}} actually degrades the performance in many cases due to various reasons (benchmarks will be available in SPARK-26203 and solutions are discussed in SPARK-26204). The main idea of this JIRA is to make use of {{tableswitch}} and {{lookupswitch}} bytecode instructions. In short, we can improve our time complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java {{switch}} statements. We will have O\(1\) time complexity if our case values are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will give us O\(log n\). An important benefit of the proposed approach is that we do not have to pay an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we can substantially outperform {{InSet}} even on 250+ elements. See [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] and [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] for more information.
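The {{tableswitch}}/{{lookupswitch}} distinction above can be reproduced outside Spark with plain javac. The sketch below is illustrative only (the values and method names are invented; this is not Spark's generated code): compact case values compile to a {{tableswitch}} jump table (O(1)), sparse values fall back to a binary-searched {{lookupswitch}} (O(log n)), and both operate on primitive {{int}}, so no autoboxing occurs.

```java
// Illustrative only: an In-style membership test expressed as a switch
// instead of an if-else chain. Compact case values let javac emit a
// tableswitch (constant-time jump table); sparse values fall back to a
// lookupswitch (binary search over sorted keys).
public class SwitchDemo {
    // Compact values 1..4 -> javac emits tableswitch.
    static boolean inCompact(int v) {
        switch (v) {
            case 1: case 2: case 3: case 4:
                return true;
            default:
                return false;
        }
    }

    // Sparse values -> javac emits lookupswitch.
    static boolean inSparse(int v) {
        switch (v) {
            case 1: case 1000: case 500000:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inCompact(3));      // true
        System.out.println(inCompact(7));      // false
        System.out.println(inSparse(1000));    // true
        System.out.println(inSparse(2));       // false
    }
}
```

The emitted instruction can be verified with `javap -c SwitchDemo` after compiling.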
[jira] [Created] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode
indraneel r created SPARK-26206: --- Summary: Spark structured streaming with kafka integration fails in update mode Key: SPARK-26206 URL: https://issues.apache.org/jira/browse/SPARK-26206 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Environment: Operating system : MacOS Mojave spark version : 2.4.0 spark-streaming-kafka-0-10 : 2.4.0 kafka version 1.1.1 Reporter: indraneel r Spark structured streaming with kafka integration fails in update mode with a compilation exception in code generation. Here's the code that was executed: {code:java} // code placeholder override def main(args: Array[String]): Unit = { val spark = SparkSession .builder .master("local[*]") .appName("SparkStreamingTest") .getOrCreate() val kafkaParams = Map[String, String]( "kafka.bootstrap.servers" -> "localhost:9092", "startingOffsets" -> "earliest", "subscribe" -> "test_events") val schema = Encoders.product[UserEvent].schema val query = spark.readStream.format("kafka") .options(kafkaParams) .load() .selectExpr("CAST(value AS STRING) as message") .select(from_json(col("message"), schema).as("json")) .select("json.*") .groupBy(window(col("event_time"), "10 minutes")) .count() .writeStream .foreachBatch { (batch: Dataset[Row], batchId: Long) => println(s"batch : ${batchId}") batch.show(false) } .outputMode("update") .start() query.awaitTermination() }{code} It succeeds for batch 0 but fails for batch 1 with the following exception when more data arrives in the stream.
{code:java} 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, Column 18: A method named "putLong" is not declared in any enclosing class nor any supertype, nor through a static import org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, Column 18: A method named "putLong" is not declared in any enclosing class nor any supertype, nor through a static import at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060) at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215) at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421) at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781) at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215) at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760) at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360) at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215) at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494) at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487) at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487) at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981) at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848) at
[jira] [Updated] (SPARK-26205) Optimize In expression for bytes, shorts, ints
[ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-26205: - Description: Currently, {{In}} expressions are compiled into a sequence of if-else statements, which results in O\(n\) time complexity. {{InSet}} is an optimized version of {{In}}, which is supposed to improve the performance if the number of elements is big enough. However, {{InSet}} actually degrades the performance in many cases due to various reasons (benchmarks will be available in SPARK-26203 and solutions are discussed in SPARK-26204). The main idea of this JIRA is to make use of {{tableswitch}} and {{lookupswitch}} bytecode instructions. In short, we can improve our time complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java {{switch}} statements. We will have O\(1\) time complexity if our case values are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will give us O\(log n\). An important benefit of the proposed approach is that we do not have to pay an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we can substantially outperform {{InSet}} even on 250+ elements. See [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] and [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] for more information. was: Currently, {{In}} expressions are compiled into a sequence of if-else statements, which results in O\(n\) time complexity. {{InSet}} is an optimized version of {{In}}, which is supposed to improve the performance if the number of elements is big enough. However, {{InSet}} actually degrades the performance in many cases due to various reasons (benchmarks will be available in SPARK-26203 and solutions are discussed in SPARK-26204). The main idea of this JIRA is to make use of {{tableswitch}} and {{lookupswitch}} bytecode instructions. 
In short, we can improve our time complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java {{switch}} statements. We will have O\(1\) time complexity if our case values are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will give us O\(log n\). An important benefit of the proposed approach is that we do not have to pay an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we can substantially outperform {{InSet}} even on 250+ elements. See [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] and here for [more|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] information. > Optimize In expression for bytes, shorts, ints > -- > > Key: SPARK-26205 > URL: https://issues.apache.org/jira/browse/SPARK-26205 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > Currently, {{In}} expressions are compiled into a sequence of if-else > statements, which results in O\(n\) time complexity. {{InSet}} is an > optimized version of {{In}}, which is supposed to improve the performance if > the number of elements is big enough. However, {{InSet}} actually degrades > the performance in many cases due to various reasons (benchmarks will be > available in SPARK-26203 and solutions are discussed in SPARK-26204). > The main idea of this JIRA is to make use of {{tableswitch}} and > {{lookupswitch}} bytecode instructions. In short, we can improve our time > complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java > {{switch}} statements. We will have O\(1\) time complexity if our case values > are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will > give us O\(log n\). > An important benefit of the proposed approach is that we do not have to pay > an extra cost for autoboxing as in case of {{InSet}}. 
As a consequence, we > can substantially outperform {{InSet}} even on 250+ elements. > See > [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] > and > [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] > for more information.
[jira] [Assigned] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled
[ https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26201: Assignee: Apache Spark
[jira] [Assigned] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled
[ https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26201: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-26204) Optimize InSet expression
[ https://issues.apache.org/jira/browse/SPARK-26204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-26204: - Description: The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time complexity in the {{In}} expression. As {{InSet}} relies on Scala {{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the performance of {{InSet}} might be significantly slower than {{In}} even on 100+ values. We need to find a way to optimize {{InSet}} expressions and avoid the cost of autoboxing. There are a few approaches that we can use: * Collections for primitive values (e.g., FastUtil, HPPC) * Type specialization in Scala (would it even work for code gen in Spark?) I tried to use {{OpenHashSet}}, which is already available in Spark and uses type specialization. However, I did not manage to avoid autoboxing. On the other hand, FastUtil did work and I saw a substantial improvement in the performance. See the attached screenshot of what I experienced while testing. was: The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time complexity in the `In` expression. As {{InSet}} relies on Scala {{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the performance of {{InSet}} might be significantly slower than {{In}} even on 100+ values. We need to find an approach how to optimize {{InSet}} expressions and avoid the cost of autoboxing. There are a few approaches that we can use: * Collections for primitive values (e.g., FastUtil, HPPC) * Type specialization in Scala (would it even work for code gen in Spark?) I tried to use {{OpenHashSet}}, which is already available in Spark and uses type specialization. However, I did not manage to avoid autoboxing. On the other hand, FastUtil did work and I saw a substantial improvement in the performance. See the attached screenshot of what I experienced while testing.
> Optimize InSet expression > - > > Key: SPARK-26204 > URL: https://issues.apache.org/jira/browse/SPARK-26204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > Attachments: heap size.png > > > The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time > complexity in the {{In}} expression. As {{InSet}} relies on Scala > {{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the > performance of {{InSet}} might be significantly slower than {{In}} even on > 100+ values. > We need to find a way to optimize {{InSet}} expressions and avoid > the cost of autoboxing. > There are a few approaches that we can use: > * Collections for primitive values (e.g., FastUtil, HPPC) > * Type specialization in Scala (would it even work for code gen in Spark?) > I tried to use {{OpenHashSet}}, which is already available in Spark and uses > type specialization. However, I did not manage to avoid autoboxing. On the > other hand, FastUtil did work and I saw a substantial improvement in the > performance. > See the attached screenshot of what I experienced while testing.
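A minimal sketch of the autoboxing gap described above, in plain Java rather than Spark's codegen (class and field names are invented for illustration): probing a {{java.util.Set<Integer>}} boxes every primitive {{int}} argument, while a sorted primitive array is probed with no allocation at all — the gap that primitive-specialized collections such as FastUtil's {{IntOpenHashSet}} are designed to close.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (not Spark code) of the autoboxing cost of boxed sets.
// A Set<Integer> auto-boxes every primitive int probe into an Integer object;
// a sorted int[] is probed with pure primitive comparisons and no allocation.
public class AutoboxDemo {
    static final int[] PRIMITIVE = {3, 7, 42, 100};   // kept sorted for binarySearch
    static final Set<Integer> BOXED = new HashSet<>();
    static {
        for (int v : PRIMITIVE) BOXED.add(v);         // each add() boxes v
    }

    // Boxed probe: v is auto-boxed to Integer on every call.
    static boolean containsBoxed(int v) {
        return BOXED.contains(v);
    }

    // Primitive probe: no allocation, O(log n) primitive comparisons.
    static boolean containsPrimitive(int v) {
        return Arrays.binarySearch(PRIMITIVE, v) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsBoxed(42));       // true
        System.out.println(containsPrimitive(42));   // true
        System.out.println(containsBoxed(5));        // false
    }
}
```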
[jira] [Updated] (SPARK-26204) Optimize InSet expression
[ https://issues.apache.org/jira/browse/SPARK-26204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-26204: - Attachment: heap size.png
[jira] [Updated] (SPARK-26204) Optimize InSet expression
[ https://issues.apache.org/jira/browse/SPARK-26204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-26204: - Attachment: (was: fastutils.png)
[jira] [Updated] (SPARK-26204) Optimize InSet expression
[ https://issues.apache.org/jira/browse/SPARK-26204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-26204: - Attachment: fastutils.png
[jira] [Created] (SPARK-26204) Optimize InSet expression
Anton Okolnychyi created SPARK-26204: Summary: Optimize InSet expression Key: SPARK-26204 URL: https://issues.apache.org/jira/browse/SPARK-26204 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Anton Okolnychyi Attachments: fastutils.png The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time complexity in the {{In}} expression. As {{InSet}} relies on Scala {{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the performance of {{InSet}} might be significantly slower than {{In}} even on 100+ values. We need to find a way to optimize {{InSet}} expressions and avoid the cost of autoboxing. There are a few approaches that we can use: * Collections for primitive values (e.g., FastUtil, HPPC) * Type specialization in Scala (would it even work for code gen in Spark?) I tried to use {{OpenHashSet}}, which is already available in Spark and uses type specialization. However, I did not manage to avoid autoboxing. On the other hand, FastUtil did work and I saw a substantial improvement in the performance. See the attached screenshot of what I experienced while testing.
[jira] [Updated] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-26203: - Description: {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons to avoid O\(n\) time complexity for {{In}}. The original optimization was done in SPARK-3711. A lot has changed since then (e.g., generation of Java code to evaluate expressions), so it is worth measuring the performance of this optimization again. According to my local benchmarks, {{InSet}} can be up to 10x slower than {{In}} due to autoboxing and other issues. The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions. Based on my preliminary investigation, we can do quite a few optimizations, which frequently depend on the specific data type. was: {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons to avoid O(n) time complexity for {{In}}. The original optimization was done in SPARK-3711. A lot has changed after that (e.g., generation of Java code to evaluate expressions), so it is worth to measure the performance of this optimization again. According to my local benchmarks, {{InSet}} can be up to 10x time slower than {{In}} due to autoboxing and other issues. The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions. Based on my preliminary investigation, we can do quite some optimizations, which quite frequently depend on a specific data type.
> Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed since > then (e.g., generation of Java code to evaluate expressions), so it is worth > measuring the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite a few optimizations, > which frequently depend on the specific data type.
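As a crude illustration of the kind of measurement this JIRA calls for — a real benchmark should use a harness such as JMH to handle warm-up, forks, and dead-code elimination, and the workload here is invented — the two membership strategies can be timed side by side:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Crude sketch of an In-vs-InSet style comparison (not Spark's benchmark).
// Numbers printed by this naive loop are indicative only; use JMH for
// trustworthy results.
public class InBenchSketch {
    // "In"-style membership: a chain of primitive comparisons (O(n), no boxing).
    static boolean inChain(int v) {
        return v == 1 || v == 17 || v == 42 || v == 99 || v == 256;
    }

    // "InSet"-style membership: a boxed set, so every probe boxes v.
    static final Set<Integer> SET = new HashSet<>(Arrays.asList(1, 17, 42, 99, 256));
    static boolean inSet(int v) {
        return SET.contains(v);
    }

    public static void main(String[] args) {
        int hits = 0;  // accumulate results so the loops are not optimized away
        long t0 = System.nanoTime();
        for (int i = 0; i < 10_000_000; i++) if (inChain(i & 1023)) hits++;
        long t1 = System.nanoTime();
        for (int i = 0; i < 10_000_000; i++) if (inSet(i & 1023)) hits++;
        long t2 = System.nanoTime();
        System.out.printf("chain: %d ms, set: %d ms, hits: %d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, hits);
    }
}
```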
[jira] [Commented] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702096#comment-16702096 ] Apache Spark commented on SPARK-26188: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/23165 > Spark 2.4.0 Partitioning behavior breaks backwards compatibility > > > Key: SPARK-26188 > URL: https://issues.apache.org/jira/browse/SPARK-26188 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Damien Doucet-Girard >Priority: Minor > > My team uses spark to partition and output parquet files to amazon S3. We > typically use 256 partitions, from 00 to ff. > We've observed that in spark 2.3.2 and prior, it reads the partitions as > strings by default. However, in spark 2.4.0 and later, the type of each > partition is inferred by default, and partitions such as 00 become 0 and 4d > become 4.0. > Here is a log sample of this behavior from one of our jobs: > 2.4.0: > {code:java} > 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, > range: 0-662, partition values: [0] > 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, > range: 0-662, partition values: [ef] > 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, > range: 0-662, partition values: [4a] > 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, > range: 0-662, partition values: [74] > 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, > range: 0-662, partition values: [f5] > 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: > 
s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, > range: 0-662, partition values: [50] > 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, > range: 0-662, partition values: [70] > 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, > range: 0-662, partition values: [b9] > 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, > range: 0-662, partition values: [d2] > 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, > range: 0-662, partition values: [51] > 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, > range: 0-662, partition values: [84] > 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, > range: 0-662, partition values: [b5] > 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, > range: 0-662, partition values: [88] > 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, > range: 0-662, partition values: [4.0] > 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, > range: 0-662, partition values: [ac] > 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, > range: 0-662, partition values: [24] > 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: > 
s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, > range: 0-662, partition values: [fd] > 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, > range: 0-662, partition values: [52] > 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, > range: 0-662, partition values: [ab] > 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, > range: 0-662, partition values: [f8] > 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, > range: 0-662, partition values: [7a] > 18/11/27 14:02:50
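For anyone hit by this before a fix lands, two workarounds are worth noting. Both are sketches against Spark 2.4.0; the bucket path and column names are placeholders, not from the report. The first uses the existing `spark.sql.sources.partitionColumnTypeInference.enabled` flag to keep partition values as strings; the second pins the partition column's type with a user-supplied schema.

```scala
// Workaround sketch (Spark 2.4.0); path and column names are illustrative.

// Option 1: turn off partition-column type inference so "00", "4d", ... stay strings.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val asStrings = spark.read.parquet("s3a://bucket/prefix/")

// Option 2: supply an explicit schema so the partition column is read as StringType.
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}
val schema = StructType(Seq(
  StructField("value", LongType),    // data column(s) inside the parquet files
  StructField("suffix", StringType)  // the hex partition column, kept as a string
))
val pinned = spark.read.schema(schema).parquet("s3a://bucket/prefix/")
```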
[jira] [Assigned] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26188: Assignee: Apache Spark
[jira] [Commented] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702095#comment-16702095 ] Gengliang Wang commented on SPARK-26188: [~ddgirard] Thanks for the investigation. I have created https://github.com/apache/spark/pull/23165 to fix it.
[jira] [Assigned] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26188: Assignee: (was: Apache Spark)
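As a guess at the mechanism (not confirmed in this thread): partition-value type inference tries numeric parses, and Java's `Double.parseDouble` accepts a trailing `d`/`f` type suffix, which would explain exactly why `00` becomes the integer `0`, `4d` becomes the double `4.0`, and values like `ef` or `b9` stay strings. A quick spark-shell probe of that hypothesis:

```scala
// Hypothesis check: the numeric parses consulted during partition type
// inference accept Java float/double literal suffixes.
java.lang.Integer.parseInt("00")      // 0   -> "suffix=00" inferred as integer 0
java.lang.Double.parseDouble("4d")    // 4.0 -> "suffix=4d" inferred as double 4.0
// java.lang.Double.parseDouble("ef") // NumberFormatException -> stays a string
```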
[jira] [Updated] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-26203: - Description: {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons to avoid O(n) time complexity for {{In}}. The original optimization was done in SPARK-3711. A lot has changed after that (e.g., generation of Java code to evaluate expressions), so it is worth to measure the performance of this optimization again. According to my local benchmarks, {{InSet}} can be up to 10x time slower than {{In}} due to autoboxing and other issues. The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions. Based on my preliminary investigation, we can do quite some optimizations, which quite frequently depend on a specific data type. was: {{OptimizeIn}} rule that replaces {{In}} with {{InSet}} if the number of possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done for performance reasons to avoid O(n) time complexity for {{In}}. The original optimization was done in SPARK-3711. A lot has changed after that (e.g., generation of Java code to evaluate expressions), so it is worth to measure the performance of this optimization again. According to my local benchmarks, {{InSet}} can be up to 10x time slower than {{In}} due to autoboxing and other issues. The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this information, we can come up with solutions. Based on my preliminary investigation, we can do quite some optimizations, which quite frequently depend on a specific data type. 
> Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons, to avoid O(n) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed since > then (e.g., generation of Java code to evaluate expressions), so it is worth > measuring the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, there are quite a few possible > optimizations, which frequently depend on the specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
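For reproducing the comparison locally, the In-to-InSet rewrite can be toggled through the threshold named in the description; a spark-shell sketch (Spark 2.4+, values illustrative):

```scala
// Sketch: force or suppress the In -> InSet rewrite via the conversion threshold.
import org.apache.spark.sql.functions.col

val df = spark.range(1000000L)

// 11 literals exceed the default threshold (10): OptimizeIn rewrites In to InSet.
df.where(col("id").isin(0L to 10L: _*)).explain(true)

// Raise the threshold to keep the plain In expression for the same predicate.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "1000")
df.where(col("id").isin(0L to 10L: _*)).explain(true)
```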
[jira] [Created] (SPARK-26203) Benchmark performance of In and InSet expressions
Anton Okolnychyi created SPARK-26203: Summary: Benchmark performance of In and InSet expressions Key: SPARK-26203 URL: https://issues.apache.org/jira/browse/SPARK-26203 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Anton Okolnychyi
[jira] [Created] (SPARK-26202) R bucketBy
Huaxin Gao created SPARK-26202: -- Summary: R bucketBy Key: SPARK-26202 URL: https://issues.apache.org/jira/browse/SPARK-26202 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 3.0.0 Reporter: Huaxin Gao Add R version of DataFrameWriter.bucketBy
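For context, this is the Scala `DataFrameWriter` API the R wrapper would presumably mirror (table and column names illustrative):

```scala
// Existing Scala API; bucketing requires saveAsTable (a metastore-backed table).
df.write
  .bucketBy(4, "id")           // 4 buckets, hashed on the "id" column
  .sortBy("id")                // optional sort within each bucket
  .saveAsTable("bucketed_tbl")
```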
[jira] [Updated] (SPARK-21291) R partitionBy API
[ https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-21291: --- Summary: R partitionBy API (was: R bucketBy partitionBy API) > R partitionBy API > - > > Key: SPARK-21291 > URL: https://issues.apache.org/jira/browse/SPARK-21291 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > partitionBy exists but it's for windowspec only
[jira] [Updated] (SPARK-25829) remove duplicated map keys with last wins policy
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25829: Summary: remove duplicated map keys with last wins policy (was: Duplicated map keys are not handled consistently) > remove duplicated map keys with last wins policy > > > Key: SPARK-25829 > URL: https://issues.apache.org/jira/browse/SPARK-25829 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. > e.g. > {code} > scala> sql("SELECT map(1,2,1,3)[1]").show > +--+ > |map(1, 2, 1, 3)[1]| > +--+ > | 2| > +--+ > {code} > However, this handling is not applied consistently.
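Concretely, the retitled policy flips the example from the description; a sketch of the before/after behaviour (the 3.0.0 side assumes this change as merged):

```scala
// Duplicated key 1 maps to both 2 and 3.
// Spark 2.4 (this code path): earlier entry wins.
sql("SELECT map(1,2,1,3)[1]").show()   // prints 2
// Spark 3.0 with the "last wins" policy applied uniformly:
sql("SELECT map(1,2,1,3)[1]").show()   // prints 3
```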
[jira] [Resolved] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25829. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23124 [https://github.com/apache/spark/pull/23124]
[jira] [Commented] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled
[ https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702046#comment-16702046 ] Sanket Reddy commented on SPARK-26201: -- Thanks [~tgraves] will put up the patch shortly > python broadcast.value on driver fails with disk encryption enabled > --- > > Key: SPARK-26201 > URL: https://issues.apache.org/jira/browse/SPARK-26201 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2 >Reporter: Thomas Graves >Priority: Major > > I was trying python with rpc and disk encryption enabled and when I tried a > python broadcast variable and just read the value back on the driver side the > job failed with: > > Traceback (most recent call last): File "broadcast.py", line 37, in > words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value > File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File > "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of > input > To reproduce use configs: --conf spark.network.crypto.enabled=true --conf > spark.io.encryption.enabled=true > > Code: > words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"]) > words_new.value > print(words_new.value)
[jira] [Resolved] (SPARK-25831) should apply "earlier entry wins" in hive map value converter
[ https://issues.apache.org/jira/browse/SPARK-25831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25831. - Resolution: Invalid > should apply "earlier entry wins" in hive map value converter > - > > Key: SPARK-25831 > URL: https://issues.apache.org/jira/browse/SPARK-25831 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > {code} > scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map") > res11: org.apache.spark.sql.DataFrame = [] > scala> sql("select * from t").show > ++ > | map| > ++ > |[1 -> 3]| > ++ > {code} > We mistakenly apply "later entry wins"
[jira] [Resolved] (SPARK-25824) Remove duplicated map entries in `showString`
[ https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25824. - Resolution: Fixed > Remove duplicated map entries in `showString` > - > > Key: SPARK-25824 > URL: https://issues.apache.org/jira/browse/SPARK-25824 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `showString` doesn't eliminate the duplication. So, it looks different from > the result of `collect` and select from saved rows. > *Spark 2.2.2* > {code} > spark-sql> select map(1,2,1,3); > {1:3} > scala> sql("SELECT map(1,2,1,3)").collect > res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) > scala> sql("SELECT map(1,2,1,3)").show > +---+ > |map(1, 2, 1, 3)| > +---+ > |Map(1 -> 3)| > +---+ > {code} > *Spark 2.3.0 ~ 2.4.0-rc4* > {code} > spark-sql> select map(1,2,1,3); > {1:3} > scala> sql("SELECT map(1,2,1,3)").collect > res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) > scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a") > scala> sql("SELECT * FROM m").show > ++ > | a| > ++ > |[1 -> 3]| > ++ > scala> sql("SELECT map(1,2,1,3)").show > ++ > | map(1, 2, 1, 3)| > ++ > |[1 -> 2, 1 -> 3]| > ++ > {code}
[jira] [Resolved] (SPARK-25830) should apply "earlier entry wins" in Dataset.collect
[ https://issues.apache.org/jira/browse/SPARK-25830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25830. - Resolution: Invalid > should apply "earlier entry wins" in Dataset.collect > > > Key: SPARK-25830 > URL: https://issues.apache.org/jira/browse/SPARK-25830 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > {code} > scala> sql("select map(1,2,1,3)").collect > res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) > {code} > We mistakenly apply "later entry wins"
[jira] [Assigned] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25829: --- Assignee: Wenchen Fan
[jira] [Commented] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled
[ https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702028#comment-16702028 ] Thomas Graves commented on SPARK-26201: --- the issue here seems to be that it isn't decrypting the file before trying to read it, we will have a patch up for this shortly.
[jira] [Resolved] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly
[ https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25989. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22991 [https://github.com/apache/spark/pull/22991] > OneVsRestModel handle empty outputCols incorrectly > -- > > Key: SPARK-25989 > URL: https://issues.apache.org/jira/browse/SPARK-25989 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > {\{ml.classification.ClassificationModel}} will ignore empty output columns. > However, \{{OneVsRestModel}} still try to append new column even if its name > is an empty string. > {code:java} > scala> ovrModel.setPredictionCol("").transform(test).show > +-+++---+ > |label| features| rawPrediction| | > +-+++---+ > | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0| > | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0| > | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0| > | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0| > | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0| > | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0| > +-+++---+ > only showing top 20 rows > scala> > 
ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show > +-+++---+ > |label| features| raw| | > +-+++---+ > | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0| > | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0| > | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0| > | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0| > | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0| > | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0| > +-+++---+ > only showing top 20 rows > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly
[ https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25989: - Assignee: zhengruifeng > OneVsRestModel handle empty outputCols incorrectly > -- > > Key: SPARK-25989 > URL: https://issues.apache.org/jira/browse/SPARK-25989 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > {\{ml.classification.ClassificationModel}} will ignore empty output columns. > However, \{{OneVsRestModel}} still try to append new column even if its name > is an empty string. > {code:java} > scala> ovrModel.setPredictionCol("").transform(test).show > +-+++---+ > |label| features| rawPrediction| | > +-+++---+ > | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0| > | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0| > | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0| > | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0| > | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0| > | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0| > +-+++---+ > only showing top 20 rows > scala> > ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show > +-+++---+ > |label| 
features| raw| | > +-+++---+ > | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0| > | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0| > | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0| > | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0| > | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0| > | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0| > | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0| > | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0| > | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0| > +-+++---+ > only showing top 20 rows > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled
Thomas Graves created SPARK-26201: - Summary: python broadcast.value on driver fails with disk encryption enabled Key: SPARK-26201 URL: https://issues.apache.org/jira/browse/SPARK-26201 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.2 Reporter: Thomas Graves I was trying python with rpc and disk encryption enabled and when I tried a python broadcast variable and just read the value back on the driver side the job failed with: Traceback (most recent call last): File "broadcast.py", line 37, in words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of input To reproduce use configs: --conf spark.network.crypto.enabled=true --conf spark.io.encryption.enabled=true Code: words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"]) words_new.value print(words_new.value) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object
[ https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25998. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22995 [https://github.com/apache/spark/pull/22995] > TorrentBroadcast holds strong reference to broadcast object > --- > > Key: SPARK-25998 > URL: https://issues.apache.org/jira/browse/SPARK-25998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Brandon Krieger >Assignee: Brandon Krieger >Priority: Major > Fix For: 3.0.0 > > > If we do a large number of broadcast joins while holding onto the Dataset > reference, it will hold onto a large amount of memory for the value of the > broadcast object. The broadcast object is also held in the MemoryStore, but > that will clean itself up to prevent its memory usage from going over a > certain level. In my use case, I don't want to release the reference to the > Dataset (which would allow the broadcast object to be GCed) because I want to > be able to unpersist it at some point in the future (when it is no longer > relevant). > See the following repro in Spark shell: > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.SparkEnv > val startDf = (1 to 100).toDF("num").withColumn("num", > $"num".cast("string")).cache() > val leftDf = startDf.withColumn("num", concat($"num", lit("0"))) > val rightDf = startDf.withColumn("num", concat($"num", lit("1"))) > val broadcastJoinedDf = leftDf.join(broadcast(rightDf), > leftDf.col("num").eqNullSafe(rightDf.col("num"))) > broadcastJoinedDf.count > // Take a heap dump, see UnsafeHashedRelation with hard references in > MemoryStore and Dataset > // Force the MemoryStore to clear itself > SparkEnv.get.blockManager.stop > // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now > referenced only by the Dataset. 
> {code} > If we make the TorrentBroadcast hold a weak reference to the broadcast > object, the second heap dump will show nothing; the UnsafeHashedRelation has > been GCed. > Given that the broadcast object can be reloaded from the MemoryStore, it > seems like it would be alright to use a WeakReference instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
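The weak-reference trade-off proposed above can be sketched in Python's `weakref` module (an analogy in CPython, not Spark's JVM code): once the only strong reference is dropped, a weakly referenced object becomes collectible, and the value would then be reloaded from a backing store (the MemoryStore, in Spark's case) on next access:

```python
import gc
import weakref

class BroadcastValue:
    """Stand-in for the broadcast object held by TorrentBroadcast."""
    def __init__(self, data):
        self.data = data

value = BroadcastValue([1, 2, 3])
weak = weakref.ref(value)
assert weak() is value   # while a strong ref exists, the weak ref resolves

del value                # drop the strong reference (like releasing the Dataset)
gc.collect()             # after collection, the weak ref returns None
```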
[jira] [Assigned] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object
[ https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25998: - Assignee: Brandon Krieger > TorrentBroadcast holds strong reference to broadcast object > --- > > Key: SPARK-25998 > URL: https://issues.apache.org/jira/browse/SPARK-25998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Brandon Krieger >Assignee: Brandon Krieger >Priority: Major > Fix For: 3.0.0 > > > If we do a large number of broadcast joins while holding onto the Dataset > reference, it will hold onto a large amount of memory for the value of the > broadcast object. The broadcast object is also held in the MemoryStore, but > that will clean itself up to prevent its memory usage from going over a > certain level. In my use case, I don't want to release the reference to the > Dataset (which would allow the broadcast object to be GCed) because I want to > be able to unpersist it at some point in the future (when it is no longer > relevant). > See the following repro in Spark shell: > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.SparkEnv > val startDf = (1 to 100).toDF("num").withColumn("num", > $"num".cast("string")).cache() > val leftDf = startDf.withColumn("num", concat($"num", lit("0"))) > val rightDf = startDf.withColumn("num", concat($"num", lit("1"))) > val broadcastJoinedDf = leftDf.join(broadcast(rightDf), > leftDf.col("num").eqNullSafe(rightDf.col("num"))) > broadcastJoinedDf.count > // Take a heap dump, see UnsafeHashedRelation with hard references in > MemoryStore and Dataset > // Force the MemoryStore to clear itself > SparkEnv.get.blockManager.stop > // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now > referenced only by the Dataset. 
> {code} > If we make the TorrentBroadcast hold a weak reference to the broadcast > object, the second heap dump will show nothing; the UnsafeHashedRelation has > been GCed. > Given that the broadcast object can be reloaded from the MemoryStore, it > seems like it would be alright to use a WeakReference instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26137) Linux file separator is hard coded in DependencyUtils used in deploy process
[ https://issues.apache.org/jira/browse/SPARK-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26137. --- Resolution: Fixed Fix Version/s: 2.4.1 3.0.0 2.3.3 Issue resolved by pull request 23102 [https://github.com/apache/spark/pull/23102] > Linux file separator is hard coded in DependencyUtils used in deploy process > > > Key: SPARK-26137 > URL: https://issues.apache.org/jira/browse/SPARK-26137 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Mark Pavey >Assignee: Mark Pavey >Priority: Major > Fix For: 2.3.3, 3.0.0, 2.4.1 > > > During deployment, while downloading dependencies, the code tries to remove > multiple copies of the application jar from the driver classpath. The Linux > file separator ("/") is hard-coded here, so on Windows multiple copies of the > jar are not removed. > This has a knock-on effect when trying to use elasticsearch-spark, as this > library does not run if there are multiple copies of the application jar on > the driver classpath. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
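The bug's pattern can be sketched in a few lines of Python (an illustration of the separator pitfall, not the Scala DependencyUtils code): splitting on a hard-coded "/" fails to extract the jar name from a Windows path, while the platform-aware basename helpers handle both:

```python
import ntpath
import posixpath

app_jar_windows = "C:\\apps\\myapp.jar"

# Naive, Linux-separator-only extraction (the bug's pattern): on a
# Windows path there is no "/", so the whole path comes back and the
# duplicate-jar check never matches.
naive_name = app_jar_windows.split("/")[-1]

# Platform-aware extraction works for either separator convention.
win_name = ntpath.basename(app_jar_windows)
posix_name = posixpath.basename("/apps/myapp.jar")
```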
[jira] [Assigned] (SPARK-26137) Linux file separator is hard coded in DependencyUtils used in deploy process
[ https://issues.apache.org/jira/browse/SPARK-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26137: - Assignee: Mark Pavey > Linux file separator is hard coded in DependencyUtils used in deploy process > > > Key: SPARK-26137 > URL: https://issues.apache.org/jira/browse/SPARK-26137 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Mark Pavey >Assignee: Mark Pavey >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > During deployment, while downloading dependencies the code tries to remove > multiple copies of the application jar from the driver classpath. The Linux > file separator ("/") is hard coded here so on Windows multiple copies of the > jar are not removed. > This has a knock on effect when trying to use elasticsearch-spark as this > library does not run if there are multiple copies of the application jar on > the driver classpath. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-26198: Description: How to reproduce this issue: {code} scala> val meta = new org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json java.lang.NullPointerException at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) {code} was: How to reproduce this issue: {code:scala} scala> val meta = new org.apache.spark.sql.types.MetadataBuilder().putNull("key").build() java.lang.NullPointerException at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) {code} > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code} > scala> val meta = new > org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json > java.lang.NullPointerException > at > org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) > at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
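The NPE pattern can be illustrated with a hypothetical analogue in Python (this is not the Scala `Metadata.toJsonValue` code): a JSON-value serializer whose type dispatch has no branch for a null value falls through to a failure, just as a null metadata entry does above:

```python
def to_json_value(value):
    """Hypothetical serializer with no null branch, mirroring the bug."""
    if isinstance(value, bool):          # bool before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return '"%s"' % value
    # None reaches this point and blows up, like the NPE on a null entry.
    raise TypeError("unsupported metadata value: %r" % (value,))
```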
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701964#comment-16701964 ] Yang Jie commented on SPARK-26155: -- [~Jk_Self] Can you try to add `-XX:+UseSuperWord` to spark.executor.extraJavaOptions to confirm whether L486&487 break the SIMD optimization of jvm. > Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS > in 3TB scale > -- > > Key: SPARK-26155 > URL: https://issues.apache.org/jira/browse/SPARK-26155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis > in Spark2.3 without L486&487.pdf, q19.sql > > > In our test environment, we found a serious performance degradation issue in > Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious > performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark > 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated > this problem and figured out the root cause is in community patch SPARK-21052 > which add metrics to hash join process. And the impact code is > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > . Q19 costs about 30 seconds without these two lines code and 126 seconds > with these code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
David Lyness created SPARK-26200: Summary: Column values are incorrectly transposed when a field in a PySpark Row requires serialization Key: SPARK-26200 URL: https://issues.apache.org/jira/browse/SPARK-26200 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Environment: {noformat} __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.0 /_/ Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 Branch Compiled by user on 2018-10-29T06:22:05Z{noformat} The same issue is observed when PySpark is run on both macOS 10.13.6 and CentOS 7, so this appears to be a cross-platform issue. Reporter: David Lyness h2. Description of issue Whenever a field in a PySpark {{Row}} requires serialization (such as a {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below will assign column values *in alphabetical order*, rather than assigning each column value to its specified columns. h3. Code to reproduce: {code:java} import datetime from pyspark.sql import Row from pyspark.sql.session import SparkSession from pyspark.sql.types import DateType, StringType, StructField, StructType spark = SparkSession.builder.getOrCreate() schema = StructType([ StructField("date_column", DateType()), StructField("my_b_column", StringType()), StructField("my_a_column", StringType()), ]) spark.createDataFrame([Row( date_column=datetime.date.today(), my_b_column="my_b_value", my_a_column="my_a_value" )], schema).show() {code} h3. Expected result: {noformat} +---+---+---+ |date_column|my_b_column|my_a_column| +---+---+---+ | 2018-11-28| my_b_value| my_a_value| +---+---+---+{noformat} h3. Actual result: {noformat} +---+---+---+ |date_column|my_b_column|my_a_column| +---+---+---+ | 2018-11-28| my_a_value| my_b_value| +---+---+---+{noformat} (Note that {{my_a_value}} and {{my_b_value}} are transposed.) h2. 
Analysis of issue Reviewing [the relevant code on GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], there are two relevant conditional blocks: {code:java} if self._needSerializeAnyField: # Block 1, does not work correctly else: # Block 2, works correctly {code} {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and a dictionary of named columns. In Block 2, there is a conditional that works specifically to serialize a {{Row}} object: {code:java} elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): return tuple(obj[n] for n in self.names) {code} There is no such condition in Block 1, so we fall into this instead: {code:java} elif isinstance(obj, (tuple, list)): return tuple(f.toInternal(v) if c else v for f, v, c in zip(self.fields, obj, self._needConversion)) {code} The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will return a different ordering than the schema fields. So we end up with: {code:java} (date, date, True), (b, a, False), (a, b, False) {code} h2. Workarounds Correct behaviour is observed if you use a Python {{list}} or {{dict}} instead of PySpark's {{Row}} object: {code:java} # Using a list works spark.createDataFrame([[ datetime.date.today(), "my_b_value", "my_a_value" ]], schema) # Using a dict also works spark.createDataFrame([{ "date_column": datetime.date.today(), "my_b_column": "my_b_value", "my_a_column": "my_a_value" }], schema){code} Correct behaviour is also observed if you have no fields that require serialization; in this example, changing {{date_column}} to {{StringType}} avoids the correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
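The misalignment in the analysis above can be reproduced in pure Python (a minimal sketch assuming, as the analysis states, that `Row` stores its values as a tuple sorted by field name while the schema keeps declaration order):

```python
schema_names = ["date_column", "my_b_column", "my_a_column"]  # schema order
row_fields = {
    "date_column": "2018-11-28",
    "my_b_column": "my_b_value",
    "my_a_column": "my_a_value",
}

# Row's internal order: values sorted alphabetically by field name.
row_tuple = tuple(row_fields[name] for name in sorted(row_fields))

# Block 1's zip pairs schema order with Row order positionally -- mismatch:
misaligned = dict(zip(schema_names, row_tuple))

# Block 2's approach looks each value up by name instead -- correct:
aligned = {name: row_fields[name] for name in schema_names}
```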
[jira] [Updated] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
[ https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lyness updated SPARK-26200: - Environment: Spark version 2.4.0 Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 The same issue is observed when PySpark is run on both macOS 10.13.6 and CentOS 7, so this appears to be a cross-platform issue. was: {noformat} __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.0 /_/ Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 Branch Compiled by user on 2018-10-29T06:22:05Z{noformat} The same issue is observed when PySpark is run on both macOS 10.13.6 and CentOS 7, so this appears to be a cross-platform issue. > Column values are incorrectly transposed when a field in a PySpark Row > requires serialization > - > > Key: SPARK-26200 > URL: https://issues.apache.org/jira/browse/SPARK-26200 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark version 2.4.0 > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 > The same issue is observed when PySpark is run on both macOS 10.13.6 and > CentOS 7, so this appears to be a cross-platform issue. >Reporter: David Lyness >Priority: Major > Labels: correctness > > h2. Description of issue > Whenever a field in a PySpark {{Row}} requires serialization (such as a > {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below > will assign column values *in alphabetical order*, rather than assigning each > column value to its specified columns. > h3. 
Code to reproduce: > {code:java} > import datetime > from pyspark.sql import Row > from pyspark.sql.session import SparkSession > from pyspark.sql.types import DateType, StringType, StructField, StructType > spark = SparkSession.builder.getOrCreate() > schema = StructType([ > StructField("date_column", DateType()), > StructField("my_b_column", StringType()), > StructField("my_a_column", StringType()), > ]) > spark.createDataFrame([Row( > date_column=datetime.date.today(), > my_b_column="my_b_value", > my_a_column="my_a_value" > )], schema).show() > {code} > h3. Expected result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_b_value| my_a_value| > +---+---+---+{noformat} > h3. Actual result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_a_value| my_b_value| > +---+---+---+{noformat} > (Note that {{my_a_value}} and {{my_b_value}} are transposed.) > h2. Analysis of issue > Reviewing [the relevant code on > GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], > there are two relevant conditional blocks: > > {code:java} > if self._needSerializeAnyField: > # Block 1, does not work correctly > else: > # Block 2, works correctly > {code} > {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and > a dictionary of named columns. 
In Block 2, there is a conditional that works > specifically to serialize a {{Row}} object: > > {code:java} > elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): > return tuple(obj[n] for n in self.names) > {code} > There is no such condition in Block 1, so we fall into this instead: > > {code:java} > elif isinstance(obj, (tuple, list)): > return tuple(f.toInternal(v) if c else v > for f, v, c in zip(self.fields, obj, self._needConversion)) > {code} > The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will > return a different ordering than the schema fields. So we end up with: > {code:java} > (date, date, True), > (b, a, False), > (a, b, False) > {code} > h2. Workarounds > Correct behaviour is observed if you use a Python {{list}} or {{dict}} > instead of PySpark's {{Row}} object: > > {code:java} > # Using a list works > spark.createDataFrame([[ > datetime.date.today(), > "my_b_value", > "my_a_value" > ]], schema) > # Using a dict also works > spark.createDataFrame([{ > "date_column": datetime.date.today(), > "my_b_column": "my_b_value", > "my_a_column": "my_a_value" > }], schema){code} > Correct behaviour is also observed if you have no fields that require > serialization; in this example, changing {{date_column}} to {{StringType}} > avoids the correctness issue. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SPARK-26147) Python UDFs in join condition fail even when using columns from only one side of join
[ https://issues.apache.org/jira/browse/SPARK-26147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26147. - Resolution: Fixed Fix Version/s: 2.4.1 3.0.0 Issue resolved by pull request 23153 [https://github.com/apache/spark/pull/23153] > Python UDFs in join condition fail even when using columns from only one side > of join > - > > Key: SPARK-26147 > URL: https://issues.apache.org/jira/browse/SPARK-26147 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Ala Luszczak >Priority: Major > Fix For: 3.0.0, 2.4.1 > > > The rule {{PullOutPythonUDFInJoinCondition}} was implemented in > [https://github.com/apache/spark/commit/2a8cbfddba2a59d144b32910c68c22d0199093fe] > As far as I understand, this rule was intended to prevent the use of Python > UDFs in join condition if they take arguments from both sides of the join, > and this doesn't make sense in combination with the join type. > The rule {{PullOutPythonUDFInJoinCondition}} seems to make an assumption, > that if a given UDF is only using columns from a single side of the join, it > will be already pushed down under the join before this rule is executed. > However, this is not always the case. 
Here's a simple example that fails, > even though it looks like it should run just fine (and it does in earlier > versions of Spark): > {code:java} > from pyspark.sql import Row > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > cars_list = [ Row("NL", "1234AB"), Row("UK", "987654") ] > insurance_list = [ Row("NL-1234AB"), Row("BE-112233") ] > spark.createDataFrame(data = cars_list, schema = ["country", > "plate_nr"]).createOrReplaceTempView("cars") > spark.createDataFrame(data = insurance_list, schema = > ["insurance_code"]).createOrReplaceTempView("insurance") > to_insurance_code = udf(lambda x, y: x + "-" + y, StringType()) > sqlContext.udf.register('to_insurance_code', to_insurance_code) > spark.conf.set("spark.sql.crossJoin.enabled", "true") > # This query runs just fine. > sql(""" > SELECT country, plate_nr, insurance_code > FROM cars LEFT OUTER JOIN insurance > ON CONCAT(country, '-', plate_nr) = insurance_code > """).show() > # This equivalent query fails with: > # pyspark.sql.utils.AnalysisException: u'Using PythonUDF in join condition of > join type LeftOuter is not supported.;' > sql(""" > SELECT country, plate_nr, insurance_code > FROM cars LEFT OUTER JOIN insurance > ON to_insurance_code(country, plate_nr) = insurance_code > """).show() > {code} > [~cloud_fan] [~XuanYuan] fyi -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26147) Python UDFs in join condition fail even when using columns from only one side of join
[ https://issues.apache.org/jira/browse/SPARK-26147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26147: --- Assignee: Wenchen Fan > Python UDFs in join condition fail even when using columns from only one side > of join > - > > Key: SPARK-26147 > URL: https://issues.apache.org/jira/browse/SPARK-26147 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Ala Luszczak >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > The rule {{PullOutPythonUDFInJoinCondition}} was implemented in > [https://github.com/apache/spark/commit/2a8cbfddba2a59d144b32910c68c22d0199093fe] > As far as I understand, this rule was intended to prevent the use of Python > UDFs in join condition if they take arguments from both sides of the join, > and this doesn't make sense in combination with the join type. > The rule {{PullOutPythonUDFInJoinCondition}} seems to make an assumption, > that if a given UDF is only using columns from a single side of the join, it > will be already pushed down under the join before this rule is executed. > However, this is not always the case. Here's a simple example that fails, > even though it looks like it should run just fine (and it does in earlier > versions of Spark): > {code:java} > from pyspark.sql import Row > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > cars_list = [ Row("NL", "1234AB"), Row("UK", "987654") ] > insurance_list = [ Row("NL-1234AB"), Row("BE-112233") ] > spark.createDataFrame(data = cars_list, schema = ["country", > "plate_nr"]).createOrReplaceTempView("cars") > spark.createDataFrame(data = insurance_list, schema = > ["insurance_code"]).createOrReplaceTempView("insurance") > to_insurance_code = udf(lambda x, y: x + "-" + y, StringType()) > sqlContext.udf.register('to_insurance_code', to_insurance_code) > spark.conf.set("spark.sql.crossJoin.enabled", "true") > # This query runs just fine. 
> sql(""" > SELECT country, plate_nr, insurance_code > FROM cars LEFT OUTER JOIN insurance > ON CONCAT(country, '-', plate_nr) = insurance_code > """).show() > # This equivalent query fails with: > # pyspark.sql.utils.AnalysisException: u'Using PythonUDF in join condition of > join type LeftOuter is not supported.;' > sql(""" > SELECT country, plate_nr, insurance_code > FROM cars LEFT OUTER JOIN insurance > ON to_insurance_code(country, plate_nr) = insurance_code > """).show() > {code} > [~cloud_fan] [~XuanYuan] fyi -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
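Since the UDF in the failing query reads columns only from the left side, its intended semantics can be sketched in plain Python without Spark. This is an illustrative model of what the left outer join should return (data and UDF logic taken from the report above), not of Spark's execution path:

```python
# Plain-Python model of the failing query's intended result. The UDF reads
# only left-side (cars) columns, which is why rejecting it looks too strict.
cars = [("NL", "1234AB"), ("UK", "987654")]
insurance = ["NL-1234AB", "BE-112233"]

def to_insurance_code(country, plate_nr):
    # same logic as the registered UDF: lambda x, y: x + "-" + y
    return country + "-" + plate_nr

rows = []
for country, plate_nr in cars:
    matches = [code for code in insurance
               if to_insurance_code(country, plate_nr) == code]
    # LEFT OUTER JOIN keeps unmatched left rows with a NULL right side
    rows.append((country, plate_nr, matches[0] if matches else None))
```

The NL car matches its insurance code while the UK car is kept with a null insurance_code, which is exactly what the equivalent CONCAT-based query returns.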
[jira] [Commented] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701818#comment-16701818 ] Apache Spark commented on SPARK-26198: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/23164 > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > 
[jira] [Created] (SPARK-26199) Long expressions cause mutate to fail
João Rafael created SPARK-26199: --- Summary: Long expressions cause mutate to fail Key: SPARK-26199 URL: https://issues.apache.org/jira/browse/SPARK-26199 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: João Rafael Calling {{mutate(df, field = expr)}} fails when {{expr}} is very long. Example: {code:R} df <- mutate(df, field = ifelse( lit(TRUE), lit("A"), ifelse( lit(T), lit("BB"), lit("C") ) )) {code} Stack trace: {code:R} FATAL subscript out of bounds at .handleSimpleError(function (obj) { level = sapply(class(obj), sw at FUN(X[[i]], ...) at lapply(seq_along(args), function(i) { if (ns[[i]] != "") { at lapply(seq_along(args), function(i) { if (ns[[i]] != "") { at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T {code} The root cause is in: [DataFrame.R#L2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182] When the expression is long, {{deparse}} returns multiple lines, causing {{args}} to have more elements than {{ns}}. The solution could be to set {{nlines = 1}} or to collapse the lines together. A simple workaround exists: first place the expression in a variable and use it instead: {code:R} tmp <- ifelse( lit(TRUE), lit("A"), ifelse( lit(T), lit("BB"), lit("C") ) ) df <- mutate(df, field = tmp) {code}
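The length mismatch described above can be modeled in plain Python (a hypothetical sketch, not SparkR's actual code; `ns` and `args` are the names from the stack trace): when {{deparse}} splits one expression across two lines, pairing argument names with deparsed arguments by position runs off the end of the name list, and collapsing the lines restores the invariant.

```python
# Hypothetical model of the SparkR bug: deparse() returned two lines for one
# long expression, so the argument list outgrows the name list `ns`.
ns = ["field"]                                       # one named argument
deparsed = ["ifelse(lit(TRUE), lit('A'),",           # deparse() line 1
            "ifelse(lit(T), lit('BB'), lit('C')))"]  # deparse() line 2

def pair(names, args):
    # positional pairing, as in the failing lapply(seq_along(args), ...)
    return [(names[i], args[i]) for i in range(len(args))]

try:
    pair(ns, deparsed)
    subscript_error = False
except IndexError:  # Python's analogue of R's "subscript out of bounds"
    subscript_error = True

# the suggested fix: collapse deparse's lines back into a single string,
# so names and arguments line up one-to-one again
collapsed = ["".join(deparsed)]
paired = pair(ns, collapsed)
```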
[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-26198: Description: How to reproduce this issue: {code:scala} scala> val meta = new org.apache.spark.sql.types.MetadataBuilder().putNull("key").build() java.lang.NullPointerException at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) {code} > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > scala> val meta = new > org.apache.spark.sql.types.MetadataBuilder().putNull("key").build() > java.lang.NullPointerException > at > org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) > at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
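The NPE presumably arises because {{toJsonValue}} dispatches on the runtime type of a value that is null. JSON itself has a native null, so a null-safe serializer can simply emit JSON null. A small Python sketch of the desired behavior (an illustration only, not Spark's Scala code):

```python
import json

# A null metadata value maps cleanly to JSON null once the serializer
# handles the null case explicitly instead of dispatching on its type.
meta = {"key": None}
serialized = json.dumps(meta)
roundtripped = json.loads(serialized)
```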
[jira] [Assigned] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26198: Assignee: (was: Apache Spark) > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > 
[jira] [Assigned] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26198: Assignee: Apache Spark > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > 
[jira] [Assigned] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26114: --- Assignee: Sergey Zhemzhitsky > Memory leak of PartitionedPairBuffer when coalescing after > repartitionAndSortWithinPartitions > - > > Key: SPARK-26114 > URL: https://issues.apache.org/jira/browse/SPARK-26114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2, 2.4.0 > Environment: Spark 3.0.0-SNAPSHOT (master branch) > Scala 2.11 > Yarn 2.7 >Reporter: Sergey Zhemzhitsky >Assignee: Sergey Zhemzhitsky >Priority: Major > Fix For: 2.4.1, 3.0.0 > > Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, > run1-noparams-dominator-tree-externalsorter.png, > run1-noparams-dominator-tree.png > > > Trying to use _coalesce_ after shuffle-oriented transformations leads to > OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X > GB of Y GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead_. > Discussion is > [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. > The error happens when trying to specify a pretty small number of partitions > in the _coalesce_ call. > *How to reproduce?* > # Start spark-shell > {code:bash} > spark-shell \ > --num-executors=5 \ > --executor-cores=2 \ > --master=yarn \ > --deploy-mode=client \ > --conf spark.executor.memoryOverhead=512 \ > --conf spark.executor.memory=1g \ > --conf spark.dynamicAllocation.enabled=false \ > --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError > -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' > {code} > Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap > memory usage seems to be the only way to control the amount of memory used > for shuffle data transfer for now. 
> Also note that the total number of cores allocated for the job is 5x2=10 > # Then generate some test data > {code:scala} > import org.apache.hadoop.io._ > import org.apache.hadoop.io.compress._ > import org.apache.commons.lang._ > import org.apache.spark._ > // generate 100M records of sample data > sc.makeRDD(1 to 1000, 1000) > .flatMap(item => (1 to 10) > .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) > -> new Text(RandomStringUtils.randomAlphanumeric(1024 > .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) > {code} > # Run the sample job > {code:scala} > import org.apache.hadoop.io._ > import org.apache.spark._ > import org.apache.spark.storage._ > val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) > rdd > .map(item => item._1.toString -> item._2.toString) > .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) > .coalesce(10,false) > .count > {code} > Note that the number of partitions is equal to the total number of cores > allocated to the job. > Here is the dominator tree from the heap dump > !run1-noparams-dominator-tree.png|width=700! > 4 instances of ExternalSorter, although there are only 2 concurrently running > tasks per executor. > !run1-noparams-dominator-tree-externalsorter.png|width=700! > And paths to the GC root of the already stopped ExternalSorter. > !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700!
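The resource math in the reproduction above, together with the leak symptom from the heap dump, can be written as a quick sanity check (plain Python, just arithmetic over the numbers quoted in the report):

```python
# 5 executors x 2 cores = 10 task slots, and coalesce(10) matches that,
# so each executor runs exactly 2 concurrent post-coalesce tasks.
num_executors = 5
executor_cores = 2
total_cores = num_executors * executor_cores

coalesce_partitions = 10
concurrent_tasks_per_executor = min(executor_cores,
                                    coalesce_partitions // num_executors)

# The heap dump, however, shows 4 live ExternalSorter instances on one
# executor: sorters outliving their 2 running tasks indicate the leak.
sorters_in_heap_dump = 4
leaked_sorters = sorters_in_heap_dump - concurrent_tasks_per_executor
```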
[jira] [Resolved] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26114. - Resolution: Fixed Fix Version/s: 2.4.1 3.0.0 Issue resolved by pull request 23083 [https://github.com/apache/spark/pull/23083] > Memory leak of PartitionedPairBuffer when coalescing after > repartitionAndSortWithinPartitions > - > > Key: SPARK-26114 > URL: https://issues.apache.org/jira/browse/SPARK-26114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2, 2.4.0 > Environment: Spark 3.0.0-SNAPSHOT (master branch) > Scala 2.11 > Yarn 2.7 >Reporter: Sergey Zhemzhitsky >Assignee: Sergey Zhemzhitsky >Priority: Major > Fix For: 3.0.0, 2.4.1 > > Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, > run1-noparams-dominator-tree-externalsorter.png, > run1-noparams-dominator-tree.png > > > Trying to use _coalesce_ after shuffle-oriented transformations leads to > OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X > GB of Y GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead_. > Discussion is > [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. > The error happens when trying to specify a pretty small number of partitions > in the _coalesce_ call. > *How to reproduce?* > # Start spark-shell > {code:bash} > spark-shell \ > --num-executors=5 \ > --executor-cores=2 \ > --master=yarn \ > --deploy-mode=client \ > --conf spark.executor.memoryOverhead=512 \ > --conf spark.executor.memory=1g \ > --conf spark.dynamicAllocation.enabled=false \ > --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError > -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' > {code} > Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap > memory usage seems to be the only way to control the amount of memory used > for shuffle data transfer for now. 
> Also note that the total number of cores allocated for the job is 5x2=10 > # Then generate some test data > {code:scala} > import org.apache.hadoop.io._ > import org.apache.hadoop.io.compress._ > import org.apache.commons.lang._ > import org.apache.spark._ > // generate 100M records of sample data > sc.makeRDD(1 to 1000, 1000) > .flatMap(item => (1 to 10) > .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) > -> new Text(RandomStringUtils.randomAlphanumeric(1024 > .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) > {code} > # Run the sample job > {code:scala} > import org.apache.hadoop.io._ > import org.apache.spark._ > import org.apache.spark.storage._ > val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) > rdd > .map(item => item._1.toString -> item._2.toString) > .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) > .coalesce(10,false) > .count > {code} > Note that the number of partitions is equal to the total number of cores > allocated to the job. > Here is the dominator tree from the heap dump > !run1-noparams-dominator-tree.png|width=700! > 4 instances of ExternalSorter, although there are only 2 concurrently running > tasks per executor. > !run1-noparams-dominator-tree-externalsorter.png|width=700! > And paths to the GC root of the already stopped ExternalSorter. > !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700!
[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-26198: Summary: Metadata serialize null values throw NPE (was: Metadata serialize null value throw NPE) > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > 
[jira] [Created] (SPARK-26198) Metadata serialize null value throw NPE
Yuming Wang created SPARK-26198: --- Summary: Metadata serialize null value throw NPE Key: SPARK-26198 URL: https://issues.apache.org/jira/browse/SPARK-26198 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang