[jira] [Commented] (SPARK-26059) Spark standalone mode, does not correctly record a failed Spark Job.

2018-11-28 Thread Prashant Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702809#comment-16702809
 ] 

Prashant Sharma commented on SPARK-26059:
-

This will be a won't fix, because fixing it would require the following:

1) One would have to make a REST call to the master on failure of the job, when 
running in client mode. Since the outcome of that REST call is itself not 
guaranteed, it can never be assured that the correct state of the failed 
application will always be reported.
2) A lot of things would have to change. For example, the standalone master 
currently does not track the progress of a job submitted in client mode; a new 
REST call that allows marking the status of the job in client mode would have to 
be introduced, and so on.


> Spark standalone mode, does not correctly record a failed Spark Job.
> 
>
> Key: SPARK-26059
> URL: https://issues.apache.org/jira/browse/SPARK-26059
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> To reproduce, submit a failing job to the Spark standalone master. The status 
> for the job is shown as FINISHED regardless of whether it failed or succeeded.
> EDIT: This happens only when deploy-mode is client; when deploy-mode is 
> cluster it works as expected.
> - Reported by: Surbhi Bakhtiyar.
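
For readers trying to reproduce this, a minimal sketch (an editorial addition, 
not part of the original report) of a deliberately failing application; the 
class name, jar name, and master URL are placeholders.
{code:java}
import org.apache.spark.sql.SparkSession

// Minimal failing application. Submit it twice against a standalone master,
// once with --deploy-mode client and once with --deploy-mode cluster, then
// compare the final status the master reports for the application:
//   spark-submit --class FailingJob --master spark://<master-host>:7077 \
//     --deploy-mode client failing-job.jar
object FailingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FailingJob").getOrCreate()
    try {
      // Run one job successfully, then fail the driver.
      spark.range(10).count()
      throw new RuntimeException("intentional failure")
    } finally {
      spark.stop()
    }
  }
}
{code}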






[jira] [Comment Edited] (SPARK-26059) Spark standalone mode, does not correctly record a failed Spark Job.

2018-11-28 Thread Prashant Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702809#comment-16702809
 ] 

Prashant Sharma edited comment on SPARK-26059 at 11/29/18 7:33 AM:
---

This will be a won't fix, because fixing it would require the following:

1) One would have to make a REST call to the master on failure of the job, i.e. 
on catching the exception. This bug applies only to the client deploy mode of 
standalone mode. Since the outcome of that REST call is itself not guaranteed, 
it can never be assured that the correct state of the failed application will 
always be reported.
2) A lot of things would have to change. For example, the standalone master 
currently does not track the progress of a job submitted in client mode; a new 
REST call that allows marking the status of the job in client mode would have to 
be introduced, and so on.



was (Author: prashant_):
This will be a won't fix, because fixing it would require the following:

1) One would have to make a REST call to the master on failure of the job, when 
running in client mode. Since the outcome of that REST call is itself not 
guaranteed, it can never be assured that the correct state of the failed 
application will always be reported.
2) A lot of things would have to change. For example, the standalone master 
currently does not track the progress of a job submitted in client mode; a new 
REST call that allows marking the status of the job in client mode would have to 
be introduced, and so on.


> Spark standalone mode, does not correctly record a failed Spark Job.
> 
>
> Key: SPARK-26059
> URL: https://issues.apache.org/jira/browse/SPARK-26059
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> To reproduce, submit a failing job to the Spark standalone master. The status 
> for the job is shown as FINISHED regardless of whether it failed or succeeded.
> EDIT: This happens only when deploy-mode is client; when deploy-mode is 
> cluster it works as expected.
> - Reported by: Surbhi Bakhtiyar.






[jira] [Resolved] (SPARK-26059) Spark standalone mode, does not correctly record a failed Spark Job.

2018-11-28 Thread Prashant Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-26059.
-
Resolution: Won't Fix

> Spark standalone mode, does not correctly record a failed Spark Job.
> 
>
> Key: SPARK-26059
> URL: https://issues.apache.org/jira/browse/SPARK-26059
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> To reproduce, submit a failing job to the Spark standalone master. The status 
> for the job is shown as FINISHED regardless of whether it failed or succeeded.
> EDIT: This happens only when deploy-mode is client; when deploy-mode is 
> cluster it works as expected.
> - Reported by: Surbhi Bakhtiyar.






[jira] [Assigned] (SPARK-26211) Fix InSet for binary, and struct and array with null.

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26211:


Assignee: (was: Apache Spark)

> Fix InSet for binary, and struct and array with null.
> -
>
> Key: SPARK-26211
> URL: https://issues.apache.org/jira/browse/SPARK-26211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently {{InSet}} doesn't work properly for binary type, or for struct and 
> array types with a null value in the set.
> For binary type, the {{HashSet}} doesn't work properly for {{Array[Byte]}}; for 
> struct and array types with a null value in the set, the {{ordering}} will 
> throw an {{NPE}}.






[jira] [Assigned] (SPARK-26211) Fix InSet for binary, and struct and array with null.

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26211:


Assignee: Apache Spark

> Fix InSet for binary, and struct and array with null.
> -
>
> Key: SPARK-26211
> URL: https://issues.apache.org/jira/browse/SPARK-26211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Currently {{InSet}} doesn't work properly for binary type, or for struct and 
> array types with a null value in the set.
> For binary type, the {{HashSet}} doesn't work properly for {{Array[Byte]}}; for 
> struct and array types with a null value in the set, the {{ordering}} will 
> throw an {{NPE}}.






[jira] [Commented] (SPARK-26211) Fix InSet for binary, and struct and array with null.

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702801#comment-16702801
 ] 

Apache Spark commented on SPARK-26211:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23176

> Fix InSet for binary, and struct and array with null.
> -
>
> Key: SPARK-26211
> URL: https://issues.apache.org/jira/browse/SPARK-26211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently {{InSet}} doesn't work properly for binary type, or for struct and 
> array types with a null value in the set.
> For binary type, the {{HashSet}} doesn't work properly for {{Array[Byte]}}; for 
> struct and array types with a null value in the set, the {{ordering}} will 
> throw an {{NPE}}.






[jira] [Created] (SPARK-26211) Fix InSet for binary, and struct and array with null.

2018-11-28 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26211:
-

 Summary: Fix InSet for binary, and struct and array with null.
 Key: SPARK-26211
 URL: https://issues.apache.org/jira/browse/SPARK-26211
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: Takuya Ueshin


Currently {{InSet}} doesn't work properly for binary type, or for struct and 
array types with a null value in the set.
For binary type, the {{HashSet}} doesn't work properly for {{Array[Byte]}}; for 
struct and array types with a null value in the set, the {{ordering}} will throw 
an {{NPE}}.
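
As a quick illustration of the binary-type part (an editorial sketch, not from 
the ticket): Scala's default equality for {{Array[Byte]}} is reference equality, 
so a hash set of byte arrays cannot find a structurally equal key.
{code:java}
// Plain Scala: arrays hash and compare by reference, so set membership fails
// even for structurally equal byte arrays.
val set = Set(Array[Byte](1, 2, 3))
println(set.contains(Array[Byte](1, 2, 3)))            // false -- reference equality
println(set.map(_.toSeq).contains(Seq[Byte](1, 2, 3))) // true  -- structural equality
{code}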






[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Description: 
Spark structured streaming with Kafka integration fails in update mode with a 
compilation exception in code generation. 
 Here's the code that was executed:
{code:java}
// code placeholder

override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()
 
  val kafkaParams = Map[String, String](
   "kafka.bootstrap.servers" -> "localhost:9092",
   "startingOffsets" -> "earliest",
   "subscribe" -> "test_events")
 
  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
  println(s"batch : ${batchId}")
  batch.show(false)
    }
    .outputMode("update")
    .start()

    query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when 
more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
    at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
    at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
    at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
    at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
    at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
    at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
    at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:432)
    at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:411)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:406)
    at 

[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Description: 
Spark structured streaming with Kafka integration fails in update mode with a 
compilation exception in code generation. 
 Here's the code that was executed:
{code:java}
// code placeholder

override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()
 
  val kafkaParams = Map[String, String](
   "kafka.bootstrap.servers" -> "localhost:9092",
   "startingOffsets" -> "earliest",
   "subscribe" -> "test_events")
 
  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
  println(s"batch : ${batchId}")
  batch.show(false)
    }
    .outputMode("update")
    .start()

    query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when 
more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
    at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
    at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
    at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
    at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
    at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
    at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
    at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:432)
    at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:411)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:406)
    at 

[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Description: 
Spark structured streaming with Kafka integration fails in update mode with a 
compilation exception in code generation. 
 Here's the code that was executed:
{code:java}
// code placeholder

override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()
 
  val kafkaParams = Map[String, String](
   "kafka.bootstrap.servers" -> "localhost:9092",
   "startingOffsets" -> "earliest",
   "subscribe" -> "test_events")
 
  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
  println(s"batch : ${batchId}")
  batch.show(false)
    }
    .outputMode("update")
    .start()

    query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when 
more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
    at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
    at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
    at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
    at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
    at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
    at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
    at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:432)
    at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:411)
    at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:406)
    at 

[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702757#comment-16702757
 ] 

indraneel r commented on SPARK-26206:
-

[~kabhwan] 
I have added the details in the description.

> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with Kafka integration fails in update mode with a 
> compilation exception in code generation. 
>  Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception when 
> more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)

[jira] [Created] (SPARK-26210) Streaming of syslog

2018-11-28 Thread Manoj (JIRA)
Manoj created SPARK-26210:
-

 Summary: Streaming of syslog
 Key: SPARK-26210
 URL: https://issues.apache.org/jira/browse/SPARK-26210
 Project: Spark
  Issue Type: Question
  Components: Spark Submit
Affects Versions: 2.0.2
Reporter: Manoj


Hi Team,

We are able to consume the data from local hosts using the Spark consumer.

We are planning to capture syslog from different hosts.

Is there any mechanism by which a Spark process can listen for incoming syslog 
TCP messages? (See the sketch below.)

Also, please let us know whether Flume can be replaced by Spark code.
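
An editorial sketch (not an official recommendation): the syslog daemons can 
forward to a Kafka topic (for example via rsyslog's omkafka output module), and 
Spark then consumes that topic with Structured Streaming; the topic name, broker 
address, and checkpoint path below are placeholders.
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SyslogIngest").getOrCreate()

// Consume syslog lines that have been forwarded into a Kafka topic.
val syslogLines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "syslog")
  .load()
  .selectExpr("CAST(value AS STRING) AS line")

// For a quick local test only, the socket source can read newline-delimited
// text from an existing TCP endpoint that Spark connects to:
//   spark.readStream.format("socket").option("host", "localhost").option("port", 5140).load()

val query = syslogLines.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/syslog-checkpoint")
  .start()
query.awaitTermination()
{code}
Whether Flume can be replaced depends on the delivery guarantees needed; Kafka 
plus Structured Streaming is a common substitute for that role.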






[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-28 Thread xuqianjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702714#comment-16702714
 ] 

xuqianjin commented on SPARK-23410:
---

[~maxgekk] Thank you very much. I'll get started on this as soon as possible.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files as UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which could read JSON files in UTF-16, UTF-32 and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the possibility to read JSON files in a specified charset and/or to 
> detect the charset automatically, as before.
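
For illustration (an editorial note, not from the ticket), the desired behavior 
with the attached utf16WithBOM.json is roughly the following; the explicit 
{{encoding}} option shown here is what later Spark releases added for this, so 
treat the exact option name and version as an assumption.
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("JsonCharset").getOrCreate()

// Read a UTF-16 JSON file with an explicit charset instead of assuming UTF-8.
// multiLine is used so the file is decoded as a whole.
val df = spark.read
  .option("encoding", "UTF-16")
  .option("multiLine", true)
  .json("utf16WithBOM.json")
df.show()
{code}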






[jira] [Commented] (SPARK-26142) Implement shuffle read metrics in SQL

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702687#comment-16702687
 ] 

Apache Spark commented on SPARK-26142:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/23175

> Implement shuffle read metrics in SQL
> -
>
> Key: SPARK-26142
> URL: https://issues.apache.org/jira/browse/SPARK-26142
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-26133) Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder

2018-11-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26133:
-

Assignee: Liang-Chi Hsieh

> Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to 
> OneHotEncoder
> --
>
> Key: SPARK-26133
> URL: https://issues.apache.org/jira/browse/SPARK-26133
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> We deprecated OneHotEncoder in Spark 2.3.0 and introduced 
> OneHotEncoderEstimator. In 3.0.0, we remove the deprecated OneHotEncoder and 
> rename OneHotEncoderEstimator to OneHotEncoder.






[jira] [Resolved] (SPARK-26133) Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder

2018-11-28 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-26133.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23100
[https://github.com/apache/spark/pull/23100]

> Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to 
> OneHotEncoder
> --
>
> Key: SPARK-26133
> URL: https://issues.apache.org/jira/browse/SPARK-26133
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> We deprecated OneHotEncoder in Spark 2.3.0 and introduced 
> OneHotEncoderEstimator. In 3.0.0, we remove the deprecated OneHotEncoder and 
> rename OneHotEncoderEstimator to OneHotEncoder.






[jira] [Updated] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch

2018-11-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26207:
--
Priority: Minor  (was: Major)

> add PowerIterationClustering  (PIC) doc in 2.4 branch
> -
>
> Key: SPARK-26207
> URL: https://issues.apache.org/jira/browse/SPARK-26207
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> add PIC documentation in docs/ml-clustering.md






[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-11-28 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702515#comment-16702515
 ] 

Reynold Xin commented on SPARK-24498:
-

Why don't we close the ticket? I heard we would get mixed performance results, 
and it doesn't seem worth adding 500 lines of code for something of unclear 
value.

 

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> time to compile than Janino. However, in other cases, Janino is better. We 
> should support both for our runtime codegen. Janino will still be our default 
> runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696






[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26188:

Target Version/s: 2.4.1

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Critical
>
> My team uses Spark to partition and output Parquet files to Amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that Spark 2.3.2 and prior read the partition values as 
> strings by default. However, in Spark 2.4.0 and later, the type of each 
> partition value is inferred by default, and partitions such as 00 become 0 and 
> 4d become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, 
> range: 0-662, 
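
An editorial note on a possible workaround (not from the report): partition 
column type inference can be disabled so that values such as 00 and 4d stay 
strings; the config key below is long-standing, but verify it in your version, 
and the bucket/prefix is a placeholder.
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("PartitionRead").getOrCreate()

// Keep partition values like "00" and "4d" as strings instead of inferring
// numeric types for them.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

val df = spark.read.parquet("s3a://my-bucket/prefix/")
df.select("suffix").distinct().show()
{code}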

[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26188:

Component/s: (was: Spark Core)
 SQL

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Critical
>
> My team uses Spark to partition and output Parquet files to Amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that Spark 2.3.2 and prior read the partition values as 
> strings by default. However, in Spark 2.4.0 and later, the type of each 
> partition value is inferred by default, and partitions such as 00 become 0 and 
> 4d become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
> 

[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26188:

Priority: Critical  (was: Minor)

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Critical
>
> My team uses Spark to partition and output Parquet files to Amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that Spark 2.3.2 and prior read the partition values as 
> strings by default. However, in Spark 2.4.0 and later, the type of each 
> partition value is inferred by default, and partitions such as 00 become 0 and 
> 4d become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, 
> 

[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702465#comment-16702465
 ] 

Jungtaek Lim commented on SPARK-26206:
--

[~indraneelrr]

Could you also provide how UserEvent is constructed, and the generated code if 
available?

You can turn on debug logging for 
"org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator" and rerun to 
get the full generated source code. I think that when the exception occurs, the 
generated code should be logged at INFO level.
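
For reference, a programmatic sketch of the logging change described above 
(assuming the log4j 1.x setup that ships with Spark 2.4); the equivalent 
conf/log4j.properties line is shown in the comment.
{code:java}
import org.apache.log4j.{Level, Logger}

// Equivalent to adding this line to conf/log4j.properties:
//   log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator=DEBUG
// With this enabled, the full generated Java source is written to the logs.
Logger.getLogger("org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator")
  .setLevel(Level.DEBUG)
{code}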

> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with Kafka integration fails in update mode with a 
> compilation exception in code generation. 
> Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception when 
> more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> 

[jira] [Created] (SPARK-26209) Allow for dataframe bucketization without Hive

2018-11-28 Thread Walt Elder (JIRA)
Walt Elder created SPARK-26209:
--

 Summary: Allow for dataframe bucketization without Hive
 Key: SPARK-26209
 URL: https://issues.apache.org/jira/browse/SPARK-26209
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Java API, SQL
Affects Versions: 2.4.0
Reporter: Walt Elder


As a DataFrame author, I can elect to bucketize my output without involving 
Hive or HMS, so that my hive-less environment can benefit from this 
query-optimization technique. 

 

https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397
 identifies this as a shortcoming of the umbrella feature provided via 
SPARK-19256.

 

In short, relying on Hive to store metadata *precludes* environments which 
don't have/use Hive from making use of bucketization features.
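
To make the gap concrete, here is a minimal sketch (the table name and output 
path are made up, behaviour as of Spark 2.4): bucketed writes are only accepted 
through {{saveAsTable}}, whose bucket metadata lives in the catalog, while a 
plain path-based save rejects {{bucketBy}}.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BucketingSketch").getOrCreate()
import spark.implicits._

val df = spark.range(0, 1000).withColumn("key", $"id" % 10)

// Bucketed writes currently go through saveAsTable, which records the bucket
// spec in the catalog (in practice the Hive metastore for persistent setups).
df.write.bucketBy(8, "key").sortBy("key").saveAsTable("bucketed_events")

// A plain path-based save rejects bucketBy today (AnalysisException), which is
// the gap this issue asks to close.
df.write.bucketBy(8, "key").sortBy("key").parquet("/tmp/bucketed_events")
{code}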



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702397#comment-16702397
 ] 

Apache Spark commented on SPARK-26194:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23174

> Support automatic spark.authenticate secret in Kubernetes backend
> -
>
> Key: SPARK-26194
> URL: https://issues.apache.org/jira/browse/SPARK-26194
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Currently k8s inherits the default behavior for {{spark.authenticate}}, which 
> is that the user must provide an auth secret.
> k8s doesn't have that requirement and could instead generate its own unique 
> per-app secret, and propagate it to executors.
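
Purely as an illustration of what "generate its own unique per-app secret" 
could look like (this is not the actual implementation; the helper name is 
made up), the driver side only needs something along these lines:

{code:java}
import java.security.SecureRandom
import java.util.Base64

// Hypothetical helper: a 256-bit random secret, URL-safe encoded so it can be
// stored in a Kubernetes secret and exposed to executors.
def generateAppSecret(bits: Int = 256): String = {
  val bytes = new Array[Byte](bits / 8)
  new SecureRandom().nextBytes(bytes)
  Base64.getUrlEncoder.withoutPadding.encodeToString(bytes)
}

val secret = generateAppSecret()
// The k8s backend would then propagate this value to driver and executors,
// e.g. via spark.authenticate.secret or an equivalent mechanism.
{code}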



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26194:


Assignee: (was: Apache Spark)

> Support automatic spark.authenticate secret in Kubernetes backend
> -
>
> Key: SPARK-26194
> URL: https://issues.apache.org/jira/browse/SPARK-26194
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Currently k8s inherits the default behavior for {{spark.authenticate}}, which 
> is that the user must provide an auth secret.
> k8s doesn't have that requirement and could instead generate its own unique 
> per-app secret, and propagate it to executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702395#comment-16702395
 ] 

Apache Spark commented on SPARK-26194:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23174

> Support automatic spark.authenticate secret in Kubernetes backend
> -
>
> Key: SPARK-26194
> URL: https://issues.apache.org/jira/browse/SPARK-26194
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Currently k8s inherits the default behavior for {{spark.authenticate}}, which 
> is that the user must provide an auth secret.
> k8s doesn't have that requirement and could instead generate its own unique 
> per-app secret, and propagate it to executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26194:


Assignee: Apache Spark

> Support automatic spark.authenticate secret in Kubernetes backend
> -
>
> Key: SPARK-26194
> URL: https://issues.apache.org/jira/browse/SPARK-26194
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> Currently k8s inherits the default behavior for {{spark.authenticate}}, which 
> is that the user must provide an auth secret.
> k8s doesn't have that requirement and could instead generate its own unique 
> per-app secret, and propagate it to executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-11-28 Thread Dave DeCaprio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702363#comment-16702363
 ] 

Dave DeCaprio commented on SPARK-23904:
---

I've created a pull request that will address this.  It limits the size of 
these debug strings.  https://github.com/apache/spark/pull/23169

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-11-28 Thread Dave DeCaprio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702362#comment-16702362
 ] 

Dave DeCaprio commented on SPARK-25380:
---

I've created a PR that should address this.  It limits the size of text plans 
that are created.  [https://github.com/apache/spark/pull/23169] 

 

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png
>
>
> When debugging an OOM exception during long run of a Spark application (many 
> iterations of the same code) I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyways.
> Attached are screenshots of OOM heap dump opened in JVisualVM.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26208:


Assignee: (was: Apache Spark)

> Empty dataframe does not roundtrip for csv with header
> --
>
> Key: SPARK-26208
> URL: https://issues.apache.org/jira/browse/SPARK-26208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date:   Sun Nov 18 16:02:15 2018 +0800
>Reporter: koert kuipers
>Priority: Minor
>
> When we write an empty part file for CSV with header=true, we fail to write the 
> header. The result cannot be read back in.
> When header=true, a part file with zero rows should still have a header.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702345#comment-16702345
 ] 

Apache Spark commented on SPARK-26208:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/23173

> Empty dataframe does not roundtrip for csv with header
> --
>
> Key: SPARK-26208
> URL: https://issues.apache.org/jira/browse/SPARK-26208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date:   Sun Nov 18 16:02:15 2018 +0800
>Reporter: koert kuipers
>Priority: Minor
>
> When we write an empty part file for CSV with header=true, we fail to write the 
> header. The result cannot be read back in.
> When header=true, a part file with zero rows should still have a header.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26208:


Assignee: Apache Spark

> Empty dataframe does not roundtrip for csv with header
> --
>
> Key: SPARK-26208
> URL: https://issues.apache.org/jira/browse/SPARK-26208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date:   Sun Nov 18 16:02:15 2018 +0800
>Reporter: koert kuipers
>Assignee: Apache Spark
>Priority: Minor
>
> When we write an empty part file for CSV with header=true, we fail to write the 
> header. The result cannot be read back in.
> When header=true, a part file with zero rows should still have a header.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702313#comment-16702313
 ] 

Apache Spark commented on SPARK-25957:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23172

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
> Fix For: 3.0.0
>
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script by default tries to build the spark-r image. We may not always build 
> a Spark distribution with R support. It would be good to skip building and 
> publishing spark-r images if R support is not available in the Spark 
> distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26205) Optimize In expression for bytes, shorts, ints

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702295#comment-16702295
 ] 

Apache Spark commented on SPARK-26205:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/23171

> Optimize In expression for bytes, shorts, ints
> --
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> Currently, {{In}} expressions are compiled into a sequence of if-else 
> statements, which results in O\(n\) time complexity. {{InSet}} is an 
> optimized version of {{In}}, which is supposed to improve the performance if 
> the number of elements is big enough. However, {{InSet}} actually degrades 
> the performance in many cases due to various reasons (benchmarks will be 
> available in SPARK-26203 and solutions are discussed in SPARK-26204).
> The main idea of this JIRA is to make use of {{tableswitch}} and 
> {{lookupswitch}} bytecode instructions. In short, we can improve our time 
> complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java 
> {{switch}} statements. We will have O\(1\) time complexity if our case values 
> are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will 
> give us O\(log n\). 
> An important benefit of the proposed approach is that we do not have to pay 
> an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we 
> can substantially outperform {{InSet}} even on 250+ elements.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.
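
For readers unfamiliar with the bytecode instructions, a tiny Scala sketch 
(illustrative only, not the generated code) of the two shapes being compared: 
a chain of comparisons versus a pattern match on {{Int}} literals, which the 
compiler can turn into a {{tableswitch}} or {{lookupswitch}}.

{code:java}
import scala.annotation.switch

// O(n): the shape a chain of if-else comparisons produces.
def containsIfElse(x: Int): Boolean =
  if (x == 1) true
  else if (x == 3) true
  else if (x == 7) true
  else if (x == 42) true
  else false

// O(1) or O(log n): literal cases compile to tableswitch/lookupswitch;
// @switch asks the compiler to warn if a real switch cannot be emitted.
def containsSwitch(x: Int): Boolean = (x: @switch) match {
  case 1 | 3 | 7 | 42 => true
  case _ => false
}
{code}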



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26205) Optimize In expression for bytes, shorts, ints

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702293#comment-16702293
 ] 

Apache Spark commented on SPARK-26205:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/23171

> Optimize In expression for bytes, shorts, ints
> --
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> Currently, {{In}} expressions are compiled into a sequence of if-else 
> statements, which results in O\(n\) time complexity. {{InSet}} is an 
> optimized version of {{In}}, which is supposed to improve the performance if 
> the number of elements is big enough. However, {{InSet}} actually degrades 
> the performance in many cases due to various reasons (benchmarks will be 
> available in SPARK-26203 and solutions are discussed in SPARK-26204).
> The main idea of this JIRA is to make use of {{tableswitch}} and 
> {{lookupswitch}} bytecode instructions. In short, we can improve our time 
> complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java 
> {{switch}} statements. We will have O\(1\) time complexity if our case values 
> are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will 
> give us O\(log n\). 
> An important benefit of the proposed approach is that we do not have to pay 
> an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we 
> can substantially outperform {{InSet}} even on 250+ elements.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26205) Optimize In expression for bytes, shorts, ints

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26205:


Assignee: Apache Spark

> Optimize In expression for bytes, shorts, ints
> --
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Assignee: Apache Spark
>Priority: Major
>
> Currently, {{In}} expressions are compiled into a sequence of if-else 
> statements, which results in O\(n\) time complexity. {{InSet}} is an 
> optimized version of {{In}}, which is supposed to improve the performance if 
> the number of elements is big enough. However, {{InSet}} actually degrades 
> the performance in many cases due to various reasons (benchmarks will be 
> available in SPARK-26203 and solutions are discussed in SPARK-26204).
> The main idea of this JIRA is to make use of {{tableswitch}} and 
> {{lookupswitch}} bytecode instructions. In short, we can improve our time 
> complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java 
> {{switch}} statements. We will have O\(1\) time complexity if our case values 
> are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will 
> give us O\(log n\). 
> An important benefit of the proposed approach is that we do not have to pay 
> an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we 
> can substantially outperform {{InSet}} even on 250+ elements.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26205) Optimize In expression for bytes, shorts, ints

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26205:


Assignee: (was: Apache Spark)

> Optimize In expression for bytes, shorts, ints
> --
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> Currently, {{In}} expressions are compiled into a sequence of if-else 
> statements, which results in O\(n\) time complexity. {{InSet}} is an 
> optimized version of {{In}}, which is supposed to improve the performance if 
> the number of elements is big enough. However, {{InSet}} actually degrades 
> the performance in many cases due to various reasons (benchmarks will be 
> available in SPARK-26203 and solutions are discussed in SPARK-26204).
> The main idea of this JIRA is to make use of {{tableswitch}} and 
> {{lookupswitch}} bytecode instructions. In short, we can improve our time 
> complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java 
> {{switch}} statements. We will have O\(1\) time complexity if our case values 
> are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will 
> give us O\(log n\). 
> An important benefit of the proposed approach is that we do not have to pay 
> an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we 
> can substantially outperform {{InSet}} even on 250+ elements.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame

2018-11-28 Thread mayur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702225#comment-16702225
 ] 

mayur commented on SPARK-14948:
---

I am also facing this issue. Any idea of the ETA would be great to know!

> Exception when joining DataFrames derived from the same DataFrame
> -
>
> Key: SPARK-14948
> URL: https://issues.apache.org/jira/browse/SPARK-14948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Saurabh Santhosh
>Priority: Major
>
> h2. Spark Analyser is throwing the following exception in a specific scenario:
> h2. Exception :
> org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing 
> from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, 
> Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, 
> Metadata.empty());
> JavaRDD<Row> rdd =
> 
> sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a",
>  "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new 
> StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as 
> asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, 
> "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> 
> DataFrame join = aliasedDf.join(df, 
> aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations :
> * This issue is related to the Data Type of Fields of the initial Data 
> Frame. (If the Data Type is not String, it will work.)
> * It works fine if the data frame is registered as a temporary table and an 
> SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

2018-11-28 Thread koert kuipers (JIRA)
koert kuipers created SPARK-26208:
-

 Summary: Empty dataframe does not roundtrip for csv with header
 Key: SPARK-26208
 URL: https://issues.apache.org/jira/browse/SPARK-26208
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: master branch,
commit 034ae305c33b1990b3c1a284044002874c343b4d,
date:   Sun Nov 18 16:02:15 2018 +0800

Reporter: koert kuipers


When we write an empty part file for CSV with header=true, we fail to write the header.
The result cannot be read back in.

When header=true, a part file with zero rows should still have a header.
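
A minimal reproduction sketch (the output path is arbitrary), assuming the 
behaviour described above:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EmptyCsvRoundtrip").getOrCreate()
import spark.implicits._

// Zero rows, but a known two-column schema.
val empty = spark.emptyDataset[(String, String)].toDF("a", "b")

// header=true should still put a header line into the (empty) part file,
// but per the report above it is written without one.
empty.write.option("header", "true").csv("/tmp/empty_csv")

// Reading back without an explicit schema then cannot recover the columns,
// so the round trip fails.
spark.read.option("header", "true").csv("/tmp/empty_csv").printSchema()
{code}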



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26103) OutOfMemory error with large query plans

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702213#comment-16702213
 ] 

Apache Spark commented on SPARK-26103:
--

User 'DaveDeCaprio' has created a pull request for this issue:
https://github.com/apache/spark/pull/23169

> OutOfMemory error with large query plans
> 
>
> Key: SPARK-26103
> URL: https://issues.apache.org/jira/browse/SPARK-26103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Amazon EMR 5.19
> 1 c5.4xlarge master instance
> 1 c5.4xlarge core instance
> 2 c5.4xlarge task instances
>Reporter: Dave DeCaprio
>Priority: Major
>
> Large query plans can cause OutOfMemory errors in the Spark driver.
> We are creating data frames that are not extremely large but contain lots of 
> nested joins.  These plans execute efficiently because of caching and 
> partitioning, but the text version of the query plans generated can be 
> hundreds of megabytes.  Running many of these in parallel causes our driver 
> process to fail.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at 
> java.util.Arrays.copyOfRange(Arrays.java:2694) at 
> java.lang.String.<init>(String.java:203) at 
> java.lang.StringBuilder.toString(StringBuilder.java:405) at 
> scala.StringContext.standardInterpolator(StringContext.scala:125) at 
> scala.StringContext.s(StringContext.scala:90) at 
> org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>  
>  
> A similar error is reported in 
> [https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format]
>  
> Code exists to truncate the string if the number of output columns is larger 
> than 25, but not if the rest of the query plan is huge.
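
As a sketch of the mitigation direction (the class name and character limit 
below are made up; the linked PR bounds the text as it is generated rather 
than after the fact), the idea is simply to never let a plan string grow 
without limit:

{code:java}
// Illustrative only: accumulate plan text into a bounded buffer so the driver
// never materializes a multi-hundred-megabyte plan string.
class BoundedPlanText(maxChars: Int = 100000) {
  private val sb = new StringBuilder
  private var truncated = false

  def append(fragment: String): Unit = {
    if (!truncated) {
      val remaining = maxChars - sb.length
      if (fragment.length <= remaining) {
        sb.append(fragment)
      } else {
        sb.append(fragment.substring(0, remaining))
        truncated = true
      }
    }
  }

  override def toString: String =
    if (truncated) sb.toString + "... [plan truncated]" else sb.toString
}
{code}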



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24423) Add a new option `query` for JDBC sources

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702228#comment-16702228
 ] 

Apache Spark commented on SPARK-24423:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/23170

> Add a new option `query` for JDBC sources
> -
>
> Key: SPARK-24423
> URL: https://issues.apache.org/jira/browse/SPARK-24423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, our JDBC connector provides the option `dbtable` for users to 
> specify the to-be-loaded JDBC source table. 
> {code} 
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", "dbName.tableName")
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  Normally, users do not fetch the whole JDBC table due to the poor 
> performance/throughput of JDBC. Thus, they normally just fetch a small set of 
> tables. Advanced users can pass a subquery as the option.
> {code} 
>  val query = """ (select * from tableName limit 10) as tmp """
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  However, this is not straightforward for end users. We should simply allow users 
> to specify the query by a new option `query`. We will handle the complexity 
> for them.
> {code} 
>  val query = """select * from tableName limit 10"""
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*{color:#ff}query{color}*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  Users are not allowed to specify query and dbtable at the same time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26103) OutOfMemory error with large query plans

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702219#comment-16702219
 ] 

Apache Spark commented on SPARK-26103:
--

User 'DaveDeCaprio' has created a pull request for this issue:
https://github.com/apache/spark/pull/23169

> OutOfMemory error with large query plans
> 
>
> Key: SPARK-26103
> URL: https://issues.apache.org/jira/browse/SPARK-26103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Amazon EMR 5.19
> 1 c5.4xlarge master instance
> 1 c5.4xlarge core instance
> 2 c5.4xlarge task instances
>Reporter: Dave DeCaprio
>Priority: Major
>
> Large query plans can cause OutOfMemory errors in the Spark driver.
> We are creating data frames that are not extremely large but contain lots of 
> nested joins.  These plans execute efficiently because of caching and 
> partitioning, but the text version of the query plans generated can be 
> hundreds of megabytes.  Running many of these in parallel causes our driver 
> process to fail.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at 
> java.util.Arrays.copyOfRange(Arrays.java:2694) at 
> java.lang.String.<init>(String.java:203) at 
> java.lang.StringBuilder.toString(StringBuilder.java:405) at 
> scala.StringContext.standardInterpolator(StringContext.scala:125) at 
> scala.StringContext.s(StringContext.scala:90) at 
> org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>  
>  
> A similar error is reported in 
> [https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format]
>  
> Code exists to truncate the string if the number of output columns is larger 
> than 25, but not if the rest of the query plan is huge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702205#comment-16702205
 ] 

Apache Spark commented on SPARK-26207:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23168

> add PowerIterationClustering  (PIC) doc in 2.4 branch
> -
>
> Key: SPARK-26207
> URL: https://issues.apache.org/jira/browse/SPARK-26207
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> add PIC documentation in docs/ml-clustering.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702204#comment-16702204
 ] 

Apache Spark commented on SPARK-26207:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23168

> add PowerIterationClustering  (PIC) doc in 2.4 branch
> -
>
> Key: SPARK-26207
> URL: https://issues.apache.org/jira/browse/SPARK-26207
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> add PIC documentation in docs/ml-clustering.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26207:


Assignee: (was: Apache Spark)

> add PowerIterationClustering  (PIC) doc in 2.4 branch
> -
>
> Key: SPARK-26207
> URL: https://issues.apache.org/jira/browse/SPARK-26207
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> add PIC documentation in docs/ml-clustering.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26207:


Assignee: Apache Spark

> add PowerIterationClustering  (PIC) doc in 2.4 branch
> -
>
> Key: SPARK-26207
> URL: https://issues.apache.org/jira/browse/SPARK-26207
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> add PIC documentation in docs/ml-clustering.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26207) add PowerIterationClustering (PIC) doc in 2.4 branch

2018-11-28 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-26207:
--

 Summary: add PowerIterationClustering  (PIC) doc in 2.4 branch
 Key: SPARK-26207
 URL: https://issues.apache.org/jira/browse/SPARK-26207
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.4.0
Reporter: Huaxin Gao


add PIC documentation in docs/ml-clustering.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Environment: 
Operating system : MacOS Mojave
 spark version : 2.4.0
spark-sql-kafka-0-10 : 2.4.0
 kafka version 1.1.1

scala version : 2.12.7

  was:
Operating system : MacOS Mojave
 spark version : 2.4.0
 spark-streaming-kafka-0-10 : 2.4.0
 kafka version 1.1.1

scala version : 2.12.7


> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with kafka integration fails in update mode with 
> compilation exception in code generation. 
> Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception when 
> more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
>     at 

[jira] [Commented] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702146#comment-16702146
 ] 

Apache Spark commented on SPARK-26024:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23167

> Dataset API: repartitionByRange(...) has inconsistent behaviour
> ---
>
> Key: SPARK-26024
> URL: https://issues.apache.org/jira/browse/SPARK-26024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Spark version 2.3.2
>Reporter: Julien Peloton
>Assignee: Julien Peloton
>Priority: Major
>  Labels: dataFrame, partitioning, repartition, spark-sql
> Fix For: 3.0.0
>
>
> Hi,
> I recently played with the {{repartitionByRange}} method for DataFrame 
> introduced in SPARK-22614. For DataFrames larger than the one tested in the 
> code (which has only 10 elements), the code sends back random results.
> As a test showing the inconsistent behaviour, I start from the unit test code 
> used to test {{repartitionByRange}} 
> ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352])
>  but increase the size of the initial array to 1000, repartition using 3 
> partitions, and count the number of elements per partition:
>  
> {code}
> // Shuffle numbers from 0 to 1000, and make a DataFrame
> val df = Random.shuffle(0.to(1000)).toDF("val")
> // Repartition it using 3 partitions
> // Sum up number of elements in each partition, and collect it.
> // And do it several times
> for (i <- 0 to 9) {
>   var counts = df.repartitionByRange(3, col("val"))
>     .mapPartitions{part => Iterator(part.size)}
>     .collect()
>   println(counts.toList)
> }
> // -> the number of elements in each partition varies...
> {code}
> I do not know whether it is expected (I will dig further in the code), but it 
> sounds like a bug.
>  Or do I just misinterpret what {{repartitionByRange}} is for?
>  Any ideas?
> Thanks!
>  Julien



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

indraneel r updated SPARK-26206:

Environment: 
Operating system : MacOS Mojave
 spark version : 2.4.0
 spark-streaming-kafka-0-10 : 2.4.0
 kafka version 1.1.1

scala version : 2.12.7

  was:
Operating system : MacOS Mojave
spark version : 2.4.0
spark-streaming-kafka-0-10 : 2.4.0
kafka version 1.1.1


> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
>  spark-streaming-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with kafka integration fails in update mode with 
> compilation exception in code generation. 
> Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with the following exception when 
> more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
>     at 

[jira] [Commented] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702118#comment-16702118
 ] 

Apache Spark commented on SPARK-26201:
--

User 'redsanket' has created a pull request for this issue:
https://github.com/apache/spark/pull/23166

> python broadcast.value on driver fails with disk encryption enabled
> ---
>
> Key: SPARK-26201
> URL: https://issues.apache.org/jira/browse/SPARK-26201
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was trying python with rpc and disk encryption enabled and when I tried a 
> python broadcast variable and just read the value back on the driver side the 
> job failed with:
>  
> Traceback (most recent call last):
>   File "broadcast.py", line 37, in <module>
>     words_new.value
>   File "/pyspark.zip/pyspark/broadcast.py", line 137, in value
>   File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path
>   File "pyspark.zip/pyspark/broadcast.py", line 128, in load
> EOFError: Ran out of input
> To reproduce use configs: --conf spark.network.crypto.enabled=true --conf 
> spark.io.encryption.enabled=true
>  
> Code:
> words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
>  words_new.value
>  print(words_new.value)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26205) Optimize In expression for bytes, shorts, ints

2018-11-28 Thread Anton Okolnychyi (JIRA)
Anton Okolnychyi created SPARK-26205:


 Summary: Optimize In expression for bytes, shorts, ints
 Key: SPARK-26205
 URL: https://issues.apache.org/jira/browse/SPARK-26205
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Anton Okolnychyi


Currently, {{In}} expressions are compiled into a sequence of if-else 
statements, which results in O\(n\) time complexity. {{InSet}} is an optimized 
version of {{In}}, which is supposed to improve the performance if the number 
of elements is big enough. However, {{InSet}} actually degrades the performance 
in many cases due to various reasons (benchmarks will be available in 
SPARK-26203 and solutions are discussed in SPARK-26204).

The main idea of this JIRA is to make use of {{tableswitch}} and 
{{lookupswitch}} bytecode instructions. In short, we can improve our time 
complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java 
{{switch}} statements. We will have O\(1\) time complexity if our case values 
are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will 
give us O\(log n\). 

An important benefit of the proposed approach is that we do not have to pay an 
extra cost for autoboxing as in case of {{InSet}}. As a consequence, we can 
substantially outperform {{InSet}} even on 250+ elements.

See 
[here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] 
and here for 
[more|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
 information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-28 Thread indraneel r (JIRA)
indraneel r created SPARK-26206:
---

 Summary: Spark structured streaming with kafka integration fails 
in update mode 
 Key: SPARK-26206
 URL: https://issues.apache.org/jira/browse/SPARK-26206
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
 Environment: Operating system : MacOS Mojave
spark version : 2.4.0
spark-streaming-kafka-0-10 : 2.4.0
kafka version 1.1.1
Reporter: indraneel r


Spark structured streaming with kafka integration fails in update mode with 
compilation exception in code generation. 
Here's the code that was executed:
{code:java}
// code placeholder

override def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder
    .master("local[*]")
    .appName("SparkStreamingTest")
    .getOrCreate()
 
  val kafkaParams = Map[String, String](
   "kafka.bootstrap.servers" -> "localhost:9092",
   "startingOffsets" -> "earliest",
   "subscribe" -> "test_events")
 
  val schema = Encoders.product[UserEvent].schema
  val query = spark.readStream.format("kafka")
    .options(kafkaParams)
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(from_json(col("message"), schema).as("json"))
    .select("json.*")
    .groupBy(window(col("event_time"), "10 minutes"))
    .count()
    .writeStream
    .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
  println(s"batch : ${batchId}")
  batch.show(false)
    }
    .outputMode("update")
    .start()

    query.awaitTermination()
}{code}
It succeeds for batch 0 but fails for batch 1 with the following exception when 
more data arrives in the stream.
{code:java}
18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, 
Column 18: A method named "putLong" is not declared in any enclosing class nor 
any supertype, nor through a static import
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
    at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
    at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
    at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
    at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
    at 
org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
    at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
    at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
    at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:981)
    at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:215)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:414)
    at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:406)
    at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1295)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:406)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1306)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:848)
    at 

[jira] [Updated] (SPARK-26205) Optimize In expression for bytes, shorts, ints

2018-11-28 Thread Anton Okolnychyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-26205:
-
Description: 
Currently, {{In}} expressions are compiled into a sequence of if-else 
statements, which results in O\(n\) time complexity. {{InSet}} is an optimized 
version of {{In}} that is supposed to improve performance when the number of 
elements is large enough. However, {{InSet}} actually degrades performance in 
many cases for various reasons (benchmarks will be available in SPARK-26203 
and solutions are discussed in SPARK-26204).

The main idea of this JIRA is to make use of {{tableswitch}} and 
{{lookupswitch}} bytecode instructions. In short, we can improve our time 
complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java 
{{switch}} statements. We will have O\(1\) time complexity if our case values 
are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will 
give us O\(log n\). 

An important benefit of the proposed approach is that we do not have to pay the 
extra autoboxing cost that {{InSet}} incurs. As a consequence, we can 
substantially outperform {{InSet}} even on 250+ elements.

See 
[here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] 
and 
[here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
 for more information.
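
To make the idea concrete, below is a hedged, plain-Scala sketch of the 
difference between chained comparisons and a switch-based lookup. It is not the 
actual Spark change (Spark would emit the switch in the Java code generated for 
{{In}}), and the case values are made up.

{code}
import scala.annotation.switch

// Chained equality checks, the shape In is compiled into today: O(n) comparisons.
def inLinear(v: Int): Boolean =
  v == 1 || v == 3 || v == 5 || v == 9

// A match over literal values lets the compiler emit a tableswitch (compact
// values, O(1)) or a lookupswitch (sparse values, O(log n)); @switch makes the
// compiler warn if it cannot emit a switch instruction at all.
def inSwitch(v: Int): Boolean = (v: @switch) match {
  case 1 | 3 | 5 | 9 => true
  case _             => false
}
{code}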

  was:
Currently, {{In}} expressions are compiled into a sequence of if-else 
statements, which results in O\(n\) time complexity. {{InSet}} is an optimized 
version of {{In}}, which is supposed to improve the performance if the number 
of elements is big enough. However, {{InSet}} actually degrades the performance 
in many cases due to various reasons (benchmarks will be available in 
SPARK-26203 and solutions are discussed in SPARK-26204).

The main idea of this JIRA is to make use of {{tableswitch}} and 
{{lookupswitch}} bytecode instructions. In short, we can improve our time 
complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java 
{{switch}} statements. We will have O\(1\) time complexity if our case values 
are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will 
give us O\(log n\). 

An important benefit of the proposed approach is that we do not have to pay an 
extra cost for autoboxing as in case of {{InSet}}. As a consequence, we can 
substantially outperform {{InSet}} even on 250+ elements.

See 
[here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] 
and here for 
[more|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
 information.


> Optimize In expression for bytes, shorts, ints
> --
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> Currently, {{In}} expressions are compiled into a sequence of if-else 
> statements, which results in O\(n\) time complexity. {{InSet}} is an 
> optimized version of {{In}}, which is supposed to improve the performance if 
> the number of elements is big enough. However, {{InSet}} actually degrades 
> the performance in many cases due to various reasons (benchmarks will be 
> available in SPARK-26203 and solutions are discussed in SPARK-26204).
> The main idea of this JIRA is to make use of {{tableswitch}} and 
> {{lookupswitch}} bytecode instructions. In short, we can improve our time 
> complexity from O\(n\) to O\(1\) or at least O\(log n\) by using Java 
> {{switch}} statements. We will have O\(1\) time complexity if our case values 
> are compact and {{tableswitch}} can be used. Otherwise, {{lookupswitch}} will 
> give us O\(log n\). 
> An important benefit of the proposed approach is that we do not have to pay 
> an extra cost for autoboxing as in case of {{InSet}}. As a consequence, we 
> can substantially outperform {{InSet}} even on 250+ elements.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26201:


Assignee: Apache Spark

> python broadcast.value on driver fails with disk encryption enabled
> ---
>
> Key: SPARK-26201
> URL: https://issues.apache.org/jira/browse/SPARK-26201
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> I was trying python with rpc and disk encryption enabled and when I tried a 
> python broadcast variable and just read the value back on the driver side the 
> job failed with:
>  
> Traceback (most recent call last): File "broadcast.py", line 37, in  
> words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value 
> File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File 
> "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of 
> input
> To reproduce use configs: --conf spark.network.crypto.enabled=true --conf 
> spark.io.encryption.enabled=true
>  
> Code:
> words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
>  words_new.value
>  print(words_new.value)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26201:


Assignee: (was: Apache Spark)

> python broadcast.value on driver fails with disk encryption enabled
> ---
>
> Key: SPARK-26201
> URL: https://issues.apache.org/jira/browse/SPARK-26201
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was trying python with rpc and disk encryption enabled and when I tried a 
> python broadcast variable and just read the value back on the driver side the 
> job failed with:
>  
> Traceback (most recent call last): File "broadcast.py", line 37, in  
> words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value 
> File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File 
> "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of 
> input
> To reproduce use configs: --conf spark.network.crypto.enabled=true --conf 
> spark.io.encryption.enabled=true
>  
> Code:
> words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
>  words_new.value
>  print(words_new.value)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26204) Optimize InSet expression

2018-11-28 Thread Anton Okolnychyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-26204:
-
Description: 
The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time 
complexity in the {{In}} expression. As {{InSet}} relies on Scala 
{{immutable.Set}}, it introduces expensive autoboxing. As a consequence, 
{{InSet}} might be significantly slower than {{In}} even on 100+ values.

We need to find a way to optimize {{InSet}} expressions and avoid the cost of 
autoboxing.

There are a few approaches that we can use:
 * Collections for primitive values (e.g., FastUtil, HPPC)
 * Type specialization in Scala (would it even work for code gen in Spark?)

I tried to use {{OpenHashSet}}, which is already available in Spark and uses 
type specialization. However, I did not manage to avoid autoboxing. On the 
other hand, FastUtil did work and I saw a substantial improvement in 
performance.

See the attached screenshot of what I experienced while testing.
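
A REPL-style Scala sketch of the boxing difference, for illustration only: it 
is not the code used in Spark, and it assumes the fastutil dependency is on the 
classpath.

{code}
import it.unimi.dsi.fastutil.ints.IntOpenHashSet

val values = (0 until 300).map(_ * 7)

// Scala's immutable Set stores java.lang.Integer, so building it and every
// contains() call pays for boxing.
val boxed: Set[Int] = values.toSet
boxed.contains(42)

// A primitive-int hash set keeps raw ints and probes without any allocation.
val primitive = new IntOpenHashSet()
values.foreach(v => primitive.add(v))
primitive.contains(42)
{code}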
 

  was:
The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time 
complexity in the `In` expression. As {{InSet}} relies on Scala 
{{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the 
performance of {{InSet}} might be significantly slower than {{In}} even on 100+ 
values.

We need to find an approach how to optimize {{InSet}} expressions and avoid the 
cost of autoboxing.

 There are a few approaches that we can use:
 * Collections for primitive values (e.g., FastUtil,  HPPC)
 * Type specialization in Scala (would it even work for code gen in Spark?)

I tried to use {{OpenHashSet}}, which is already available in Spark and uses 
type specialization. However, I did not manage to avoid autoboxing. On the 
other hand, FastUtil did work and I saw a substantial improvement in the 
performance.

See the attached screenshot of what I experienced while testing.
 


> Optimize InSet expression
> -
>
> Key: SPARK-26204
> URL: https://issues.apache.org/jira/browse/SPARK-26204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
> Attachments: heap size.png
>
>
> The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time 
> complexity in the {{In}} expression. As {{InSet}} relies on Scala 
> {{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the 
> performance of {{InSet}} might be significantly slower than {{In}} even on 
> 100+ values.
> We need to find an approach how to optimize {{InSet}} expressions and avoid 
> the cost of autoboxing.
>  There are a few approaches that we can use:
>  * Collections for primitive values (e.g., FastUtil,  HPPC)
>  * Type specialization in Scala (would it even work for code gen in Spark?)
> I tried to use {{OpenHashSet}}, which is already available in Spark and uses 
> type specialization. However, I did not manage to avoid autoboxing. On the 
> other hand, FastUtil did work and I saw a substantial improvement in the 
> performance.
> See the attached screenshot of what I experienced while testing.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26204) Optimize InSet expression

2018-11-28 Thread Anton Okolnychyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-26204:
-
Attachment: heap size.png

> Optimize InSet expression
> -
>
> Key: SPARK-26204
> URL: https://issues.apache.org/jira/browse/SPARK-26204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
> Attachments: heap size.png
>
>
> The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time 
> complexity in the `In` expression. As {{InSet}} relies on Scala 
> {{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the 
> performance of {{InSet}} might be significantly slower than {{In}} even on 
> 100+ values.
> We need to find an approach how to optimize {{InSet}} expressions and avoid 
> the cost of autoboxing.
>  There are a few approaches that we can use:
>  * Collections for primitive values (e.g., FastUtil,  HPPC)
>  * Type specialization in Scala (would it even work for code gen in Spark?)
> I tried to use {{OpenHashSet}}, which is already available in Spark and uses 
> type specialization. However, I did not manage to avoid autoboxing. On the 
> other hand, FastUtil did work and I saw a substantial improvement in the 
> performance.
> See the attached screenshot of what I experienced while testing.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26204) Optimize InSet expression

2018-11-28 Thread Anton Okolnychyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-26204:
-
Attachment: (was: fastutils.png)

> Optimize InSet expression
> -
>
> Key: SPARK-26204
> URL: https://issues.apache.org/jira/browse/SPARK-26204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time 
> complexity in the `In` expression. As {{InSet}} relies on Scala 
> {{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the 
> performance of {{InSet}} might be significantly slower than {{In}} even on 
> 100+ values.
> We need to find an approach how to optimize {{InSet}} expressions and avoid 
> the cost of autoboxing.
>  There are a few approaches that we can use:
>  * Collections for primitive values (e.g., FastUtil,  HPPC)
>  * Type specialization in Scala (would it even work for code gen in Spark?)
> I tried to use {{OpenHashSet}}, which is already available in Spark and uses 
> type specialization. However, I did not manage to avoid autoboxing. On the 
> other hand, FastUtil did work and I saw a substantial improvement in the 
> performance.
> See the attached screenshot of what I experienced while testing.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26204) Optimize InSet expression

2018-11-28 Thread Anton Okolnychyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-26204:
-
Attachment: fastutils.png

> Optimize InSet expression
> -
>
> Key: SPARK-26204
> URL: https://issues.apache.org/jira/browse/SPARK-26204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
> Attachments: fastutils.png
>
>
> The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time 
> complexity in the `In` expression. As {{InSet}} relies on Scala 
> {{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the 
> performance of {{InSet}} might be significantly slower than {{In}} even on 
> 100+ values.
> We need to find an approach how to optimize {{InSet}} expressions and avoid 
> the cost of autoboxing.
>  There are a few approaches that we can use:
>  * Collections for primitive values (e.g., FastUtil,  HPPC)
>  * Type specialization in Scala (would it even work for code gen in Spark?)
> I tried to use {{OpenHashSet}}, which is already available in Spark and uses 
> type specialization. However, I did not manage to avoid autoboxing. On the 
> other hand, FastUtil did work and I saw a substantial improvement in the 
> performance.
> See the attached screenshot of what I experienced while testing.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26204) Optimize InSet expression

2018-11-28 Thread Anton Okolnychyi (JIRA)
Anton Okolnychyi created SPARK-26204:


 Summary: Optimize InSet expression
 Key: SPARK-26204
 URL: https://issues.apache.org/jira/browse/SPARK-26204
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Anton Okolnychyi
 Attachments: fastutils.png

The {{InSet}} expression was introduced in SPARK-3711 to avoid O\(n\) time 
complexity in the `In` expression. As {{InSet}} relies on Scala 
{{immutable.Set}}, it introduces expensive autoboxing. As a consequence, the 
performance of {{InSet}} might be significantly slower than {{In}} even on 100+ 
values.

We need to find an approach how to optimize {{InSet}} expressions and avoid the 
cost of autoboxing.

 There are a few approaches that we can use:
 * Collections for primitive values (e.g., FastUtil,  HPPC)
 * Type specialization in Scala (would it even work for code gen in Spark?)

I tried to use {{OpenHashSet}}, which is already available in Spark and uses 
type specialization. However, I did not manage to avoid autoboxing. On the 
other hand, FastUtil did work and I saw a substantial improvement in the 
performance.

See the attached screenshot of what I experienced while testing.
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-11-28 Thread Anton Okolnychyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-26203:
-
Description: 
The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
are literals. This was done for performance reasons, to avoid O\(n\) time 
complexity for {{In}}.

The original optimization was done in SPARK-3711. A lot has changed since then 
(e.g., generation of Java code to evaluate expressions), so it is worth 
measuring the performance of this optimization again.

According to my local benchmarks, {{InSet}} can be up to 10x slower than 
{{In}} due to autoboxing and other issues.

The scope of this JIRA is to benchmark every supported data type inside {{In}} 
and {{InSet}} and outline the existing bottlenecks. Once we have this 
information, we can come up with solutions.

Based on my preliminary investigation, there are quite a few possible 
optimizations, and they frequently depend on the specific data type.
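
For reference, a REPL-style sketch of the kind of measurement meant here. It is 
not the benchmark that will be used for this JIRA: the table name, column, data 
size, and value list are all made up, and toggling the conversion threshold is 
simply a way to force the plan to use {{In}} or {{InSet}}.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("in-vs-inset-sketch").master("local[*]").getOrCreate()

// A toy integer column and a 300-value IN list.
spark.range(0L, 20000000L).selectExpr("id % 1000 AS v").createOrReplaceTempView("t")
val values = (0 until 300).mkString(", ")

def time(label: String)(body: => Unit): Unit = {
  val start = System.nanoTime()
  body
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
}

// Threshold above the number of values: the predicate stays an In expression.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 1000L)
time("In")(spark.sql(s"SELECT COUNT(*) FROM t WHERE v IN ($values)").collect())

// Threshold below the number of values: OptimizeIn rewrites the predicate to InSet.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 10L)
time("InSet")(spark.sql(s"SELECT COUNT(*) FROM t WHERE v IN ($values)").collect())
{code}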


  was:
{{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
are literals. This was done for performance reasons to avoid O(n) time 
complexity for {{In}}.

The original optimization was done in SPARK-3711. A lot has changed after that 
(e.g., generation of Java code to evaluate expressions), so it is worth to 
measure the performance of this optimization again.

According to my local benchmarks, {{InSet}} can be up to 10x time slower than 
{{In}} due to autoboxing and other issues.

The scope of this JIRA is to benchmark every supported data type inside {{In}} 
and {{InSet}} and outline existing bottlenecks. Once we have this information, 
we can come up with solutions. 

Based on my preliminary investigation, we can do quite some optimizations, 
which quite frequently depend on a specific data type.



> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons to avoid O\(n\) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed after 
> that (e.g., generation of Java code to evaluate expressions), so it is worth 
> to measure the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x time slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions. 
> Based on my preliminary investigation, we can do quite some optimizations, 
> which quite frequently depend on a specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702096#comment-16702096
 ] 

Apache Spark commented on SPARK-26188:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23165

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 

[jira] [Assigned] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26188:


Assignee: Apache Spark

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Assignee: Apache Spark
>Priority: Minor
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
> 

[jira] [Commented] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-28 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702095#comment-16702095
 ] 

Gengliang Wang commented on SPARK-26188:


[~ddgirard] Thanks for the investigation. I have created 
https://github.com/apache/spark/pull/23165 to fix it.
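
Until that pull request is released, a possible workaround (an assumption on my 
part, not something stated in this thread) is to disable partition column type 
inference so that values such as 00 and 4d are read back as strings. A 
REPL-style sketch, with an illustrative path and column name:

{code}
// Assumes a SparkSession named `spark`, as in spark-shell; the path is made up.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", false)

// Directories are named suffix=00 ... suffix=ff; with inference disabled the
// partition column comes back as StringType instead of being coerced to numbers.
val df = spark.read.parquet("s3a://bucket/prefix")
df.select("suffix").distinct().show()
{code}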

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 

[jira] [Assigned] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26188:


Assignee: (was: Apache Spark)

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
> 

[jira] [Updated] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-11-28 Thread Anton Okolnychyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-26203:
-
Description: 
{{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
are literals. This was done for performance reasons to avoid O(n) time 
complexity for {{In}}.

The original optimization was done in SPARK-3711. A lot has changed after that 
(e.g., generation of Java code to evaluate expressions), so it is worth to 
measure the performance of this optimization again.

According to my local benchmarks, {{InSet}} can be up to 10x time slower than 
{{In}} due to autoboxing and other issues.

The scope of this JIRA is to benchmark every supported data type inside {{In}} 
and {{InSet}} and outline existing bottlenecks. Once we have this information, 
we can come up with solutions. 

Based on my preliminary investigation, we can do quite some optimizations, 
which quite frequently depend on a specific data type.


  was:
{{OptimizeIn}} rule that replaces {{In}} with {{InSet}} if the number of 
possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all 
values are literals. This was done for performance reasons to avoid O(n) time 
complexity for {{In}}.

The original optimization was done in SPARK-3711. A lot has changed after that 
(e.g., generation of Java code to evaluate expressions), so it is worth to 
measure the performance of this optimization again.

According to my local benchmarks, {{InSet}} can be up to 10x time slower than 
{{In}} due to autoboxing and other issues.

The scope of this JIRA is to benchmark every supported data type inside {{In}} 
and {{InSet}} and outline existing bottlenecks. Once we have this information, 
we can come up with solutions. 

Based on my preliminary investigation, we can do quite some optimizations, 
which quite frequently depend on a specific data type.



> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons to avoid O(n) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed after 
> that (e.g., generation of Java code to evaluate expressions), so it is worth 
> to measure the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x time slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions. 
> Based on my preliminary investigation, we can do quite some optimizations, 
> which quite frequently depend on a specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-11-28 Thread Anton Okolnychyi (JIRA)
Anton Okolnychyi created SPARK-26203:


 Summary: Benchmark performance of In and InSet expressions
 Key: SPARK-26203
 URL: https://issues.apache.org/jira/browse/SPARK-26203
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Anton Okolnychyi


{{OptimizeIn}} rule that replaces {{In}} with {{InSet}} if the number of 
possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all 
values are literals. This was done for performance reasons to avoid O(n) time 
complexity for {{In}}.

The original optimization was done in SPARK-3711. A lot has changed after that 
(e.g., generation of Java code to evaluate expressions), so it is worth to 
measure the performance of this optimization again.

According to my local benchmarks, {{InSet}} can be up to 10x time slower than 
{{In}} due to autoboxing and other issues.

The scope of this JIRA is to benchmark every supported data type inside {{In}} 
and {{InSet}} and outline existing bottlenecks. Once we have this information, 
we can come up with solutions. 

Based on my preliminary investigation, we can do quite some optimizations, 
which quite frequently depend on a specific data type.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26202) R bucketBy

2018-11-28 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-26202:
--

 Summary: R bucketBy
 Key: SPARK-26202
 URL: https://issues.apache.org/jira/browse/SPARK-26202
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Add R version of DataFrameWriter.bucketBy
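
For reference, the existing Scala API that the R wrapper would mirror. This is 
only an illustrative sketch ({{df}}, the column, and the table name are made 
up); note that bucketing is currently only supported together with 
{{saveAsTable}}.

{code}
// Scala DataFrameWriter API to be mirrored in SparkR.
df.write
  .bucketBy(4, "id")            // number of buckets and the bucketing column(s)
  .sortBy("id")
  .saveAsTable("bucketed_table")
{code}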



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21291) R partitionBy API

2018-11-28 Thread Huaxin Gao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-21291:
---
Summary: R partitionBy API  (was: R bucketBy partitionBy API)
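
For reference, the writer-side {{partitionBy}} this issue asks for, as opposed 
to the {{Window.partitionBy}} noted in the quoted description below; {{df}}, the 
columns, and the output path are illustrative only.

{code}
// DataFrameWriter.partitionBy writes one directory per distinct value of the
// listed columns, e.g. year=2018/month=11/...
df.write.partitionBy("year", "month").parquet("/tmp/events")
{code}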

> R partitionBy API
> -
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25829) remove duplicated map keys with last wins policy

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25829:

Summary: remove duplicated map keys with last wins policy  (was: Duplicated 
map keys are not handled consistently)
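
A plain-Scala sketch of the difference between the "earlier entry wins" 
behavior shown in the quoted description below and the "last wins" policy 
adopted here; this only illustrates the semantics and is not Spark's 
implementation.

{code}
// The duplicated-key entries of map(1,2,1,3), in declaration order.
val entries = Seq(1 -> 2, 1 -> 3)

// "Last wins", the policy adopted by this change: a later entry overwrites an
// earlier one when the map is built.
val lastWins = entries.toMap              // Map(1 -> 3)

// "Earlier entry wins", the behavior shown in the quoted example.
val earlierWins = entries.reverse.toMap   // Map(1 -> 2)
{code}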

> remove duplicated map keys with last wins policy
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25829) Duplicated map keys are not handled consistently

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25829.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23124
[https://github.com/apache/spark/pull/23124]

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled

2018-11-28 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702046#comment-16702046
 ] 

Sanket Reddy commented on SPARK-26201:
--

Thanks [~tgraves], I will put up the patch shortly.

> python broadcast.value on driver fails with disk encryption enabled
> ---
>
> Key: SPARK-26201
> URL: https://issues.apache.org/jira/browse/SPARK-26201
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was trying python with rpc and disk encryption enabled and when I tried a 
> python broadcast variable and just read the value back on the driver side the 
> job failed with:
>  
> Traceback (most recent call last): File "broadcast.py", line 37, in  
> words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value 
> File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File 
> "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of 
> input
> To reproduce use configs: --conf spark.network.crypto.enabled=true --conf 
> spark.io.encryption.enabled=true
>  
> Code:
> words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
>  words_new.value
>  print(words_new.value)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25831) should apply "earlier entry wins" in hive map value converter

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25831.
-
Resolution: Invalid

> should apply "earlier entry wins" in hive map value converter
> -
>
> Key: SPARK-25831
> URL: https://issues.apache.org/jira/browse/SPARK-25831
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
> res11: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from t").show
> +--------+
> |     map|
> +--------+
> |[1 -> 3]|
> +--------+
> {code}
> We mistakenly apply "later entry wins"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25824) Remove duplicated map entries in `showString`

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25824.
-
Resolution: Fixed

> Remove duplicated map entries in `showString`
> -
>
> Key: SPARK-25824
> URL: https://issues.apache.org/jira/browse/SPARK-25824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `showString` doesn't eliminate the duplication. So, it looks different from 
> the result of `collect` and select from saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("SELECT map(1,2,1,3)").show
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> ++
> |   a|
> ++
> |[1 -> 3]|
> ++
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25830) should apply "earlier entry wins" in Dataset.collect

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25830.
-
Resolution: Invalid

> should apply "earlier entry wins" in Dataset.collect
> 
>
> Key: SPARK-25830
> URL: https://issues.apache.org/jira/browse/SPARK-25830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> scala> sql("select map(1,2,1,3)").collect
> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> {code}
> We mistakenly apply "later entry wins"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25829) Duplicated map keys are not handled consistently

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25829:
---

Assignee: Wenchen Fan

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled

2018-11-28 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702028#comment-16702028
 ] 

Thomas Graves commented on SPARK-26201:
---

The issue here seems to be that the file isn't decrypted before it is read; we 
will have a patch up for this shortly.

> python broadcast.value on driver fails with disk encryption enabled
> ---
>
> Key: SPARK-26201
> URL: https://issues.apache.org/jira/browse/SPARK-26201
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was trying python with rpc and disk encryption enabled and when I tried a 
> python broadcast variable and just read the value back on the driver side the 
> job failed with:
>  
> Traceback (most recent call last): File "broadcast.py", line 37, in  
> words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value 
> File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File 
> "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of 
> input
> To reproduce use configs: --conf spark.network.crypto.enabled=true --conf 
> spark.io.encryption.enabled=true
>  
> Code:
> words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
>  words_new.value
>  print(words_new.value)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly

2018-11-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25989.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22991
[https://github.com/apache/spark/pull/22991]

> OneVsRestModel handle empty outputCols incorrectly
> --
>
> Key: SPARK-25989
> URL: https://issues.apache.org/jira/browse/SPARK-25989
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> {{ml.classification.ClassificationModel}} will ignore empty output columns.
> However, {{OneVsRestModel}} still tries to append a new column even if its 
> name is an empty string.
> {code:java}
> scala> ovrModel.setPredictionCol("").transform(test).show
> +-+++---+
> |label| features| rawPrediction| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> scala> 
> ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show
> +-+++---+
> |label| features| raw| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> {code}
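A small, self-contained sketch (plain DataFrame API, not the actual ML internals; {{appendIfNamed}} is a made-up helper) of the "skip empty output columns" behaviour that {{ClassificationModel}} applies and that the report asks {{OneVsRestModel}} to follow:
{code:scala}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.lit

// Only append the column when the configured output name is non-empty.
def appendIfNamed(df: DataFrame, outputCol: String, value: Column): DataFrame =
  if (outputCol.nonEmpty) df.withColumn(outputCol, value) else df

// In spark-shell:
//   val df = spark.range(3).toDF("id")
//   appendIfNamed(df, "", lit(0.0)).columns      // Array(id)       -- nothing appended
//   appendIfNamed(df, "pred", lit(0.0)).columns  // Array(id, pred)
{code}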



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25989) OneVsRestModel handle empty outputCols incorrectly

2018-11-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25989:
-

Assignee: zhengruifeng

> OneVsRestModel handle empty outputCols incorrectly
> --
>
> Key: SPARK-25989
> URL: https://issues.apache.org/jira/browse/SPARK-25989
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> {{ml.classification.ClassificationModel}} will ignore empty output columns.
> However, {{OneVsRestModel}} still tries to append a new column even if its 
> name is an empty string.
> {code:java}
> scala> ovrModel.setPredictionCol("").transform(test).show
> +-+++---+
> |label| features| rawPrediction| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> scala> 
> ovrModel.setPredictionCol("").setRawPredictionCol("raw").transform(test).show
> +-+++---+
> |label| features| raw| |
> +-+++---+
> | 0.0|(4,[0,1,2,3],[-0|[-0.0965652626152...|2.0|
> | 0.0|(4,[0,1,2,3],[-0|[0.07880609384635...|2.0|
> | 0.0|(4,[0,1,2,3],[-1|[0.01891571586984...|2.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.72409973016524...|0.0|
> | 0.0|(4,[0,1,2,3],[0.1...|[0.48045978946729...|2.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[1.05496616040758...|0.0|
> | 0.0|(4,[0,1,2,3],[0.3...|[0.79508659065535...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.47437469552081...|0.0|
> | 0.0|(4,[0,1,2,3],[0.6...|[1.23302929670223...|0.0|
> | 0.0|(4,[0,1,2,3],[0.8...|[1.79816156359706...|0.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.1564309664080...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-3.2217906250571...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.9171126308553...|1.0|
> | 1.0|(4,[0,1,2,3],[-0|[-2.8316993051998...|1.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.6486206847760...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9252139721697...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9025379528484...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.8518243169707...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-1.0990190524225...|2.0|
> | 2.0|(4,[0,1,2,3],[-0|[-0.9973479746889...|2.0|
> +-+++---+
> only showing top 20 rows
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled

2018-11-28 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-26201:
-

 Summary: python broadcast.value on driver fails with disk 
encryption enabled
 Key: SPARK-26201
 URL: https://issues.apache.org/jira/browse/SPARK-26201
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.2
Reporter: Thomas Graves


I was trying Python with RPC and disk encryption enabled. When I created a 
Python broadcast variable and simply read the value back on the driver side, 
the job failed with:

Traceback (most recent call last):
  File "broadcast.py", line 37, in 
    words_new.value
  File "/pyspark.zip/pyspark/broadcast.py", line 137, in value
  File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path
  File "pyspark.zip/pyspark/broadcast.py", line 128, in load
EOFError: Ran out of input

To reproduce, use the configs: --conf spark.network.crypto.enabled=true --conf 
spark.io.encryption.enabled=true

 

Code:

words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
 words_new.value
 print(words_new.value)
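For completeness, a hedged Scala rendering of the same setup. The configuration keys come from the report; the app name and the programmatic builder usage are assumptions, and the failing read path reported above is PySpark's {{broadcast.py}}, not this Scala API:
{code:scala}
import org.apache.spark.sql.SparkSession

// Enable RPC and disk (I/O) encryption, matching the repro configs above.
val spark = SparkSession.builder()
  .appName("broadcast-encryption-repro")
  .config("spark.network.crypto.enabled", "true")
  .config("spark.io.encryption.enabled", "true")
  .getOrCreate()

// Create a broadcast variable and read the value back on the driver,
// which is the step that fails in the PySpark traceback above.
val wordsNew = spark.sparkContext.broadcast(Seq("scala", "java", "hadoop", "spark", "akka"))
println(wordsNew.value)
{code}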



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object

2018-11-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25998.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22995
[https://github.com/apache/spark/pull/22995]

> TorrentBroadcast holds strong reference to broadcast object
> ---
>
> Key: SPARK-25998
> URL: https://issues.apache.org/jira/browse/SPARK-25998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Brandon Krieger
>Assignee: Brandon Krieger
>Priority: Major
> Fix For: 3.0.0
>
>
> If we do a large number of broadcast joins while holding onto the Dataset 
> reference, it will hold onto a large amount of memory for the value of the 
> broadcast object. The broadcast object is also held in the MemoryStore, but 
> that will clean itself up to prevent its memory usage from going over a 
> certain level. In my use case, I don't want to release the reference to the 
> Dataset (which would allow the broadcast object to be GCed) because I want to 
> be able to unpersist it at some point in the future (when it is no longer 
> relevant).
> See the following repro in Spark shell:
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.SparkEnv
> val startDf = (1 to 100).toDF("num").withColumn("num", 
> $"num".cast("string")).cache()
> val leftDf = startDf.withColumn("num", concat($"num", lit("0")))
> val rightDf = startDf.withColumn("num", concat($"num", lit("1")))
> val broadcastJoinedDf = leftDf.join(broadcast(rightDf), 
> leftDf.col("num").eqNullSafe(rightDf.col("num")))
> broadcastJoinedDf.count
> // Take a heap dump, see UnsafeHashedRelation with hard references in 
> MemoryStore and Dataset
> // Force the MemoryStore to clear itself
> SparkEnv.get.blockManager.stop
> // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now 
> referenced only by the Dataset.
> {code}
> If we make the TorrentBroadcast hold a weak reference to the broadcast 
> object, the second heap dump will show nothing; the UnsafeHashedRelation has 
> been GCed.
> Given that the broadcast object can be reloaded from the MemoryStore, it 
> seems like it would be alright to use a WeakReference instead.
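A minimal sketch of the weak-reference-with-reload pattern the report suggests; the class and method names below are illustrative and are not the actual {{TorrentBroadcast}} members:
{code:scala}
import java.lang.ref.WeakReference

// Holds a value weakly and reloads it on demand (e.g. from the MemoryStore)
// if the garbage collector has already reclaimed it.
class WeaklyCachedValue[T <: AnyRef](load: () => T) {
  @volatile private var ref = new WeakReference[T](load())

  def value: T = {
    val cached = ref.get()
    if (cached != null) cached
    else {
      val reloaded = load()
      ref = new WeakReference[T](reloaded)
      reloaded
    }
  }
}

// Usage (illustrative): val relation = new WeaklyCachedValue(() => buildHashedRelation())
{code}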



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25998) TorrentBroadcast holds strong reference to broadcast object

2018-11-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25998:
-

Assignee: Brandon Krieger

> TorrentBroadcast holds strong reference to broadcast object
> ---
>
> Key: SPARK-25998
> URL: https://issues.apache.org/jira/browse/SPARK-25998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Brandon Krieger
>Assignee: Brandon Krieger
>Priority: Major
> Fix For: 3.0.0
>
>
> If we do a large number of broadcast joins while holding onto the Dataset 
> reference, it will hold onto a large amount of memory for the value of the 
> broadcast object. The broadcast object is also held in the MemoryStore, but 
> that will clean itself up to prevent its memory usage from going over a 
> certain level. In my use case, I don't want to release the reference to the 
> Dataset (which would allow the broadcast object to be GCed) because I want to 
> be able to unpersist it at some point in the future (when it is no longer 
> relevant).
> See the following repro in Spark shell:
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.SparkEnv
> val startDf = (1 to 100).toDF("num").withColumn("num", 
> $"num".cast("string")).cache()
> val leftDf = startDf.withColumn("num", concat($"num", lit("0")))
> val rightDf = startDf.withColumn("num", concat($"num", lit("1")))
> val broadcastJoinedDf = leftDf.join(broadcast(rightDf), 
> leftDf.col("num").eqNullSafe(rightDf.col("num")))
> broadcastJoinedDf.count
> // Take a heap dump, see UnsafeHashedRelation with hard references in 
> MemoryStore and Dataset
> // Force the MemoryStore to clear itself
> SparkEnv.get.blockManager.stop
> // Trigger GC, then take another Heap Dump. The UnsafeHashedRelation is now 
> referenced only by the Dataset.
> {code}
> If we make the TorrentBroadcast hold a weak reference to the broadcast 
> object, the second heap dump will show nothing; the UnsafeHashedRelation has 
> been GCed.
> Given that the broadcast object can be reloaded from the MemoryStore, it 
> seems like it would be alright to use a WeakReference instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26137) Linux file separator is hard coded in DependencyUtils used in deploy process

2018-11-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26137.
---
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0
   2.3.3

Issue resolved by pull request 23102
[https://github.com/apache/spark/pull/23102]

> Linux file separator is hard coded in DependencyUtils used in deploy process
> 
>
> Key: SPARK-26137
> URL: https://issues.apache.org/jira/browse/SPARK-26137
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Mark Pavey
>Assignee: Mark Pavey
>Priority: Major
> Fix For: 2.3.3, 3.0.0, 2.4.1
>
>
> During deployment, while downloading dependencies, the code tries to remove 
> multiple copies of the application jar from the driver classpath. The Linux 
> file separator ("/") is hard-coded here, so on Windows multiple copies of the 
> jar are not removed.
> This has a knock-on effect when trying to use elasticsearch-spark, as this 
> library does not run if there are multiple copies of the application jar on 
> the driver classpath.
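A minimal sketch of the general idea (not the actual {{DependencyUtils}} change): take the file name using the platform separator instead of a hard-coded "/", so the comparison also works on Windows. The helper names are made up for illustration:
{code:scala}
import java.io.File

// Last path component using the platform-specific separator.
def jarFileName(path: String): String =
  path.substring(path.lastIndexOf(File.separator) + 1)

// True when a classpath entry points at the same jar file name as the application jar.
def isApplicationJar(classpathEntry: String, appJar: String): Boolean =
  jarFileName(classpathEntry) == jarFileName(appJar)
{code}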



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26137) Linux file separator is hard coded in DependencyUtils used in deploy process

2018-11-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26137:
-

Assignee: Mark Pavey

> Linux file separator is hard coded in DependencyUtils used in deploy process
> 
>
> Key: SPARK-26137
> URL: https://issues.apache.org/jira/browse/SPARK-26137
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Mark Pavey
>Assignee: Mark Pavey
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> During deployment, while downloading dependencies, the code tries to remove 
> multiple copies of the application jar from the driver classpath. The Linux 
> file separator ("/") is hard-coded here, so on Windows multiple copies of the 
> jar are not removed.
> This has a knock-on effect when trying to use elasticsearch-spark, as this 
> library does not run if there are multiple copies of the application jar on 
> the driver classpath.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE

2018-11-28 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-26198:

Description: 
How to reproduce this issue:
{code}
scala> val meta = new 
org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json
java.lang.NullPointerException
  at 
org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
  at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
{code}

  was:
How to reproduce this issue:

{code:scala}
scala> val meta = new 
org.apache.spark.sql.types.MetadataBuilder().putNull("key").build()
java.lang.NullPointerException
  at 
org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
  at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
{code}



> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code}
> scala> val meta = new 
> org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
>   at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
> {code}
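For contrast, a small sketch of the same builder with a non-null value, which serializes without trouble; only {{putNull}} hits the NPE above:
{code:scala}
import org.apache.spark.sql.types.MetadataBuilder

val meta = new MetadataBuilder().putString("key", "value").build()
println(meta.json)  // {"key":"value"}
{code}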



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-28 Thread Yang Jie (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701964#comment-16701964
 ] 

Yang Jie commented on SPARK-26155:
--

[~Jk_Self] Can you try adding `-XX:+UseSuperWord` to 
spark.executor.extraJavaOptions to confirm whether L486 & L487 break the JVM's 
SIMD optimization?
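For reference, one way the flag could be wired in when building the session programmatically (a sketch: the configuration key and JVM flag are from the comment above, the app name is an assumption, and on YARN the same option can equally be passed via --conf at submit time):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tpcds-q19-superword-check")
  // JVM flag suggested above, passed through to the executors.
  .config("spark.executor.extraJavaOptions", "-XX:+UseSuperWord")
  .getOrCreate()
{code}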

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and found that the root cause is the community patch SPARK-21052, 
> which adds metrics to the hash join process. The affected code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
>  Q19 takes about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-28 Thread David Lyness (JIRA)
David Lyness created SPARK-26200:


 Summary: Column values are incorrectly transposed when a field in 
a PySpark Row requires serialization
 Key: SPARK-26200
 URL: https://issues.apache.org/jira/browse/SPARK-26200
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
 Environment: {noformat}
 __
/ __/__ ___ _/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/

Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
Branch
Compiled by user on 2018-10-29T06:22:05Z{noformat}
The same issue is observed when PySpark is run on both macOS 10.13.6 and CentOS 
7, so this appears to be a cross-platform issue.
Reporter: David Lyness


h2. Description of issue

Whenever a field in a PySpark {{Row}} requires serialization (such as a 
{{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
will assign column values *in alphabetical order*, rather than assigning each 
column value to its specified columns.
h3. Code to reproduce:
{code:java}
import datetime
from pyspark.sql import Row
from pyspark.sql.session import SparkSession
from pyspark.sql.types import DateType, StringType, StructField, StructType


spark = SparkSession.builder.getOrCreate()
schema = StructType([
StructField("date_column", DateType()),
StructField("my_b_column", StringType()),
StructField("my_a_column", StringType()),
])

spark.createDataFrame([Row(
date_column=datetime.date.today(),
my_b_column="my_b_value",
my_a_column="my_a_value"
)], schema).show()
{code}
h3. Expected result:
{noformat}
+-----------+-----------+-----------+
|date_column|my_b_column|my_a_column|
+-----------+-----------+-----------+
| 2018-11-28| my_b_value| my_a_value|
+-----------+-----------+-----------+{noformat}
h3. Actual result:
{noformat}
+-----------+-----------+-----------+
|date_column|my_b_column|my_a_column|
+-----------+-----------+-----------+
| 2018-11-28| my_a_value| my_b_value|
+-----------+-----------+-----------+{noformat}
(Note that {{my_a_value}} and {{my_b_value}} are transposed.)
h2. Analysis of issue

Reviewing [the relevant code on 
GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
 there are two relevant conditional blocks:

 
{code:java}
if self._needSerializeAnyField:
# Block 1, does not work correctly
else:
# Block 2, works correctly
{code}
{{Row}} is implemented as both a tuple of alphabetically-sorted columns, and a 
dictionary of named columns. In Block 2, there is a conditional that works 
specifically to serialize a {{Row}} object:

 
{code:java}
elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
return tuple(obj[n] for n in self.names)
{code}
There is no such condition in Block 1, so we fall into this instead:

 
{code:java}
elif isinstance(obj, (tuple, list)):
return tuple(f.toInternal(v) if c else v
for f, v, c in zip(self.fields, obj, self._needConversion))
{code}
The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
return a different ordering than the schema fields. So we end up with:
{code:java}
(date, date, True),
(b, a, False),
(a, b, False)
{code}
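A tiny standalone illustration (in Scala rather than the Python internals) of the positional-zip mismatch just described: the schema keeps declaration order while the {{Row}} hands its values back sorted alphabetically by field name:
{code:scala}
// Declaration order of the schema vs. alphabetical order of the Row's values.
val schemaFields = Seq("date_column", "my_b_column", "my_a_column")
val rowValues    = Seq("2018-11-28", "my_a_value", "my_b_value")

schemaFields.zip(rowValues).foreach(println)
// (date_column,2018-11-28)
// (my_b_column,my_a_value)   <- transposed
// (my_a_column,my_b_value)   <- transposed
{code}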
h2. Workarounds

Correct behaviour is observed if you use a Python {{list}} or {{dict}} instead 
of PySpark's {{Row}} object:

 
{code:java}
# Using a list works
spark.createDataFrame([[
datetime.date.today(),
"my_b_value",
"my_a_value"
]], schema)

# Using a dict also works
spark.createDataFrame([{
"date_column": datetime.date.today(),
"my_b_column": "my_b_value",
"my_a_column": "my_a_value"
}], schema){code}
Correct behaviour is also observed if you have no fields that require 
serialization; in this example, changing {{date_column}} to {{StringType}} 
avoids the correctness issue.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-28 Thread David Lyness (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lyness updated SPARK-26200:
-
Environment: 
Spark version 2.4.0

Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144

The same issue is observed when PySpark is run on both macOS 10.13.6 and CentOS 
7, so this appears to be a cross-platform issue.

  was:
{noformat}
 __
/ __/__ ___ _/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/

Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
Branch
Compiled by user on 2018-10-29T06:22:05Z{noformat}
The same issue is observed when PySpark is run on both macOS 10.13.6 and CentOS 
7, so this appears to be a cross-platform issue.


> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified columns.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +-----------+-----------+-----------+
> |date_column|my_b_column|my_a_column|
> +-----------+-----------+-----------+
> | 2018-11-28| my_b_value| my_a_value|
> +-----------+-----------+-----------+{noformat}
> h3. Actual result:
> {noformat}
> +-----------+-----------+-----------+
> |date_column|my_b_column|my_a_column|
> +-----------+-----------+-----------+
> | 2018-11-28| my_a_value| my_b_value|
> +-----------+-----------+-----------+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SPARK-26147) Python UDFs in join condition fail even when using columns from only one side of join

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26147.
-
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23153
[https://github.com/apache/spark/pull/23153]

> Python UDFs in join condition fail even when using columns from only one side 
> of join
> -
>
> Key: SPARK-26147
> URL: https://issues.apache.org/jira/browse/SPARK-26147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ala Luszczak
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
>
> The rule {{PullOutPythonUDFInJoinCondition}} was implemented in 
> [https://github.com/apache/spark/commit/2a8cbfddba2a59d144b32910c68c22d0199093fe].
>  As far as I understand, this rule was intended to prevent the use of Python 
> UDFs in a join condition if they take arguments from both sides of the join, 
> since that doesn't make sense in combination with the join type.
> The rule {{PullOutPythonUDFInJoinCondition}} seems to make the assumption 
> that if a given UDF only uses columns from a single side of the join, it 
> will already have been pushed down below the join before this rule is executed.
> However, this is not always the case. Here's a simple example that fails, 
> even though it looks like it should run just fine (and it does in earlier 
> versions of Spark):
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql.types import StringType
> from pyspark.sql.functions import udf
> cars_list = [ Row("NL", "1234AB"), Row("UK", "987654") ]
> insurance_list = [ Row("NL-1234AB"), Row("BE-112233") ]
> spark.createDataFrame(data = cars_list, schema = ["country", 
> "plate_nr"]).createOrReplaceTempView("cars")
> spark.createDataFrame(data = insurance_list, schema = 
> ["insurance_code"]).createOrReplaceTempView("insurance")
> to_insurance_code = udf(lambda x, y: x + "-" + y, StringType())   
> sqlContext.udf.register('to_insurance_code', to_insurance_code)
> spark.conf.set("spark.sql.crossJoin.enabled", "true")
> # This query runs just fine.
> sql("""
>   SELECT country, plate_nr, insurance_code
>   FROM cars LEFT OUTER JOIN insurance
>   ON CONCAT(country, '-', plate_nr) = insurance_code
> """).show()
> # This equivalent query fails with:
> # pyspark.sql.utils.AnalysisException: u'Using PythonUDF in join condition of 
> join type LeftOuter is not supported.;'
> sql("""
>   SELECT country, plate_nr, insurance_code
>   FROM cars LEFT OUTER JOIN insurance
>   ON to_insurance_code(country, plate_nr) = insurance_code
> """).show()
> {code}
> [~cloud_fan] [~XuanYuan] fyi



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26147) Python UDFs in join condition fail even when using columns from only one side of join

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26147:
---

Assignee: Wenchen Fan

> Python UDFs in join condition fail even when using columns from only one side 
> of join
> -
>
> Key: SPARK-26147
> URL: https://issues.apache.org/jira/browse/SPARK-26147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ala Luszczak
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> The rule {{PullOutPythonUDFInJoinCondition}} was implemented in 
> [https://github.com/apache/spark/commit/2a8cbfddba2a59d144b32910c68c22d0199093fe].
>  As far as I understand, this rule was intended to prevent the use of Python 
> UDFs in a join condition if they take arguments from both sides of the join, 
> since that doesn't make sense in combination with the join type.
> The rule {{PullOutPythonUDFInJoinCondition}} seems to make the assumption 
> that if a given UDF only uses columns from a single side of the join, it 
> will already have been pushed down below the join before this rule is executed.
> However, this is not always the case. Here's a simple example that fails, 
> even though it looks like it should run just fine (and it does in earlier 
> versions of Spark):
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql.types import StringType
> from pyspark.sql.functions import udf
> cars_list = [ Row("NL", "1234AB"), Row("UK", "987654") ]
> insurance_list = [ Row("NL-1234AB"), Row("BE-112233") ]
> spark.createDataFrame(data = cars_list, schema = ["country", 
> "plate_nr"]).createOrReplaceTempView("cars")
> spark.createDataFrame(data = insurance_list, schema = 
> ["insurance_code"]).createOrReplaceTempView("insurance")
> to_insurance_code = udf(lambda x, y: x + "-" + y, StringType())   
> sqlContext.udf.register('to_insurance_code', to_insurance_code)
> spark.conf.set("spark.sql.crossJoin.enabled", "true")
> # This query runs just fine.
> sql("""
>   SELECT country, plate_nr, insurance_code
>   FROM cars LEFT OUTER JOIN insurance
>   ON CONCAT(country, '-', plate_nr) = insurance_code
> """).show()
> # This equivalent query fails with:
> # pyspark.sql.utils.AnalysisException: u'Using PythonUDF in join condition of 
> join type LeftOuter is not supported.;'
> sql("""
>   SELECT country, plate_nr, insurance_code
>   FROM cars LEFT OUTER JOIN insurance
>   ON to_insurance_code(country, plate_nr) = insurance_code
> """).show()
> {code}
> [~cloud_fan] [~XuanYuan] fyi



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26198) Metadata serialize null values throw NPE

2018-11-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701818#comment-16701818
 ] 

Apache Spark commented on SPARK-26198:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/23164

> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26199) Long expressions cause mutate to fail

2018-11-28 Thread João Rafael (JIRA)
João Rafael created SPARK-26199:
---

 Summary: Long expressions cause mutate to fail
 Key: SPARK-26199
 URL: https://issues.apache.org/jira/browse/SPARK-26199
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: João Rafael


Calling {{mutate(df, field = expr)}} fails when expr is very long.

Example:

{code:R}
df <- mutate(df, field = ifelse(
lit(TRUE),
lit("A"),
ifelse(
lit(T),
lit("BB"),
lit("C")
)
))
{code}

Stack trace:

{code:R}
FATAL subscript out of bounds
  at .handleSimpleError(function (obj) 
{
level = sapply(class(obj), sw
  at FUN(X[[i]], ...)
  at lapply(seq_along(args), function(i) {
if (ns[[i]] != "") {

at lapply(seq_along(args), function(i) {
if (ns[[i]] != "") {

at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB
  at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T
{code}

The root cause is in 
[DataFrame.R#L2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182].

When the expression is long, {{deparse}} returns multiple lines, causing 
{{args}} to have more elements than {{ns}}. The solution could be to set 
{{nlines = 1}} or to collapse the lines together.

A simple workaround exists: first place the expression in a variable and use 
it instead:

{code:R}
tmp <- ifelse(
lit(TRUE),
lit("A"),
ifelse(
lit(T),
lit("BB"),
lit("C")
)
)
df <- mutate(df, field = tmp)
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE

2018-11-28 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-26198:

Description: 
How to reproduce this issue:

{code:scala}
scala> val meta = new 
org.apache.spark.sql.types.MetadataBuilder().putNull("key").build()
java.lang.NullPointerException
  at 
org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
  at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
{code}


> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> scala> val meta = new 
> org.apache.spark.sql.types.MetadataBuilder().putNull("key").build()
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
>   at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26198) Metadata serialize null values throw NPE

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26198:


Assignee: (was: Apache Spark)

> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26198) Metadata serialize null values throw NPE

2018-11-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26198:


Assignee: Apache Spark

> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26114:
---

Assignee: Sergey Zhemzhitsky

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Assignee: Sergey Zhemzhitsky
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, 
> run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead_.
> Discussion is 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].
> The error happens when trying to specify a pretty small number of partitions 
> in the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memoryOverhead=512 \
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing 
> off-heap memory usage currently seems to be the only way to control the 
> amount of memory used for transferring shuffle data.
> Also note that the total number of cores allocated to the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
> .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) 
> -> new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.
> Here is the dominator tree from the heap dump:
>  !run1-noparams-dominator-tree.png|width=700!
> There are 4 instances of ExternalSorter, although there are only 2 
> concurrently running tasks per executor.
>  !run1-noparams-dominator-tree-externalsorter.png|width=700! 
> And here are the paths to GC roots of an already stopped ExternalSorter.
>  !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 
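As a quick way to see how aggressively {{coalesce}} collapses the sorted output, here is a small spark-shell sketch using the same numbers as the repro above (1000 shuffle partitions coalesced into 10 tasks without a shuffle); the data is synthetic and only illustrates the partition grouping, not the memory behaviour itself:
{code:scala}
// Each of the 10 resulting tasks reads roughly 100 of the 1000 parent partitions.
val grouped = sc.parallelize(1 to 1000, 1000).coalesce(10, shuffle = false)
println(grouped.getNumPartitions)                                        // 10
println(grouped.mapPartitions(it => Iterator(it.size)).collect().toSeq)  // each around 100
{code}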



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-28 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26114.
-
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23083
[https://github.com/apache/spark/pull/23083]

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Assignee: Sergey Zhemzhitsky
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, 
> run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead_.
> Discussion is 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].
> The error happens when trying to specify a pretty small number of partitions 
> in the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memoryOverhead=512 \
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing 
> off-heap memory usage currently seems to be the only way to control the 
> amount of memory used for transferring shuffle data.
> Also note that the total number of cores allocated to the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
> .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) 
> -> new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.
> Here is the dominator tree from the heap dump:
>  !run1-noparams-dominator-tree.png|width=700!
> There are 4 instances of ExternalSorter, although there are only 2 
> concurrently running tasks per executor.
>  !run1-noparams-dominator-tree-externalsorter.png|width=700! 
> And here are the paths to GC roots of an already stopped ExternalSorter.
>  !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE

2018-11-28 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-26198:

Summary: Metadata serialize null values throw NPE  (was: Metadata serialize 
null value throw NPE)

> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26198) Metadata serialize null value throw NPE

2018-11-28 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-26198:
---

 Summary: Metadata serialize null value throw NPE
 Key: SPARK-26198
 URL: https://issues.apache.org/jira/browse/SPARK-26198
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


