[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569675#comment-16569675 ]

Sean Owen commented on SPARK-25029:
---

[~kiszk] and/or [~lrytz] I wonder if this could be related ... if so, it would be due to how methods defined in traits are now encoded in interfaces? https://github.com/janino-compiler/janino/issues/47#issuecomment-410574546

> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.4.0
> Reporter: Sean Owen
> Priority: Major
>
> We actually still have some test failures in the Scala 2.12 build. There seem to be two types. The first is that some tests fail with "TaskNotSerializable" because some code construct now captures a reference to scalatest's AssertionsHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode *** FAILED ***
>   java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper
>   Serialization stack:
>     - object not serializable (class: org.scalatest.Assertions$AssertionsHelper, value: org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear whether closure cleaning in 2.12 could be improved to detect this situation automatically; given that only a handful of tests fail for this reason, it's unlikely to be a systemic problem.
>
> The other error is curiouser. Janino fails to compile generated code in many cases, with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
>   java.lang.RuntimeException: Error while encoding: org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type{code}
>
> I include the full generated code that failed in one case below. There is no {{size()}} in the generated code. It's got to be down to some difference in Scala 2.12, potentially even a Janino problem.
>
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
>   at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
>   at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
>   ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
>   at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
>   at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
>   at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
>   at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
>   at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
>   at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
>   at org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094)
>   at org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4070)
>   at
> {code}
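The trait-encoding hypothesis above can be made concrete with a toy trait; the names below are illustrative only, not the real scala.collection hierarchy:
{code:scala}
// Toy illustration of the 2.11 -> 2.12 trait-encoding change behind the
// hypothesis above (illustrative names; the real clash involves
// scala.collection.TraversableOnce.size()).
trait HasSize {
  def size: Int = 0 // a concrete method defined in a trait
}
class Box extends HasSize

// Scala 2.11: HasSize compiles to an interface with an *abstract* size(),
// plus a static helper in HasSize$class that Box forwards to.
// Scala 2.12: HasSize compiles to a Java interface with a *default*
// (i.e. non-abstract) size() method, and no forwarder is needed.
// A Java-source compiler like Janino resolving `box.size()` can then meet
// the same non-abstract method along two interface paths and report:
// "Two non-abstract methods ... have the same parameter types,
// declaring type and return type".
{code}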
[jira] [Commented] (SPARK-20592) Alter table concatenate is not working as expected.
[ https://issues.apache.org/jira/browse/SPARK-20592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569673#comment-16569673 ]

Yuming Wang commented on SPARK-20592:
---

Spark doesn't support this command: [https://github.com/apache/spark/blob/73dd6cf9b558f9d752e1f3c13584344257ad7863/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L217]

> Alter table concatenate is not working as expected.
> ---
>
> Key: SPARK-20592
> URL: https://issues.apache.org/jira/browse/SPARK-20592
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0, 2.2.1, 2.3.1
> Reporter: Guru Prabhakar Reddy Marthala
> Priority: Major
> Labels: hive, pyspark
>
> Created a table using CTAS from CSV to Parquet. The Parquet table generated numerous small files. Tried ALTER TABLE ... CONCATENATE, but it's not working as expected.
>
> spark.sql("CREATE TABLE flight.flight_data(year INT, month INT, day INT, day_of_week INT, dep_time INT, crs_dep_time INT, arr_time INT, crs_arr_time INT, unique_carrier STRING, flight_num INT, tail_num STRING, actual_elapsed_time INT, crs_elapsed_time INT, air_time INT, arr_delay INT, dep_delay INT, origin STRING, dest STRING, distance INT, taxi_in INT, taxi_out INT, cancelled INT, cancellation_code STRING, diverted INT, carrier_delay STRING, weather_delay STRING, nas_delay STRING, security_delay STRING, late_aircraft_delay STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile")
> spark.sql("LOAD DATA LOCAL INPATH 'i:/2008/2008.csv' INTO TABLE flight.flight_data")
> spark.sql("CREATE TABLE flight.flight_data_pq STORED AS parquet AS SELECT * FROM flight.flight_data")
> spark.sql("CREATE TABLE flight.flight_data_orc STORED AS orc AS SELECT * FROM flight.flight_data")
>
> pyspark.sql.utils.ParseException: u'\nOperation not allowed: alter table concatenate(line 1, pos 0)\n\n== SQL ==\nalter table flight_data.flight_data_pq concatenate\n^^^\n'
>
> Tried both ORC and Parquet formats; it's not working.
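Since the parser rejects the statement outright, small files generally have to be compacted by rewriting the data. A hedged sketch of that workaround (the repartition count and the "_compacted" table name are my inventions, not part of the report):
{code:scala}
// Workaround sketch: rewrite the many small Parquet files into a few larger
// ones, since ALTER TABLE ... CONCATENATE is not in Spark's SQL grammar.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.table("flight.flight_data_pq")
  .repartition(8) // choose the target file count for the table's data volume
  .write
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("flight.flight_data_pq_compacted") // then swap/rename tables
{code}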
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569621#comment-16569621 ]

Kazuaki Ishizaki commented on SPARK-25029:
---

[~srowen] I see. The following parts of the generated code call that method; I will look into it. My first feeling is that the problem may be in the Scala collection library or in the Catalyst Java code generator.
{code}
...
/* 146 */ final int length_1 = MapObjects_loopValue140.size();
...
/* 315 */ final int length_0 = MapObjects_loopValue140.size();
...
{code}
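If the clash really is between the 2.12 collection interfaces rather than anything Spark generates, it should be reproducible outside Spark's codegen entirely. A hypothetical minimal repro, untested, assuming only janino and the Scala 2.12 library on the classpath:
{code:scala}
// Untested sketch: hand Janino a one-method Java class that calls size() on a
// Scala collection interface. If the hypothesis holds, cook() should fail with
// the same "Two non-abstract methods" InternalCompilerException, no Spark needed.
import org.codehaus.janino.SimpleCompiler

object JaninoSizeRepro {
  def main(args: Array[String]): Unit = {
    val compiler = new SimpleCompiler()
    compiler.cook(
      """public class Test {
        |  public static int len(scala.collection.Seq s) { return s.size(); }
        |}""".stripMargin)
    println(compiler.getClassLoader.loadClass("Test")) // unreachable if cook() throws
  }
}
{code}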
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569602#comment-16569602 ]

Stavros Kontopoulos commented on SPARK-25029:
---

[~srowen] I think the reason for this is similar to what I describe here: [https://github.com/apache/spark/pull/22004/files#r207753682]. An analysis of the lambda body could clean this. This is not captured in the design doc, but we can add it later. The missing part is to access all fields of the lambda and do the cleaning for any referenced object, if possible. [~lrytz] I guess this is possible, right?
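For reference, the fields of a 2.12 lambda can be enumerated through the SerializedLambda produced by its writeReplace method; a minimal sketch of that inspection (not Spark's actual cleaner code):
{code:scala}
// Sketch: list what a serializable 2.12 lambda actually captured. Anything
// non-serializable in this list is what sinks the task.
import java.lang.invoke.SerializedLambda

def capturedArgs(closure: AnyRef): Seq[AnyRef] = {
  // LambdaMetafactory-generated serializable lambdas declare writeReplace,
  // which returns a SerializedLambda describing the capture.
  val writeReplace = closure.getClass.getDeclaredMethod("writeReplace")
  writeReplace.setAccessible(true)
  writeReplace.invoke(closure) match {
    case sl: SerializedLambda => (0 until sl.getCapturedArgCount).map(sl.getCapturedArg)
    case _                    => Seq.empty // not an LMF lambda
  }
}

// Example: `helper` stands in for scalatest's AssertionsHelper.
val helper = new Object
val f: Int => Int = x => { helper.hashCode; x + 1 }
capturedArgs(f).foreach(a => println(a.getClass.getName))
{code}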
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569600#comment-16569600 ]

Sean Owen commented on SPARK-25029:
---

One of the 'task not serializable' issues is proving trickier:
{code:java}
[ERROR] runDTUsingStaticMethods(org.apache.spark.mllib.tree.JavaDecisionTreeSuite) Time elapsed: 0.375 s <<< ERROR!
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.mllib.tree.JavaDecisionTreeSuite.runDTUsingStaticMethods(JavaDecisionTreeSuite.java:81)
Caused by: java.io.NotSerializableException: org.apache.spark.ml.tree.impl.RandomForest$
Serialization stack:
- object not serializable (class: org.apache.spark.ml.tree.impl.RandomForest$, value: org.apache.spark.ml.tree.impl.RandomForest$@74899df1)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 7)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.ml.tree.impl.RandomForest$, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/ml/tree/impl/RandomForest$.$anonfun$findBestSplits$21:(Lorg/apache/spark/ml/tree/impl/RandomForest$;Lorg/apache/spark/ml/tree/impl/DecisionTreeMetadata;Lscala/collection/immutable/Map;Lscala/collection/immutable/Map;[[Lorg/apache/spark/ml/tree/Split;ILorg/apache/spark/broadcast/Broadcast;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=7])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$2726/1630334286, org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$2726/1630334286@17e04cc5)
at org.apache.spark.mllib.tree.JavaDecisionTreeSuite.runDTUsingStaticMethods(JavaDecisionTreeSuite.java:81){code}
It looks like something in org.apache.spark.ml.tree.impl.RandomForest.findBestSplits is capturing the containing RandomForest object, which isn't serializable. It's just an object with no fields, so we can trivially make it Serializable to resolve that. But this is a less trivial manifestation than the others.
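A hedged sketch of the shape described above, with toy names rather than the real RandomForest code:
{code:scala}
// Toy version of the capture Sean describes. When a lambda inside an object
// calls one of the object's own (e.g. private) methods, the 2.12 compiler can
// pass the module instance itself as capturedArgs(0), so the stateless object
// still has to be serializable. `extends Serializable` is the trivial fix.
object RandomForestLike extends Serializable {
  private def impurity(x: Int): Int = x * x

  def findBestSplitsLike(rdd: org.apache.spark.rdd.RDD[Int]): Long =
    // this lambda may carry RandomForestLike$ in its SerializedLambda
    rdd.mapPartitions(iter => iter.map(impurity)).count()
}
{code}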
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569592#comment-16569592 ]

Stavros Kontopoulos commented on SPARK-25029:
---

[~srowen] Cool, I will wait for the PR! Regarding the TaskNotSerializable issue, I agree. The specific example above, though, is not using a closure AFAIK; it is an ordinary class passed in some field. Could you point me to a failing test that uses a closure and needs cleaning (I haven't run every possible build, so I might be overlooking some)? I might be able to debug it then. Regarding the Janino thing, it is outside my area too, but compilation still takes Scala-generated classes into consideration. Thus, my understanding is that it is not just about Janino; it is also about how Janino treats/sees Scala-generated classes when it tries to compile the code listed above.
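The LegacyAccumulatorWrapper failure matches that "ordinary class passed in some field" description; a hedged sketch of the shape (hypothetical suite and test, not the actual Spark test code):
{code:scala}
// Sketch of the failing shape: an anonymous class declared inside a scalatest
// suite is an inner class, so serializing it drags in the suite instance and
// its non-serializable AssertionsHelper.
import org.apache.spark.AccumulatorParam
import org.scalatest.FunSuite

class LegacyWrapperLikeSuite extends FunSuite {
  test("accumulator param") {
    // Inner anonymous class: holds an implicit LegacyWrapperLikeSuite.this.
    val param = new AccumulatorParam[Long] {
      override def zero(initialValue: Long): Long = 0L
      override def addInPlace(r1: Long, r2: Long): Long = r1 + r2
    }
    // Typical fix: define `param` in a top-level object (no outer reference).
  }
}
{code}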
[jira] [Assigned] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25029:
---

Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25029:
---

Assignee: Apache Spark
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569591#comment-16569591 ]

Apache Spark commented on SPARK-25029:
---

User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/22004
[jira] [Comment Edited] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569588#comment-16569588 ]

Stavros Kontopoulos edited comment on SPARK-25029 at 8/5/18 11:00 PM:
---

[~srowen] Regarding the first one, I don't think cleaning is invoked anywhere, since the problem is a class capturing the environment (the assertions), unless I am missing something. What we need to verify is how things have changed in Scala 2.12 to make this fail now. [~lrytz] thoughts?

I think the second one comes up with ArrayType only. I got the same code output (in the spark shell) but couldn't find any size method either. What does have a size method are the GenericRow and the Array passed in the following example, which reproduces this easily (RowEncoderSuite):

  val schema = new StructType().add("a", ArrayType(TimestampType))
  val encoder = RowEncoder(schema)
  encoder.toRow(Row(Array("a")))

I am curious how this compares when compiled/run with 2.11. BytecodeUtilsSuite is also broken because it uses asm to check class contents, but with lambdas there are no classes generated (should not be hard to fix).
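A self-contained version of the repro above, with the imports filled in (my additions; this assumes Spark 2.x, where RowEncoder(schema) yields an ExpressionEncoder[Row] with a toRow method):
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{ArrayType, StructType, TimestampType}

object EncoderRepro {
  def main(args: Array[String]): Unit = {
    val schema  = new StructType().add("a", ArrayType(TimestampType))
    val encoder = RowEncoder(schema)
    // On Scala 2.12 this codegen path failed with Janino's
    // "Two non-abstract methods ... size()" InternalCompilerException.
    encoder.toRow(Row(Array("a")))
  }
}
{code}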
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569590#comment-16569590 ]

Sean Owen commented on SPARK-25029:
---

Yep, let me go ahead and open a PR with some fixes to get this going.

For "TaskNotSerializable" – because in at least some of the cases it's clear that the connection to the outer class is unused, in theory something could clean that link from the closure, I think. That said, I think it's OK to chalk this up to differences in how 2.12 compiles closures that might affect users, but not severely; as ever, it's a best practice to design code to capture only what's important anyway.

BytecodeUtilsSuite – yes, I think these tests should be skipped in the context of LMF and 2.12.

Good insight on ArrayType; yes, I noticed it seems to come up when generating an encoder for arrays. I will try to figure out anything else I can, though this much is a little outside my area.
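The "capture only what's important" practice mentioned above, as a small sketch (hypothetical class, not Spark code):
{code:scala}
// Copying a field to a local val keeps `this` out of the closure, so the
// non-serializable enclosing class never travels to the executors.
import org.apache.spark.SparkContext

class Job(sc: SparkContext, threshold: Int) { // Job itself is not Serializable
  def run(): Long = {
    val t = threshold // local copy: the lambda captures only an Int
    sc.parallelize(1 to 100).filter(_ > t).count()
    // filter(_ > threshold) instead would reference this.threshold and
    // capture the whole Job instance.
  }
}
{code}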
[jira] [Comment Edited] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569588#comment-16569588 ] Stavros Kontopoulos edited comment on SPARK-25029 at 8/5/18 10:52 PM: -- [~srowen] Regarding the first one I don't think cleaning is invoked anywhere since the problem is of a class capturing the env (assertions), unless I am missing something. What we need to verify is how things has changed for scala 2.12 and make this now failing. [~lrytz] thoughts? I think the second one comes up with ArrayType only. I got the same code output (in spark shell) but couldnt find any size method too. What has a size method is the GenericRow and Array passed in the following example that reproduced this easily (RowEncoderSuite): val schema = new StructType().add("a", ArrayType(TimestampType)) val encoder = RowEncoder(schema) encoder.toRow(Row(Array("a"))) I am curious how this compares if compiled/run with 2.11. BytecodeUtilsSuite is also broken because it uses asm to check class context, but with lambdas there are no classes generated (should not be hard to fix). was (Author: skonto): [~srowen] Regarding the first one I don't think cleaning is invoked anywhere since the problem is of a class capturing the env (assertions), unless I am missing something. What we need to verify is how things has changed for scala 2.12 and make this now failing. [~lrytz] thoughts? I think the second one comes up with ArrayType only. I got the same code output (in spark shell) but couldnt find any size method too. What has a size method is the GenericRow and Array passed in the following example that reproduced this easily (RowEncoderSuite): val schema = new StructType().add("a", ArrayType(TimestampType)) val encoder = RowEncoder(schema) encoder.toRow(Row(Array("a"))) I am curious how this compares if compiled/run with 2.11 or with a different version of janino. BytecodeUtilsSuite is also broken because it uses asm to check class context, but with lambdas there are no classes generated (should not be hard to fix). > Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods > ..." errors > --- > > Key: SPARK-25029 > URL: https://issues.apache.org/jira/browse/SPARK-25029 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > We actually still have some test failures in the Scala 2.12 build. There seem > to be two types. First are that some tests fail with "TaskNotSerializable" > because some code construct now captures a reference to scalatest's > AssertionHelper. Example: > {code:java} > - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode > *** FAILED *** java.io.NotSerializableException: > org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not > serializable (class: org.scalatest.Assertions$AssertionsHelper, value: > org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code} > These seem generally easy to fix by tweaking the test code. It's not clear if > something about closure cleaning in 2.12 could be improved to detect this > situation automatically; given that yet only a handful of tests fail for this > reason, it's unlikely to be a systemic problem. > > The other error is curioser. 
Janino fails to compile generate code in many > cases with errors like: > {code:java} > - encode/decode for seq of string: List(abc, xyz) *** FAILED *** > java.lang.RuntimeException: Error while encoding: > org.codehaus.janino.InternalCompilerException: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Two non-abstract methods "public int scala.collection.TraversableOnce.size()" > have the same parameter types, declaring type and return type{code} > > I include the full generated code that failed in one case below. There is no > {{size()}} in the generated code. It's got to be down to some difference in > Scala 2.12, potentially even a Janino problem. > > {code:java} > Caused by: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Two non-abstract methods "public int > scala.collection.TraversableOnce.size()" have the same parameter types, > declaring type and return type > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at
[jira] [Comment Edited] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569588#comment-16569588 ] Stavros Kontopoulos edited comment on SPARK-25029 at 8/5/18 10:43 PM: --

[~srowen] Regarding the first one, I don't think closure cleaning is invoked anywhere, since the problem is a class capturing its environment (the assertions), unless I am missing something. What we need to verify is what has changed in Scala 2.12 to make this fail now. [~lrytz] thoughts?

I think the second one comes up with ArrayType only. I got the same code output (in the spark shell) but couldn't find any size method either. What do have a size method are the GenericRow and the Array passed in the following example, which reproduced this easily (RowEncoderSuite):
{code:scala}
val schema = new StructType().add("a", ArrayType(TimestampType))
val encoder = RowEncoder(schema)
encoder.toRow(Row(Array("a")))
{code}
I am curious how this compares if compiled/run with 2.11 or with a different version of janino. BytecodeUtilsSuite is also broken because it uses asm to check the class context, but with lambdas there are no classes generated (should not be hard to fix).
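For reference, a self-contained version of that repro (a minimal sketch, assuming a Spark 2.4.0-SNAPSHOT build on Scala 2.12; the object name JaninoRepro is made up, and the imports are what RowEncoderSuite already has in scope):
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{ArrayType, StructType, TimestampType}

object JaninoRepro {
  def main(args: Array[String]): Unit = {
    val schema = new StructType().add("a", ArrayType(TimestampType))
    val encoder = RowEncoder(schema)
    // toRow triggers codegen; on the 2.12 build this is where Janino reports
    // "Two non-abstract methods ... TraversableOnce.size() ..." while
    // compiling the generated serializer.
    encoder.toRow(Row(Array("a")))
  }
}
{code}
No SparkSession is needed: RowEncoder and the Catalyst code generator work standalone, which keeps the repro small.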
[jira] [Assigned] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25019: Assignee: Apache Spark

> The published spark sql pom does not exclude the normal version of orc-core
>
> Key: SPARK-25019
> URL: https://issues.apache.org/jira/browse/SPARK-25019
> Project: Spark
> Issue Type: Bug
> Components: Build, SQL
> Affects Versions: 2.4.0
> Reporter: Yin Huai
> Assignee: Apache Spark
> Priority: Critical
>
> I noticed that [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] does not exclude the normal version of orc-core. Comparing with [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] and [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767], we only exclude the normal version of orc-core in the parent pom. So, the problem is that if a developer depends on spark-sql-core directly, orc-core and orc-core-nohive will both be in the dependency list.
[jira] [Assigned] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25019: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569587#comment-16569587 ] Apache Spark commented on SPARK-25019: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22003
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569579#comment-16569579 ] Sean Owen commented on SPARK-25029: --- [~skonto] [~kiszk] [~ueshin] I thought you might be interested in the second problem, with Janino. Still investigating, but I'm kind of stumped on what could even be causing it. Is it worth asking the janino folks?
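One way to take Spark out of the picture before asking the janino folks would be to feed Janino a tiny class that calls size() on a TraversableOnce, which is roughly what the generated serializer does. A speculative probe sketch, assuming Janino 3.x and a Scala 2.12 scala-library on the classpath (the class name Probe is made up, and whether this minimal case actually trips the bug is an open question):
{code:scala}
import org.codehaus.janino.SimpleCompiler

object JaninoProbe {
  def main(args: Array[String]): Unit = {
    val compiler = new SimpleCompiler()
    // Let Janino resolve scala.collection.TraversableOnce from our classpath.
    compiler.setParentClassLoader(getClass.getClassLoader)
    compiler.cook(
      """public class Probe {
        |  public static int probe(scala.collection.TraversableOnce t) {
        |    return t.size();  // the call Janino refuses to resolve on 2.12
        |  }
        |}""".stripMargin)
    println(compiler.getClassLoader.loadClass("Probe"))
  }
}
{code}
If the same error appears here, it would point at Janino's method resolution against the 2.12 trait encoding rather than at the generated code itself.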
[jira] [Created] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
Sean Owen created SPARK-25029: - Summary: Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors Key: SPARK-25029 URL: https://issues.apache.org/jira/browse/SPARK-25029 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0 Reporter: Sean Owen

We actually still have some test failures in the Scala 2.12 build. There seem to be two types. First, some tests fail with "TaskNotSerializable" because some code construct now captures a reference to scalatest's AssertionsHelper. Example:
{code:java}
- LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode *** FAILED *** java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not serializable (class: org.scalatest.Assertions$AssertionsHelper, value: org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)
{code}
These seem generally easy to fix by tweaking the test code. It's not clear if something about closure cleaning in 2.12 could be improved to detect this situation automatically; given that only a handful of tests fail for this reason, it's unlikely to be a systemic problem.

The other error is curiouser. Janino fails to compile generated code in many cases with errors like:
{code:java}
- encode/decode for seq of string: List(abc, xyz) *** FAILED *** java.lang.RuntimeException: Error while encoding: org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
{code}
I include the full generated code that failed in one case below. There is no {{size()}} in the generated code. It's got to be down to some difference in Scala 2.12, potentially even a Janino problem.
{code:java}
Caused by: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
... 30 more
Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094)
at org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4070)
at org.codehaus.janino.Java$ConditionalExpression.accept(Java.java:4344)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2559)
at org.codehaus.janino.UnitCompiler.access$2700(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$6.visitLocalVariableDeclarationStatement(UnitCompiler.java:1482)
at org.codehaus.janino.UnitCompiler$6.visitLocalVariableDeclarationStatement(UnitCompiler.java:1466)
at
{code}
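The shape of the first failure, and the usual test-side fix, look roughly like this (illustrative sketch only; the class, field and job are made up, and in a real ScalaTest suite it is the assertion machinery rather than a plain field that gets dragged in):
{code:scala}
import org.apache.spark.sql.SparkSession

class CaptureSketch /* imagine: extends FunSuite, mixing in Assertions */ {
  val expected = 42 // reading a field in a closure captures `this`, the whole suite

  def run(): Unit = {
    val sc = SparkSession.builder().master("local[*]").appName("capture")
      .getOrCreate().sparkContext

    // Under 2.12 a closure like this can serialize the enclosing suite and die
    // with NotSerializableException on AssertionsHelper:
    //   sc.parallelize(1 to 10).foreach(x => assert(x != expected))

    // Typical fix: copy the field into a local val so the closure captures
    // only the value, not the enclosing instance.
    val localExpected = expected
    sc.parallelize(1 to 10).foreach(x => require(x != localExpected))
  }
}
{code}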
[jira] [Comment Edited] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569571#comment-16569571 ] Dongjoon Hyun edited comment on SPARK-25019 at 8/5/18 9:04 PM: --- I'll make a PR soon. There is no reason for the parent pom and sql/core/pom to be different. While a new dependency was being removed earlier, the inheritance of the dependency exclusion was broken.
[jira] [Commented] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569566#comment-16569566 ] Apache Spark commented on SPARK-23772: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22002

> Provide an option to ignore column of all null values or empty map/array during JSON schema inference
>
> Key: SPARK-23772
> URL: https://issues.apache.org/jira/browse/SPARK-23772
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Xiangrui Meng
> Assignee: Takeshi Yamamuro
> Priority: Major
> Fix For: 2.4.0
>
> It is common to convert data from a JSON source to a structured format periodically. In the initial batch of JSON data, if a field's values are always null, Spark infers this field as StringType. However, in the second batch, a non-null value appears in this field and its type turns out not to be StringType. Then schema merging fails because of the inconsistency. This also applies to empty arrays and empty objects. My proposal is to provide an option in the Spark JSON source to omit those fields until we see a non-null value.
> This is similar to SPARK-12436 but the proposed solution is different.
> cc: [~rxin] [~smilegator]
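The inconsistency described there is easy to see in a local session (a minimal sketch; the object name NullFieldInference is made up, and the behavior shown is exactly what the description states):
{code:scala}
import org.apache.spark.sql.SparkSession

object NullFieldInference {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-infer").getOrCreate()
    import spark.implicits._

    // Batch 1: "b" is always null, so it is inferred as StringType.
    spark.read.json(Seq("""{"a": 1, "b": null}""").toDS()).printSchema()

    // Batch 2: a non-null value reveals the real type (LongType here), so a
    // schema merged across the two batches no longer lines up.
    spark.read.json(Seq("""{"a": 2, "b": 3}""").toDS()).printSchema()

    spark.stop()
  }
}
{code}
For what it's worth, the option that landed for 2.4 is, if I recall correctly, dropFieldIfAllNull, though the message above does not name it.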
[jira] [Commented] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569526#comment-16569526 ] Dongjoon Hyun commented on SPARK-25019: --- Sure, I'll take a look, [~yhuai].
[jira] [Assigned] (SPARK-24819) Fail fast when not enough slots to launch the barrier stage on job submitted
[ https://issues.apache.org/jira/browse/SPARK-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24819: Assignee: Apache Spark

> Fail fast when not enough slots to launch the barrier stage on job submitted
>
> Key: SPARK-24819
> URL: https://issues.apache.org/jira/browse/SPARK-24819
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Jiang Xingbo
> Assignee: Apache Spark
> Priority: Major
>
> Check all the barrier stages when a job is submitted, to see whether any barrier stage requires more slots (to be able to launch all the barrier tasks in the same stage together) than there are currently active slots in the cluster. If the job requires more slots than available (both busy and free slots), fail the job on submit.
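Described concretely, the proposed check boils down to something like the sketch below (purely illustrative; BarrierStageInfo and BarrierSlotChecker are made-up names, not Spark's actual internals):
{code:scala}
final case class BarrierStageInfo(stageId: Int, numTasks: Int)

class BarrierSlotChecker(totalSlots: Int) {
  // Fail fast at submission if any barrier stage needs more slots than the
  // cluster has in total (busy + free), since all barrier tasks in a stage
  // must be launched together.
  def checkOnJobSubmit(barrierStages: Seq[BarrierStageInfo]): Unit =
    barrierStages.foreach { stage =>
      if (stage.numTasks > totalSlots) {
        throw new IllegalArgumentException(
          s"Barrier stage ${stage.stageId} requires ${stage.numTasks} slots " +
            s"but the cluster only has $totalSlots; failing the job on submit.")
      }
    }
}
{code}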
[jira] [Assigned] (SPARK-24819) Fail fast when not enough slots to launch the barrier stage on job submitted
[ https://issues.apache.org/jira/browse/SPARK-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24819: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-24819) Fail fast when there are not enough slots to launch the barrier stage on job submission
[ https://issues.apache.org/jira/browse/SPARK-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569509#comment-16569509 ] Apache Spark commented on SPARK-24819: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/22001 > Fail fast when there are not enough slots to launch the barrier stage on job submission > --- > > Key: SPARK-24819 > URL: https://issues.apache.org/jira/browse/SPARK-24819 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Check all the barrier stages on job submission to see whether any barrier > stage requires more slots (to be able to launch all the barrier tasks in the > same stage together) than there are currently active slots in the cluster. If the job > requires more slots than are available (counting both busy and free slots), fail the job > on submit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
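[Editor's note] As an illustration only, the check described above amounts to something like the following sketch; the names are hypothetical and are not taken from the linked pull request:
{code:scala}
// Hypothetical sketch of the fail-fast check described in this ticket.
// requiredSlots: slots needed to run every task of a barrier stage together;
// totalSlots: all currently active slots in the cluster, busy or free.
def failFastOnBarrierStage(requiredSlots: Int, totalSlots: Int): Unit = {
  if (requiredSlots > totalSlots) {
    throw new IllegalStateException(
      s"Barrier stage requires $requiredSlots slots, but only $totalSlots " +
      "are active in the cluster; failing the job on submit.")
  }
}
{code}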
[jira] [Commented] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569491#comment-16569491 ] Cody Koeninger commented on SPARK-24987: It was merged to branch-2.3 https://github.com/apache/spark/commits/branch-2.3 > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
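[Editor's note] For readers unfamiliar with the consumer pool, here is a simplified sketch of the acquire decision quoted above; the types are illustrative stand-ins, not the real KafkaDataConsumer internals:
{code:scala}
// Simplified, hypothetical model of the caching decision described above.
final class InternalConsumer { var inUse: Boolean = false }
sealed trait DataConsumer
case class CachedKafkaDataConsumer(c: InternalConsumer) extends DataConsumer
case class NonCachedKafkaDataConsumer(c: InternalConsumer) extends DataConsumer

def acquire(cached: Option[InternalConsumer]): DataConsumer = cached match {
  // Cached consumer exists but is still flagged in use: hand out a fresh,
  // non-cached consumer. Per this report, the flag apparently stays set for
  // tasks that finished long ago, so this path keeps creating new consumers.
  case Some(existing) if existing.inUse => NonCachedKafkaDataConsumer(new InternalConsumer)
  case Some(existing) => existing.inUse = true; CachedKafkaDataConsumer(existing)
  case None => NonCachedKafkaDataConsumer(new InternalConsumer)
}
{code}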
[jira] [Updated] (SPARK-25026) Binary releases should contain some copy of compiled external integration modules
[ https://issues.apache.org/jira/browse/SPARK-25026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25026: -- Summary: Binary releases should contain some copy of compiled external integration modules (was: Binary releases don't contain Kafka integration modules) Taking this a slightly different direction: what about including the compiled binary modules in an "external-jars/" dir in the binary release? Right now, these modules aren't in the binary release at all, which seems odd for a full binary release of Spark. That much wouldn't entail any behavior change at all. > Binary releases should contain some copy of compiled external integration > modules > - > > Key: SPARK-25026 > URL: https://issues.apache.org/jira/browse/SPARK-25026 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
[ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569460#comment-16569460 ] rajanimaski commented on SPARK-20696: - [~adiubc] Here's the custom code I developed for Scala k-means that extends the Spark MLlib libraries. Note that some methods are duplicated from the existing Spark k-means because they were not extensible (private, in other words): [https://github.com/rajanim/selective-search/tree/master/src/main/scala/org/sfsu/cs/clustering/kmeans] From what I have heard, PySpark's k-means works as intended, so I am assuming it has its own k-means implementation rather than calling into the Scala k-means libraries. > tf-idf document clustering with K-means in Apache Spark putting points into > one cluster > --- > > Key: SPARK-20696 > URL: https://issues.apache.org/jira/browse/SPARK-20696 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nassir >Priority: Major > > I am trying to do the classic job of clustering text documents by > pre-processing, generating a tf-idf matrix, and then applying K-means. However, > testing this workflow on the classic 20NewsGroup dataset results in most > documents being clustered into one cluster. (I initially tried to > cluster all documents from 6 of the 20 groups, so I expected 6 clusters.) > I am implementing this in Apache Spark as my purpose is to utilise this > technique on millions of documents. Here is the code, written in PySpark on > Databricks: > #declare path to folder containing 6 of 20 news group categories > path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % > MOUNT_NAME > #read all the text files from the 6 folders. Each entity is an entire > document. > text_files = sc.wholeTextFiles(path).cache() > #convert rdd to dataframe > df = text_files.toDF(["filePath", "document"]).cache() > from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer > #tokenize the document text > tokenizer = Tokenizer(inputCol="document", outputCol="tokens") > tokenized = tokenizer.transform(df).cache() > from pyspark.ml.feature import StopWordsRemover > remover = StopWordsRemover(inputCol="tokens", > outputCol="stopWordsRemovedTokens") > stopWordsRemoved_df = remover.transform(tokenized).cache() > hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", > outputCol="rawFeatures", numFeatures=20) > tfVectors = hashingTF.transform(stopWordsRemoved_df).cache() > idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5) > idfModel = idf.fit(tfVectors) > tfIdfVectors = idfModel.transform(tfVectors).cache() > #note that I have also tried to use normalized data, but get the same result > from pyspark.ml.feature import Normalizer > from pyspark.ml.linalg import Vectors > normalizer = Normalizer(inputCol="features", outputCol="normFeatures") > l2NormData = normalizer.transform(tfIdfVectors) > from pyspark.ml.clustering import KMeans > # Trains a KMeans model. > kmeans = KMeans().setK(6).setMaxIter(20) > km_model = kmeans.fit(l2NormData) > clustersTable = km_model.transform(l2NormData) > [output showing most documents get clustered into cluster 0] > ID number_of_documents_in_cluster > 0 3024 > 3 5 > 1 3 > 5 2 > 2 2 > 4 1 > As you can see, most of my data points get clustered into cluster 0, and I > cannot figure out what I am doing wrong, as all the tutorials and code I have > come across online point to using this method.
> I have also tried normalizing the tf-idf matrix before K-means, > but that produces the same result. I know cosine distance is a better > measure to use, but I expected standard K-means in Apache Spark to > provide meaningful results. > Can anyone help with regard to whether I have a bug in my code, or whether > something is missing in my data clustering pipeline? > (Question also asked on Stack Overflow: > http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one) > Thank you in advance! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
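[Editor's note] One detail in the quoted pipeline worth flagging: numFeatures=20 is drastically below HashingTF's default of 2^18, so essentially the whole vocabulary collides into 20 hash buckets, which by itself can collapse the clustering. This is a plausible culprit, not a confirmed diagnosis. A sketch of the same hashing stage in the Scala ML API with the default dimensionality:
{code:scala}
// Sketch of the hashing/IDF stage from the quoted pipeline (Scala ML API),
// using the library default of 2^18 features instead of 20.
import org.apache.spark.ml.feature.{HashingTF, IDF}

val hashingTF = new HashingTF()
  .setInputCol("stopWordsRemovedTokens")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18) // the default; 20 buckets would alias nearly all terms
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMinDocFreq(5)
{code}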
[jira] [Comment Edited] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569432#comment-16569432 ] Yuval Itzchakov edited comment on SPARK-24987 at 8/5/18 10:58 AM: -- [~c...@koeninger.org] Is there any chance this will make it in time for 2.3.2? This is a critical fix for us (or more generally for all users of the Kafka source). was (Author: yuval.itzchakov): [~c...@koeninger.org] Is there any chance this will make it in time for 2.3.2? This is a critical fix for us. > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuval Itzchakov updated SPARK-24987: Shepherd: (was: Tathagata Das) > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569432#comment-16569432 ] Yuval Itzchakov edited comment on SPARK-24987 at 8/5/18 10:54 AM: -- [~c...@koeninger.org] Is there any chance this will make it in time for 2.3.2? This is a critical fix for us. was (Author: yuval.itzchakov): Is there any chance this will make it in time for 2.3.2? This is a critical fix for us. > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569432#comment-16569432 ] Yuval Itzchakov commented on SPARK-24987: - Is there any chance this will make it in time for 2.3.2? This is a critical fix for us. > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25028) AnalyzePartitionCommand fails with NPE if value is null
Izek Greenfield created SPARK-25028: --- Summary: AnalyzePartitionCommand fails with NPE if value is null Key: SPARK-25028 URL: https://issues.apache.org/jira/browse/SPARK-25028 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Izek Greenfield On line 143, val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) will fail with an NPE if the value is NULL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
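[Editor's note] A null-safe rewrite of that line, as a sketch only: it assumes r is a Row and partitionColumns is the same sequence as in the surrounding code, and the placeholder used for null is a design choice for the actual fix (the Hive default-partition string is shown as one convention):
{code:scala}
// Hypothetical null-safe variant of the line quoted above: check for null
// before calling toString. The placeholder string is one possible convention.
val partitionColumnValues = partitionColumns.indices.map { i =>
  if (r.isNullAt(i)) "__HIVE_DEFAULT_PARTITION__" else r.get(i).toString
}
{code}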
[jira] [Commented] (SPARK-23939) High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → map<K2, V>
[ https://issues.apache.org/jira/browse/SPARK-23939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569397#comment-16569397 ] Neha Patil commented on SPARK-23939: I am working on this one and will have a PR ready in a couple of hours. > High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → > map<K2, V> > > > Key: SPARK-23939 > URL: https://issues.apache.org/jira/browse/SPARK-23939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map that applies function to each entry of map and transforms the > keys. > {noformat} > SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1); -- {} > SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) -> > k + 1); -- {2 -> a, 3 -> b, 4 -> c} > SELECT transform_keys(MAP(ARRAY ['a', 'b', 'c'], ARRAY [1, 2, 3]), (k, v) -> > v * v); -- {1 -> 1, 4 -> 2, 9 -> 3} > SELECT transform_keys(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || > CAST(v as VARCHAR)); -- {a1 -> 1, b2 -> 2} > SELECT transform_keys(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {one -> 1.0, > two -> 1.4} > (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k]); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
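[Editor's note] Once this lands, usage from Spark SQL would presumably mirror the Presto semantics above. A speculative sketch, assuming an active SparkSession named spark and that the function keeps the lambda syntax shown in the Presto examples:
{code:scala}
// Speculative usage sketch: mirrors the Presto examples above in Spark SQL.
val df = spark.sql("SELECT transform_keys(map(1, 'a', 2, 'b'), (k, v) -> k + 1) AS m")
df.show() // expected, per the Presto semantics: Map(2 -> a, 3 -> b)
{code}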