[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569675#comment-16569675 ]

Sean Owen commented on SPARK-25029:
---

[~kiszk] and/or [~lrytz] I wonder if this could be related ... if so, it would be due to how methods defined in traits are now encoded in interfaces? https://github.com/janino-compiler/janino/issues/47#issuecomment-410574546

> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.4.0
> Reporter: Sean Owen
> Priority: Major
>
> We actually still have some test failures in the Scala 2.12 build. There seem to be two types. The first is that some tests fail with "TaskNotSerializable" because some code construct now captures a reference to scalatest's AssertionsHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode *** FAILED ***
>   java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper
>   Serialization stack:
>     - object not serializable (class: org.scalatest.Assertions$AssertionsHelper, value: org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear whether closure cleaning in 2.12 could be improved to detect this situation automatically; given that only a handful of tests fail for this reason, it's unlikely to be a systemic problem.
>
> The other error is curiouser. Janino fails to compile generated code in many cases, with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
>   java.lang.RuntimeException: Error while encoding: org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type{code}
>
> I include the full generated code that failed in one case below. There is no {{size()}} in the generated code. It's got to be down to some difference in Scala 2.12, potentially even a Janino problem.
>
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
>   at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
>   at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
>   ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
>   at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
>   at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
>   at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
>   at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
>   at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
>   at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
>   at org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094)
>   at org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4070)
>   at
> {code}
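The trait-encoding hypothesis above can be made concrete with a toy trait; the names below are illustrative only, not the real scala.collection hierarchy:
{code:scala}
// Toy illustration of the 2.11 -> 2.12 trait-encoding change behind the
// hypothesis above (illustrative names; the real clash involves
// scala.collection.TraversableOnce.size()).
trait HasSize {
  def size: Int = 0 // a concrete method defined in a trait
}
class Box extends HasSize

// Scala 2.11: HasSize compiles to an interface with an *abstract* size(),
// plus a static helper in HasSize$class that Box forwards to.
// Scala 2.12: HasSize compiles to a Java interface with a *default*
// (i.e. non-abstract) size() method, and no forwarder is needed.
// A Java-source compiler like Janino resolving `box.size()` can then meet
// the same non-abstract method along two interface paths and report:
// "Two non-abstract methods ... have the same parameter types,
// declaring type and return type".
{code}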
[jira] [Commented] (SPARK-20592) Alter table concatenate is not working as expected.
[ https://issues.apache.org/jira/browse/SPARK-20592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569673#comment-16569673 ]

Yuming Wang commented on SPARK-20592:
---

Spark doesn't support this command: [https://github.com/apache/spark/blob/73dd6cf9b558f9d752e1f3c13584344257ad7863/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L217]

> Alter table concatenate is not working as expected.
> ---
>
> Key: SPARK-20592
> URL: https://issues.apache.org/jira/browse/SPARK-20592
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0, 2.2.1, 2.3.1
> Reporter: Guru Prabhakar Reddy Marthala
> Priority: Major
> Labels: hive, pyspark
>
> Created a table using CTAS from CSV to Parquet. The Parquet table generated numerous small files. Tried ALTER TABLE ... CONCATENATE, but it's not working as expected.
>
> spark.sql("CREATE TABLE flight.flight_data(year INT, month INT, day INT, day_of_week INT, dep_time INT, crs_dep_time INT, arr_time INT, crs_arr_time INT, unique_carrier STRING, flight_num INT, tail_num STRING, actual_elapsed_time INT, crs_elapsed_time INT, air_time INT, arr_delay INT, dep_delay INT, origin STRING, dest STRING, distance INT, taxi_in INT, taxi_out INT, cancelled INT, cancellation_code STRING, diverted INT, carrier_delay STRING, weather_delay STRING, nas_delay STRING, security_delay STRING, late_aircraft_delay STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile")
> spark.sql("LOAD DATA LOCAL INPATH 'i:/2008/2008.csv' INTO TABLE flight.flight_data")
> spark.sql("CREATE TABLE flight.flight_data_pq STORED AS parquet AS SELECT * FROM flight.flight_data")
> spark.sql("CREATE TABLE flight.flight_data_orc STORED AS orc AS SELECT * FROM flight.flight_data")
>
> pyspark.sql.utils.ParseException: u'\nOperation not allowed: alter table concatenate(line 1, pos 0)\n\n== SQL ==\nalter table flight_data.flight_data_pq concatenate\n^^^\n'
>
> Tried both ORC and Parquet formats; it's not working.
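Since the parser rejects the statement outright, small files generally have to be compacted by rewriting the data. A hedged sketch of that workaround (the repartition count and the "_compacted" table name are my inventions, not part of the report):
{code:scala}
// Workaround sketch: rewrite the many small Parquet files into a few larger
// ones, since ALTER TABLE ... CONCATENATE is not in Spark's SQL grammar.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.table("flight.flight_data_pq")
  .repartition(8) // choose the target file count for the table's data volume
  .write
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("flight.flight_data_pq_compacted") // then swap/rename tables
{code}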
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569621#comment-16569621 ]

Kazuaki Ishizaki commented on SPARK-25029:
---

[~srowen] I see. The following parts of the generated code call that method; I will look into it. My first feeling is that the problem may be in the Scala collection library or in the Catalyst Java code generator.
{code}
...
/* 146 */ final int length_1 = MapObjects_loopValue140.size();
...
/* 315 */ final int length_0 = MapObjects_loopValue140.size();
...
{code}
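If the clash really is between the 2.12 collection interfaces rather than anything Spark generates, it should be reproducible outside Spark's codegen entirely. A hypothetical minimal repro, untested, assuming only janino and the Scala 2.12 library on the classpath:
{code:scala}
// Untested sketch: hand Janino a one-method Java class that calls size() on a
// Scala collection interface. If the hypothesis holds, cook() should fail with
// the same "Two non-abstract methods" InternalCompilerException, no Spark needed.
import org.codehaus.janino.SimpleCompiler

object JaninoSizeRepro {
  def main(args: Array[String]): Unit = {
    val compiler = new SimpleCompiler()
    compiler.cook(
      """public class Test {
        |  public static int len(scala.collection.Seq s) { return s.size(); }
        |}""".stripMargin)
    println(compiler.getClassLoader.loadClass("Test")) // unreachable if cook() throws
  }
}
{code}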
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569602#comment-16569602 ]

Stavros Kontopoulos commented on SPARK-25029:
---

[~srowen] I think the reason for this is similar to what I describe here: [https://github.com/apache/spark/pull/22004/files#r207753682]. An analysis of the lambda body could clean this. This is not captured in the design doc, but we can add it later. The missing part is to access all fields of the lambda and do the cleaning for any referenced object, if possible. [~lrytz] I guess this is possible, right?
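For reference, the fields of a 2.12 lambda can be enumerated through the SerializedLambda produced by its writeReplace method; a minimal sketch of that inspection (not Spark's actual cleaner code):
{code:scala}
// Sketch: list what a serializable 2.12 lambda actually captured. Anything
// non-serializable in this list is what sinks the task.
import java.lang.invoke.SerializedLambda

def capturedArgs(closure: AnyRef): Seq[AnyRef] = {
  // LambdaMetafactory-generated serializable lambdas declare writeReplace,
  // which returns a SerializedLambda describing the capture.
  val writeReplace = closure.getClass.getDeclaredMethod("writeReplace")
  writeReplace.setAccessible(true)
  writeReplace.invoke(closure) match {
    case sl: SerializedLambda => (0 until sl.getCapturedArgCount).map(sl.getCapturedArg)
    case _                    => Seq.empty // not an LMF lambda
  }
}

// Example: `helper` stands in for scalatest's AssertionsHelper.
val helper = new Object
val f: Int => Int = x => { helper.hashCode; x + 1 }
capturedArgs(f).foreach(a => println(a.getClass.getName))
{code}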
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569600#comment-16569600 ]

Sean Owen commented on SPARK-25029:
---

One of the 'task not serializable' issues is proving trickier:
{code:java}
[ERROR] runDTUsingStaticMethods(org.apache.spark.mllib.tree.JavaDecisionTreeSuite) Time elapsed: 0.375 s <<< ERROR!
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.mllib.tree.JavaDecisionTreeSuite.runDTUsingStaticMethods(JavaDecisionTreeSuite.java:81)
Caused by: java.io.NotSerializableException: org.apache.spark.ml.tree.impl.RandomForest$
Serialization stack:
- object not serializable (class: org.apache.spark.ml.tree.impl.RandomForest$, value: org.apache.spark.ml.tree.impl.RandomForest$@74899df1)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 7)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.ml.tree.impl.RandomForest$, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/ml/tree/impl/RandomForest$.$anonfun$findBestSplits$21:(Lorg/apache/spark/ml/tree/impl/RandomForest$;Lorg/apache/spark/ml/tree/impl/DecisionTreeMetadata;Lscala/collection/immutable/Map;Lscala/collection/immutable/Map;[[Lorg/apache/spark/ml/tree/Split;ILorg/apache/spark/broadcast/Broadcast;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=7])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$2726/1630334286, org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$2726/1630334286@17e04cc5)
at org.apache.spark.mllib.tree.JavaDecisionTreeSuite.runDTUsingStaticMethods(JavaDecisionTreeSuite.java:81){code}
It looks like something in org.apache.spark.ml.tree.impl.RandomForest.findBestSplits is capturing the containing RandomForest object, which isn't serializable. It's just an object with no fields, so we can trivially make it Serializable to resolve that. But this is a less trivial manifestation than the others.
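A hedged sketch of the shape described above, with toy names rather than the real RandomForest code:
{code:scala}
// Toy version of the capture Sean describes. When a lambda inside an object
// calls one of the object's own (e.g. private) methods, the 2.12 compiler can
// pass the module instance itself as capturedArgs(0), so the stateless object
// still has to be serializable. `extends Serializable` is the trivial fix.
object RandomForestLike extends Serializable {
  private def impurity(x: Int): Int = x * x

  def findBestSplitsLike(rdd: org.apache.spark.rdd.RDD[Int]): Long =
    // this lambda may carry RandomForestLike$ in its SerializedLambda
    rdd.mapPartitions(iter => iter.map(impurity)).count()
}
{code}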
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569592#comment-16569592 ]

Stavros Kontopoulos commented on SPARK-25029:
---

[~srowen] Cool, I will wait for the PR! Regarding the TaskNotSerializable issue, I agree. The specific example above, though, is not using a closure AFAIK; it is an ordinary class passed in some field. Could you point me to a failing test that uses a closure and needs cleaning (I haven't run every possible build, so I might be overlooking some)? I might be able to debug it then. Regarding the Janino thing, it is outside my area too, but compilation still takes Scala-generated classes into consideration. Thus, my understanding is that it is not just about Janino; it is also about how Janino treats/sees Scala-generated classes when it tries to compile the code listed above.
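The LegacyAccumulatorWrapper failure matches that "ordinary class passed in some field" description; a hedged sketch of the shape (hypothetical suite and test, not the actual Spark test code):
{code:scala}
// Sketch of the failing shape: an anonymous class declared inside a scalatest
// suite is an inner class, so serializing it drags in the suite instance and
// its non-serializable AssertionsHelper.
import org.apache.spark.AccumulatorParam
import org.scalatest.FunSuite

class LegacyWrapperLikeSuite extends FunSuite {
  test("accumulator param") {
    // Inner anonymous class: holds an implicit LegacyWrapperLikeSuite.this.
    val param = new AccumulatorParam[Long] {
      override def zero(initialValue: Long): Long = 0L
      override def addInPlace(r1: Long, r2: Long): Long = r1 + r2
    }
    // Typical fix: define `param` in a top-level object (no outer reference).
  }
}
{code}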
[jira] [Assigned] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25029:
---

Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25029:
---

Assignee: Apache Spark
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569591#comment-16569591 ]

Apache Spark commented on SPARK-25029:
---

User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/22004
[jira] [Comment Edited] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569588#comment-16569588 ]

Stavros Kontopoulos edited comment on SPARK-25029 at 8/5/18 11:00 PM:
---

[~srowen] Regarding the first one, I don't think cleaning is invoked anywhere, since the problem is a class capturing the environment (the assertions), unless I am missing something. What we need to verify is how things have changed in Scala 2.12 to make this fail now. [~lrytz] thoughts?

I think the second one comes up with ArrayType only. I got the same code output (in the spark shell) but couldn't find any size method either. What does have a size method are the GenericRow and the Array passed in the following example, which reproduces this easily (RowEncoderSuite):

  val schema = new StructType().add("a", ArrayType(TimestampType))
  val encoder = RowEncoder(schema)
  encoder.toRow(Row(Array("a")))

I am curious how this compares when compiled/run with 2.11. BytecodeUtilsSuite is also broken because it uses asm to check class contents, but with lambdas there are no classes generated (should not be hard to fix).
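A self-contained version of the repro above, with the imports filled in (my additions; this assumes Spark 2.x, where RowEncoder(schema) yields an ExpressionEncoder[Row] with a toRow method):
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{ArrayType, StructType, TimestampType}

object EncoderRepro {
  def main(args: Array[String]): Unit = {
    val schema  = new StructType().add("a", ArrayType(TimestampType))
    val encoder = RowEncoder(schema)
    // On Scala 2.12 this codegen path failed with Janino's
    // "Two non-abstract methods ... size()" InternalCompilerException.
    encoder.toRow(Row(Array("a")))
  }
}
{code}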
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569590#comment-16569590 ]

Sean Owen commented on SPARK-25029:
---

Yep, let me go ahead and open a PR with some fixes to get this going.

For "TaskNotSerializable" – because in at least some of the cases it's clear that the connection to the outer class is unused, in theory something could clean that link from the closure, I think. That said, I think it's OK to chalk this up to differences in how 2.12 compiles closures that might affect users, but not severely; as ever, it's a best practice to design code to capture only what's important anyway.

BytecodeUtilsSuite – yes, I think these tests should be skipped in the context of LMF and 2.12.

Good insight on ArrayType; yes, I noticed it seems to come up when generating an encoder for arrays. I will try to figure out anything else I can, though this much is a little outside my area.
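The "capture only what's important" practice mentioned above, as a small sketch (hypothetical class, not Spark code):
{code:scala}
// Copying a field to a local val keeps `this` out of the closure, so the
// non-serializable enclosing class never travels to the executors.
import org.apache.spark.SparkContext

class Job(sc: SparkContext, threshold: Int) { // Job itself is not Serializable
  def run(): Long = {
    val t = threshold // local copy: the lambda captures only an Int
    sc.parallelize(1 to 100).filter(_ > t).count()
    // filter(_ > threshold) instead would reference this.threshold and
    // capture the whole Job instance.
  }
}
{code}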
[jira] [Comment Edited] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569588#comment-16569588 ] Stavros Kontopoulos edited comment on SPARK-25029 at 8/5/18 10:52 PM: -- [~srowen] Regarding the first one I don't think cleaning is invoked anywhere since the problem is of a class capturing the env (assertions), unless I am missing something. What we need to verify is how things has changed for scala 2.12 and make this now failing. [~lrytz] thoughts? I think the second one comes up with ArrayType only. I got the same code output (in spark shell) but couldnt find any size method too. What has a size method is the GenericRow and Array passed in the following example that reproduced this easily (RowEncoderSuite): val schema = new StructType().add("a", ArrayType(TimestampType)) val encoder = RowEncoder(schema) encoder.toRow(Row(Array("a"))) I am curious how this compares if compiled/run with 2.11. BytecodeUtilsSuite is also broken because it uses asm to check class context, but with lambdas there are no classes generated (should not be hard to fix). was (Author: skonto): [~srowen] Regarding the first one I don't think cleaning is invoked anywhere since the problem is of a class capturing the env (assertions), unless I am missing something. What we need to verify is how things has changed for scala 2.12 and make this now failing. [~lrytz] thoughts? I think the second one comes up with ArrayType only. I got the same code output (in spark shell) but couldnt find any size method too. What has a size method is the GenericRow and Array passed in the following example that reproduced this easily (RowEncoderSuite): val schema = new StructType().add("a", ArrayType(TimestampType)) val encoder = RowEncoder(schema) encoder.toRow(Row(Array("a"))) I am curious how this compares if compiled/run with 2.11 or with a different version of janino. BytecodeUtilsSuite is also broken because it uses asm to check class context, but with lambdas there are no classes generated (should not be hard to fix). > Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods > ..." errors > --- > > Key: SPARK-25029 > URL: https://issues.apache.org/jira/browse/SPARK-25029 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > We actually still have some test failures in the Scala 2.12 build. There seem > to be two types. First are that some tests fail with "TaskNotSerializable" > because some code construct now captures a reference to scalatest's > AssertionHelper. Example: > {code:java} > - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode > *** FAILED *** java.io.NotSerializableException: > org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not > serializable (class: org.scalatest.Assertions$AssertionsHelper, value: > org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code} > These seem generally easy to fix by tweaking the test code. It's not clear if > something about closure cleaning in 2.12 could be improved to detect this > situation automatically; given that yet only a handful of tests fail for this > reason, it's unlikely to be a systemic problem. > > The other error is curioser. 
Janino fails to compile generate code in many > cases with errors like: > {code:java} > - encode/decode for seq of string: List(abc, xyz) *** FAILED *** > java.lang.RuntimeException: Error while encoding: > org.codehaus.janino.InternalCompilerException: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Two non-abstract methods "public int scala.collection.TraversableOnce.size()" > have the same parameter types, declaring type and return type{code} > > I include the full generated code that failed in one case below. There is no > {{size()}} in the generated code. It's got to be down to some difference in > Scala 2.12, potentially even a Janino problem. > > {code:java} > Caused by: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Two non-abstract methods "public int > scala.collection.TraversableOnce.size()" have the same parameter types, > declaring type and return type > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at
[jira] [Comment Edited] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569588#comment-16569588 ] Stavros Kontopoulos edited comment on SPARK-25029 at 8/5/18 10:43 PM: --

[~srowen] Regarding the first one, I don't think closure cleaning is invoked anywhere, since the problem is a class capturing its environment (the assertions), unless I am missing something. What we need to verify is what has changed in Scala 2.12 to make this fail now. [~lrytz] thoughts?

I think the second one comes up with ArrayType only. I got the same code output (in the spark shell) but couldn't find any size method either. What do have a size method are the GenericRow and the Array passed in the following example, which reproduced this easily (RowEncoderSuite):
{code:scala}
val schema = new StructType().add("a", ArrayType(TimestampType))
val encoder = RowEncoder(schema)
encoder.toRow(Row(Array("a")))
{code}
I am curious how this compares if compiled/run with 2.11 or with a different version of janino. BytecodeUtilsSuite is also broken because it uses asm to check the class context, but with lambdas there are no classes generated (should not be hard to fix).
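For reference, a self-contained version of that repro (a minimal sketch, assuming a Spark 2.4.0-SNAPSHOT build on Scala 2.12; the object name JaninoRepro is made up, and the imports are what RowEncoderSuite already has in scope):
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{ArrayType, StructType, TimestampType}

object JaninoRepro {
  def main(args: Array[String]): Unit = {
    val schema = new StructType().add("a", ArrayType(TimestampType))
    val encoder = RowEncoder(schema)
    // toRow triggers codegen; on the 2.12 build this is where Janino reports
    // "Two non-abstract methods ... TraversableOnce.size() ..." while
    // compiling the generated serializer.
    encoder.toRow(Row(Array("a")))
  }
}
{code}
No SparkSession is needed: RowEncoder and the Catalyst code generator work standalone, which keeps the repro small.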
[jira] [Assigned] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25019: Assignee: Apache Spark

> The published spark sql pom does not exclude the normal version of orc-core
>
> Key: SPARK-25019
> URL: https://issues.apache.org/jira/browse/SPARK-25019
> Project: Spark
> Issue Type: Bug
> Components: Build, SQL
> Affects Versions: 2.4.0
> Reporter: Yin Huai
> Assignee: Apache Spark
> Priority: Critical
>
> I noticed that [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] does not exclude the normal version of orc-core. Comparing with [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] and [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767], we only exclude the normal version of orc-core in the parent pom. So, the problem is that if a developer depends on spark-sql-core directly, orc-core and orc-core-nohive will both be in the dependency list.
[jira] [Assigned] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25019: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569587#comment-16569587 ] Apache Spark commented on SPARK-25019: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22003
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569579#comment-16569579 ] Sean Owen commented on SPARK-25029: --- [~skonto] [~kiszk] [~ueshin] I thought you might be interested in the second problem, with Janino. Still investigating, but I'm kind of stumped on what could even be causing it. Is it worth asking the janino folks?
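One way to take Spark out of the picture before asking the janino folks would be to feed Janino a tiny class that calls size() on a TraversableOnce, which is roughly what the generated serializer does. A speculative probe sketch, assuming Janino 3.x and a Scala 2.12 scala-library on the classpath (the class name Probe is made up, and whether this minimal case actually trips the bug is an open question):
{code:scala}
import org.codehaus.janino.SimpleCompiler

object JaninoProbe {
  def main(args: Array[String]): Unit = {
    val compiler = new SimpleCompiler()
    // Let Janino resolve scala.collection.TraversableOnce from our classpath.
    compiler.setParentClassLoader(getClass.getClassLoader)
    compiler.cook(
      """public class Probe {
        |  public static int probe(scala.collection.TraversableOnce t) {
        |    return t.size();  // the call Janino refuses to resolve on 2.12
        |  }
        |}""".stripMargin)
    println(compiler.getClassLoader.loadClass("Probe"))
  }
}
{code}
If the same error appears here, it would point at Janino's method resolution against the 2.12 trait encoding rather than at the generated code itself.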
[jira] [Created] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
Sean Owen created SPARK-25029: - Summary: Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors Key: SPARK-25029 URL: https://issues.apache.org/jira/browse/SPARK-25029 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0 Reporter: Sean Owen

We actually still have some test failures in the Scala 2.12 build. There seem to be two types. First, some tests fail with "TaskNotSerializable" because some code construct now captures a reference to scalatest's AssertionsHelper. Example:
{code:java}
- LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode *** FAILED *** java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not serializable (class: org.scalatest.Assertions$AssertionsHelper, value: org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)
{code}
These seem generally easy to fix by tweaking the test code. It's not clear if something about closure cleaning in 2.12 could be improved to detect this situation automatically; given that only a handful of tests fail for this reason, it's unlikely to be a systemic problem.

The other error is curiouser. Janino fails to compile generated code in many cases with errors like:
{code:java}
- encode/decode for seq of string: List(abc, xyz) *** FAILED *** java.lang.RuntimeException: Error while encoding: org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
{code}
I include the full generated code that failed in one case below. There is no {{size()}} in the generated code. It's got to be down to some difference in Scala 2.12, potentially even a Janino problem.
{code:java}
Caused by: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
... 30 more
Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094)
at org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4070)
at org.codehaus.janino.Java$ConditionalExpression.accept(Java.java:4344)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2559)
at org.codehaus.janino.UnitCompiler.access$2700(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$6.visitLocalVariableDeclarationStatement(UnitCompiler.java:1482)
at org.codehaus.janino.UnitCompiler$6.visitLocalVariableDeclarationStatement(UnitCompiler.java:1466)
at
{code}
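The shape of the first failure, and the usual test-side fix, look roughly like this (illustrative sketch only; the class, field and job are made up, and in a real ScalaTest suite it is the assertion machinery rather than a plain field that gets dragged in):
{code:scala}
import org.apache.spark.sql.SparkSession

class CaptureSketch /* imagine: extends FunSuite, mixing in Assertions */ {
  val expected = 42 // reading a field in a closure captures `this`, the whole suite

  def run(): Unit = {
    val sc = SparkSession.builder().master("local[*]").appName("capture")
      .getOrCreate().sparkContext

    // Under 2.12 a closure like this can serialize the enclosing suite and die
    // with NotSerializableException on AssertionsHelper:
    //   sc.parallelize(1 to 10).foreach(x => assert(x != expected))

    // Typical fix: copy the field into a local val so the closure captures
    // only the value, not the enclosing instance.
    val localExpected = expected
    sc.parallelize(1 to 10).foreach(x => require(x != localExpected))
  }
}
{code}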
[jira] [Comment Edited] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569571#comment-16569571 ] Dongjoon Hyun edited comment on SPARK-25019 at 8/5/18 9:04 PM: --- I'll make a PR soon. There is no reason for the parent pom and sql/core/pom to be different. While a new dependency was being removed earlier, the inheritance of the dependency exclusion was broken.
[jira] [Commented] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569566#comment-16569566 ] Apache Spark commented on SPARK-23772: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22002

> Provide an option to ignore column of all null values or empty map/array during JSON schema inference
>
> Key: SPARK-23772
> URL: https://issues.apache.org/jira/browse/SPARK-23772
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Xiangrui Meng
> Assignee: Takeshi Yamamuro
> Priority: Major
> Fix For: 2.4.0
>
> It is common to convert data from a JSON source to a structured format periodically. In the initial batch of JSON data, if a field's values are always null, Spark infers this field as StringType. However, in the second batch, a non-null value appears in this field and its type turns out not to be StringType. Then schema merging fails because of the inconsistency. This also applies to empty arrays and empty objects. My proposal is to provide an option in the Spark JSON source to omit those fields until we see a non-null value.
> This is similar to SPARK-12436 but the proposed solution is different.
> cc: [~rxin] [~smilegator]
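The inconsistency described there is easy to see in a local session (a minimal sketch; the object name NullFieldInference is made up, and the behavior shown is exactly what the description states):
{code:scala}
import org.apache.spark.sql.SparkSession

object NullFieldInference {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-infer").getOrCreate()
    import spark.implicits._

    // Batch 1: "b" is always null, so it is inferred as StringType.
    spark.read.json(Seq("""{"a": 1, "b": null}""").toDS()).printSchema()

    // Batch 2: a non-null value reveals the real type (LongType here), so a
    // schema merged across the two batches no longer lines up.
    spark.read.json(Seq("""{"a": 2, "b": 3}""").toDS()).printSchema()

    spark.stop()
  }
}
{code}
For what it's worth, the option that landed for 2.4 is, if I recall correctly, dropFieldIfAllNull, though the message above does not name it.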
[jira] [Commented] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569526#comment-16569526 ] Dongjoon Hyun commented on SPARK-25019: --- Sure, I'll take a look, [~yhuai].
[jira] [Assigned] (SPARK-24819) Fail fast when not enough slots to launch the barrier stage on job submitted
[ https://issues.apache.org/jira/browse/SPARK-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24819: Assignee: Apache Spark

> Fail fast when not enough slots to launch the barrier stage on job submitted
>
> Key: SPARK-24819
> URL: https://issues.apache.org/jira/browse/SPARK-24819
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Jiang Xingbo
> Assignee: Apache Spark
> Priority: Major
>
> Check all the barrier stages when a job is submitted, to see whether any barrier stage requires more slots (to be able to launch all the barrier tasks in the same stage together) than there are currently active slots in the cluster. If the job requires more slots than available (both busy and free slots), fail the job on submit.
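Described concretely, the proposed check boils down to something like the sketch below (purely illustrative; BarrierStageInfo and BarrierSlotChecker are made-up names, not Spark's actual internals):
{code:scala}
final case class BarrierStageInfo(stageId: Int, numTasks: Int)

class BarrierSlotChecker(totalSlots: Int) {
  // Fail fast at submission if any barrier stage needs more slots than the
  // cluster has in total (busy + free), since all barrier tasks in a stage
  // must be launched together.
  def checkOnJobSubmit(barrierStages: Seq[BarrierStageInfo]): Unit =
    barrierStages.foreach { stage =>
      if (stage.numTasks > totalSlots) {
        throw new IllegalArgumentException(
          s"Barrier stage ${stage.stageId} requires ${stage.numTasks} slots " +
            s"but the cluster only has $totalSlots; failing the job on submit.")
      }
    }
}
{code}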
[jira] [Assigned] (SPARK-24819) Fail fast when not enough slots to launch the barrier stage on job submitted
[ https://issues.apache.org/jira/browse/SPARK-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24819: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-24819) Fail fast when there are not enough slots to launch the barrier stage on job submission
[ https://issues.apache.org/jira/browse/SPARK-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569509#comment-16569509 ] Apache Spark commented on SPARK-24819: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/22001 > Fail fast when there are not enough slots to launch the barrier stage on job submission > --- > > Key: SPARK-24819 > URL: https://issues.apache.org/jira/browse/SPARK-24819 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Check all the barrier stages on job submission to see whether any barrier > stage requires more slots (to be able to launch all the barrier tasks in the > same stage together) than there are currently active slots in the cluster. If the job > requires more slots than are available (counting both busy and free slots), fail the job > on submit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
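[Editor's note] As an illustration only, the check described above amounts to something like the following sketch; the names are hypothetical and are not taken from the linked pull request:
{code:scala}
// Hypothetical sketch of the fail-fast check described in this ticket.
// requiredSlots: slots needed to run every task of a barrier stage together;
// totalSlots: all currently active slots in the cluster, busy or free.
def failFastOnBarrierStage(requiredSlots: Int, totalSlots: Int): Unit = {
  if (requiredSlots > totalSlots) {
    throw new IllegalStateException(
      s"Barrier stage requires $requiredSlots slots, but only $totalSlots " +
      "are active in the cluster; failing the job on submit.")
  }
}
{code}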
[jira] [Commented] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569491#comment-16569491 ] Cody Koeninger commented on SPARK-24987: It was merged to branch-2.3 https://github.com/apache/spark/commits/branch-2.3 > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
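[Editor's note] For readers unfamiliar with the consumer pool, here is a simplified sketch of the acquire decision quoted above; the types are illustrative stand-ins, not the real KafkaDataConsumer internals:
{code:scala}
// Simplified, hypothetical model of the caching decision described above.
final class InternalConsumer { var inUse: Boolean = false }
sealed trait DataConsumer
case class CachedKafkaDataConsumer(c: InternalConsumer) extends DataConsumer
case class NonCachedKafkaDataConsumer(c: InternalConsumer) extends DataConsumer

def acquire(cached: Option[InternalConsumer]): DataConsumer = cached match {
  // Cached consumer exists but is still flagged in use: hand out a fresh,
  // non-cached consumer. Per this report, the flag apparently stays set for
  // tasks that finished long ago, so this path keeps creating new consumers.
  case Some(existing) if existing.inUse => NonCachedKafkaDataConsumer(new InternalConsumer)
  case Some(existing) => existing.inUse = true; CachedKafkaDataConsumer(existing)
  case None => NonCachedKafkaDataConsumer(new InternalConsumer)
}
{code}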
[jira] [Updated] (SPARK-25026) Binary releases should contain some copy of compiled external integration modules
[ https://issues.apache.org/jira/browse/SPARK-25026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25026: -- Summary: Binary releases should contain some copy of compiled external integration modules (was: Binary releases don't contain Kafka integration modules) Taking this a slightly different direction: what about including the compiled binary modules in an "external-jars/" dir in the binary release? Right now, these modules aren't in the binary release at all, which seems odd for a full binary release of Spark. That much wouldn't entail any behavior change at all. > Binary releases should contain some copy of compiled external integration > modules > - > > Key: SPARK-25026 > URL: https://issues.apache.org/jira/browse/SPARK-25026 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
[ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569460#comment-16569460 ] rajanimaski commented on SPARK-20696: - [~adiubc] Here's the custom code I developed for Scala k-means that extends the Spark MLlib libraries. Note that some methods are duplicated from the existing Spark k-means because they were not extensible (private, in other words): [https://github.com/rajanim/selective-search/tree/master/src/main/scala/org/sfsu/cs/clustering/kmeans] From what I have heard, PySpark's k-means works as intended, so I am assuming it has its own k-means implementation rather than calling into the Scala k-means libraries. > tf-idf document clustering with K-means in Apache Spark putting points into > one cluster > --- > > Key: SPARK-20696 > URL: https://issues.apache.org/jira/browse/SPARK-20696 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nassir >Priority: Major > > I am trying to do the classic job of clustering text documents by > pre-processing, generating a tf-idf matrix, and then applying K-means. However, > testing this workflow on the classic 20NewsGroup dataset results in most > documents being clustered into one cluster. (I initially tried to > cluster all documents from 6 of the 20 groups, so I expected 6 clusters.) > I am implementing this in Apache Spark as my purpose is to utilise this > technique on millions of documents. Here is the code, written in PySpark on > Databricks: > #declare path to folder containing 6 of 20 news group categories > path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % > MOUNT_NAME > #read all the text files from the 6 folders. Each entity is an entire > document. > text_files = sc.wholeTextFiles(path).cache() > #convert rdd to dataframe > df = text_files.toDF(["filePath", "document"]).cache() > from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer > #tokenize the document text > tokenizer = Tokenizer(inputCol="document", outputCol="tokens") > tokenized = tokenizer.transform(df).cache() > from pyspark.ml.feature import StopWordsRemover > remover = StopWordsRemover(inputCol="tokens", > outputCol="stopWordsRemovedTokens") > stopWordsRemoved_df = remover.transform(tokenized).cache() > hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", > outputCol="rawFeatures", numFeatures=20) > tfVectors = hashingTF.transform(stopWordsRemoved_df).cache() > idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5) > idfModel = idf.fit(tfVectors) > tfIdfVectors = idfModel.transform(tfVectors).cache() > #note that I have also tried to use normalized data, but get the same result > from pyspark.ml.feature import Normalizer > from pyspark.ml.linalg import Vectors > normalizer = Normalizer(inputCol="features", outputCol="normFeatures") > l2NormData = normalizer.transform(tfIdfVectors) > from pyspark.ml.clustering import KMeans > # Trains a KMeans model. > kmeans = KMeans().setK(6).setMaxIter(20) > km_model = kmeans.fit(l2NormData) > clustersTable = km_model.transform(l2NormData) > [output showing most documents get clustered into cluster 0] > ID number_of_documents_in_cluster > 0 3024 > 3 5 > 1 3 > 5 2 > 2 2 > 4 1 > As you can see, most of my data points get clustered into cluster 0, and I > cannot figure out what I am doing wrong, as all the tutorials and code I have > come across online point to using this method.
> I have also tried normalizing the tf-idf matrix before K-means, > but that produces the same result. I know cosine distance is a better > measure to use, but I expected standard K-means in Apache Spark to > provide meaningful results. > Can anyone help with regard to whether I have a bug in my code, or whether > something is missing in my data clustering pipeline? > (Question also asked on Stack Overflow: > http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one) > Thank you in advance! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
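[Editor's note] One detail in the quoted pipeline worth flagging: numFeatures=20 is drastically below HashingTF's default of 2^18, so essentially the whole vocabulary collides into 20 hash buckets, which by itself can collapse the clustering. This is a plausible culprit, not a confirmed diagnosis. A sketch of the same hashing stage in the Scala ML API with the default dimensionality:
{code:scala}
// Sketch of the hashing/IDF stage from the quoted pipeline (Scala ML API),
// using the library default of 2^18 features instead of 20.
import org.apache.spark.ml.feature.{HashingTF, IDF}

val hashingTF = new HashingTF()
  .setInputCol("stopWordsRemovedTokens")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18) // the default; 20 buckets would alias nearly all terms
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMinDocFreq(5)
{code}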
[jira] [Comment Edited] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569432#comment-16569432 ] Yuval Itzchakov edited comment on SPARK-24987 at 8/5/18 10:58 AM: -- [~c...@koeninger.org] Is there any chance this will make it in time for 2.3.2? This is a critical fix for us (or more generally for all users of the Kafka source). was (Author: yuval.itzchakov): [~c...@koeninger.org] Is there any chance this will make it in time for 2.3.2? This is a critical fix for us. > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuval Itzchakov updated SPARK-24987: Shepherd: (was: Tathagata Das) > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569432#comment-16569432 ] Yuval Itzchakov edited comment on SPARK-24987 at 8/5/18 10:54 AM: -- [~c...@koeninger.org] Is there any chance this will make it in time for 2.3.2? This is a critical fix for us. was (Author: yuval.itzchakov): Is there any chance this will make it in time for 2.3.2? This is a critical fix for us. > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569432#comment-16569432 ] Yuval Itzchakov commented on SPARK-24987: - Is there any chance this will make it in time for 2.3.2? This is a critical fix for us. > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Critical > Fix For: 2.4.0 > > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, but haven't > been able to find the root cause yet) where cached consumers remain "in > use" throughout the lifetime of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > This means the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using a file leak detector, attaching it to the > running Executor JVM process. I've dumped the list of open file descriptors, > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FDs used by Kafka > Consumers, indicating that they aren't being closed. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .writeStream > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25028) AnalyzePartitionCommand fails with NPE if value is null
Izek Greenfield created SPARK-25028: --- Summary: AnalyzePartitionCommand fails with NPE if value is null Key: SPARK-25028 URL: https://issues.apache.org/jira/browse/SPARK-25028 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Izek Greenfield On line 143, val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) will fail with an NPE if the value is NULL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
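[Editor's note] A null-safe rewrite of that line, as a sketch only: it assumes r is a Row and partitionColumns is the same sequence as in the surrounding code, and the placeholder used for null is a design choice for the actual fix (the Hive default-partition string is shown as one convention):
{code:scala}
// Hypothetical null-safe variant of the line quoted above: check for null
// before calling toString. The placeholder string is one possible convention.
val partitionColumnValues = partitionColumns.indices.map { i =>
  if (r.isNullAt(i)) "__HIVE_DEFAULT_PARTITION__" else r.get(i).toString
}
{code}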
[jira] [Commented] (SPARK-23939) High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → map<K2, V>
[ https://issues.apache.org/jira/browse/SPARK-23939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569397#comment-16569397 ] Neha Patil commented on SPARK-23939: I am working on this one and will have a PR ready in a couple of hours. > High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → > map<K2, V> > > > Key: SPARK-23939 > URL: https://issues.apache.org/jira/browse/SPARK-23939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map that applies function to each entry of map and transforms the > keys. > {noformat} > SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1); -- {} > SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) -> > k + 1); -- {2 -> a, 3 -> b, 4 -> c} > SELECT transform_keys(MAP(ARRAY ['a', 'b', 'c'], ARRAY [1, 2, 3]), (k, v) -> > v * v); -- {1 -> 1, 4 -> 2, 9 -> 3} > SELECT transform_keys(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || > CAST(v as VARCHAR)); -- {a1 -> 1, b2 -> 2} > SELECT transform_keys(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {one -> 1.0, > two -> 1.4} > (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k]); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
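[Editor's note] Once this lands, usage from Spark SQL would presumably mirror the Presto semantics above. A speculative sketch, assuming an active SparkSession named spark and that the function keeps the lambda syntax shown in the Presto examples:
{code:scala}
// Speculative usage sketch: mirrors the Presto examples above in Spark SQL.
val df = spark.sql("SELECT transform_keys(map(1, 'a', 2, 'b'), (k, v) -> k + 1) AS m")
df.show() // expected, per the Presto semantics: Map(2 -> a, 3 -> b)
{code}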