[jira] [Commented] (SPARK-44212) Upgrade netty dependencies to 4.1.94.Final due to CVE-2023-34462
[ https://issues.apache.org/jira/browse/SPARK-44212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737755#comment-17737755 ] Kazuaki Ishizaki commented on SPARK-44212: -- [https://github.com/apache/spark/pull/41681#pullrequestreview-1496876723] is discussing the upgrade of netty. > Upgrade netty dependencies to 4.1.94.Final due to CVE-2023-34462 > > > Key: SPARK-44212 > URL: https://issues.apache.org/jira/browse/SPARK-44212 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Raúl Cumplido >Priority: Major > > Hi, > On the Apache Arrow project we have noticed that our nightly integration > tests with Spark started failing lately. With some investigation I've noticed > that we are defining a different version of the Java netty dependencies. We > upgraded to 4.1.94.Final due to the CVE in the title: > [https://github.com/advisories/GHSA-6mjq-h674-j845] > Our PR upgrading the version: [https://github.com/apache/arrow/issues/36209] > I have opened an issue on the Apache Arrow repository to try and fix > something else on our side but I was wondering if you would want to update > the version to solve the CVE. > > Thanks > Raúl -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
Kazuaki Ishizaki created SPARK-33590: Summary: Missing submenus for Performance Tuning in Spark SQL Guide Key: SPARK-33590 URL: https://issues.apache.org/jira/browse/SPARK-33590 Project: Spark Issue Type: Bug Components: docs Affects Versions: 3.0.1, 3.0.0 Reporter: Kazuaki Ishizaki Attachments: image-2020-11-30-00-04-07-969.png Sub-menus for {{Coalesce Hints for SQL Queries}} and {{Adaptive Query Execution}} are missing !image-2020-11-30-00-03-04-814.png!
[jira] [Updated] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-33590: - Attachment: image-2020-11-30-00-04-07-969.png > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {{Coalesce Hints for SQL Queries}} and {{Adaptive Query > Execution}} are missing > !image-2020-11-30-00-03-04-814.png!
[jira] [Updated] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide
[ https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-33590: - Description: Sub-menus for {{Coalesce Hints for SQL Queries}} and {{Adaptive Query Execution}} are missing !image-2020-11-30-00-04-07-969.png! was: Sub-menus for {{Coalesce Hints for SQL Queries}} and {{Adaptive Query Execution}} are missing !image-2020-11-30-00-03-04-814.png! > Missing submenus for Performance Tuning in Spark SQL Guide > --- > > Key: SPARK-33590 > URL: https://issues.apache.org/jira/browse/SPARK-33590 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: Kazuaki Ishizaki >Priority: Minor > Attachments: image-2020-11-30-00-04-07-969.png > > > Sub-menus for {{Coalesce Hints for SQL Queries}} and {{Adaptive Query > Execution}} are missing > !image-2020-11-30-00-04-07-969.png!
[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0
[ https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189784#comment-17189784 ] Kazuaki Ishizaki commented on SPARK-32312: -- I think that [this|https://github.com/apache/arrow/pull/7746] is the work needed to make the build succeed. What further work do we need? > Upgrade Apache Arrow to 1.0.0 > - > > Key: SPARK-32312 > URL: https://issues.apache.org/jira/browse/SPARK-32312 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Apache Arrow will soon release v1.0.0 which provides backward/forward > compatibility guarantees as well as a number of fixes and improvements. This > will upgrade the Java artifact and PySpark API. Although PySpark will not > need special changes, it might be a good idea to bump up the minimum supported > version and CI testing. > TBD: list of important improvements and fixes
[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases
[ https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096707#comment-17096707 ] Kazuaki Ishizaki commented on SPARK-31538: -- Do we want to backport this to 2.4.6 and 3.0? > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases > -- > > Key: SPARK-31538 > URL: https://issues.apache.org/jira/browse/SPARK-31538 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases
[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases
[ https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091699#comment-17091699 ] Kazuaki Ishizaki commented on SPARK-31538: -- We could backport this. On the other hand, this is not a bug fix. As far as I know, this change would not immediately uncover new issues. If we had already found problems related to this, they would have been backported to the 2.4 branch. I think this is a nice-to-have in the maintenance branch. > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases > -- > > Key: SPARK-31538 > URL: https://issues.apache.org/jira/browse/SPARK-31538 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases
[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061914#comment-17061914 ] Kazuaki Ishizaki commented on SPARK-25728: -- For now, there is no update on my side. I am happy if the community or PMCs want to revitalize it. Otherwise, should I close this? > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > This JIRA entry is to start a discussion about adding a structured intermediate > representation for generating Java code from a program using the DataFrame or > Dataset API, in addition to the current String-based representation. > This addition is based on the discussions in [a > thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196]. > Please feel free to comment on this JIRA entry or [Google > Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing], > too.
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058174#comment-17058174 ] Kazuaki Ishizaki commented on SPARK-25987: -- Sorry for coming here late; I am busy these days (I hope to look into this by tomorrow). The analysis looks correct, given the recursive call at https://github.com/janino-compiler/janino/blob/ccb4931fd605ed8081839f962b57ac3734db78ee/janino/src/main/java/org/codehaus/janino/CodeContext.java#L385. As [~kabhwan] suggested, adding {{-Xss2m}} is a conservative and safe workaround. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce the columns number (to 10 for example) or do `addSuffix` only once - > it works fine.
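The {{-Xss2m}} workaround suggested above can be sketched as a spark-submit invocation. This is an illustrative example, not from the original thread: the application class and jar names are placeholders, while {{spark.driver.extraJavaOptions}} and {{spark.executor.extraJavaOptions}} are standard Spark configuration keys for passing extra JVM flags.

```shell
# Hypothetical spark-submit invocation (class and jar names are placeholders).
# -Xss2m raises the JVM thread stack size to 2MB, giving janino's recursive
# bytecode flow analysis (CodeContext.flowAnalysis) more headroom before a
# java.lang.StackOverflowError is thrown.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Xss2m" \
  --conf "spark.executor.extraJavaOptions=-Xss2m" \
  --class com.example.ManyColumnsApp \
  many-columns-app.jar
```

Because the compilation of generated code can happen on both the driver and the executors, setting the flag on both sides is the conservative choice.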
[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException
[ https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035464#comment-17035464 ] Kazuaki Ishizaki commented on SPARK-30711: -- This exception is shown as an error because, at the point where it occurs, it is not yet decided whether the regular path or the error path will be taken. Let us think about whether we can defer showing this exception until the regular or error path is decided. Of course, this change requires agreement in the community. The way to avoid the bytecode growth is to reduce the size of one SQL statement. In this case, can you reduce the number of WHEN clauses in one SQL statement? > 64KB JVM bytecode limit - janino.InternalCompilerException > -- > > Key: SPARK-30711 > URL: https://issues.apache.org/jira/browse/SPARK-30711 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 > Environment: Windows 10 > Spark 2.4.4 > scalaVersion 2.11.12 > JVM Oracle 1.8.0_221-b11 >Reporter: Frederik Schreiber >Priority: Major > > Exception > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KB at > org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at > org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at > 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at > org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at > org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at >
[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException
[ https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033112#comment-17033112 ] Kazuaki Ishizaki commented on SPARK-30711: -- [~schreiber] Sorry, I made a mistake. This test case passes with master and branch-2.4 on my end. I have one question: which value do you set for {{spark.sql.codegen.fallback}}? The idea of whole-stage codegen is to stop using whole-stage codegen if the generated code is larger than 64KB. For that, [this code|https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L600-L607] catches the {{org.codehaus.janino.InternalCompilerException}} and tries to recompile the code in smaller pieces. > 64KB JVM bytecode limit - janino.InternalCompilerException > -- > > Key: SPARK-30711 > URL: https://issues.apache.org/jira/browse/SPARK-30711 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 > Environment: Windows 10 > Spark 2.4.4 > scalaVersion 2.11.12 > JVM Oracle 1.8.0_221-b11 >Reporter: Frederik Schreiber >Priority: Major > > Exception > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KB at > 
org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at > org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at > org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at > org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at
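The fallback behavior discussed in the comment above can be sketched as follows. This is an illustrative example, assuming a local Spark 2.4.x installation on the PATH; {{spark.sql.codegen.fallback}} and {{spark.sql.codegen.wholeStage}} are the Spark SQL configuration keys involved.

```shell
# Illustrative sketch (assumes spark-shell from a local Spark 2.4.x install).
# With spark.sql.codegen.fallback=true (the default), Spark falls back to a
# non-whole-stage execution path when a generated method would exceed the
# JVM's 64KB bytecode limit, instead of failing the query.
spark-shell --conf spark.sql.codegen.fallback=true
# Inside the shell, confirm the setting:
#   scala> spark.conf.get("spark.sql.codegen.fallback")
# Disabling whole-stage codegen entirely is a coarser workaround:
#   spark-shell --conf spark.sql.codegen.wholeStage=false
```

If the exception in the report is fatal rather than a logged warning, that suggests fallback was disabled or not reached, which is why the comment asks which value the reporter set.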
[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException
[ https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031276#comment-17031276 ] Kazuaki Ishizaki commented on SPARK-30711: -- I am now looking at this with the master branch first. > 64KB JVM bytecode limit - janino.InternalCompilerException > -- > > Key: SPARK-30711 > URL: https://issues.apache.org/jira/browse/SPARK-30711 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 > Environment: Windows 10 > Spark 2.4.4 > scalaVersion 2.11.12 > JVM Oracle 1.8.0_221-b11 >Reporter: Frederik Schreiber >Priority: Major > > Exception > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KB at > org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at > org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at > org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at > org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at >
[jira] [Comment Edited] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException
[ https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029724#comment-17029724 ] Kazuaki Ishizaki edited comment on SPARK-30711 at 2/4/20 10:20 AM: --- In my environment, both v3.0.0-preview `007c873a` and master `6097b343` branches cause the exception. was (Author: kiszk): In my environment, both v3.0.0-preview and master branches cause the exception. > 64KB JVM bytecode limit - janino.InternalCompilerException > -- > > Key: SPARK-30711 > URL: https://issues.apache.org/jira/browse/SPARK-30711 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 > Environment: Windows 10 > Spark 2.4.4 > scalaVersion 2.11.12 > JVM Oracle 1.8.0_221-b11 >Reporter: Frederik Schreiber >Priority: Major > > Exception > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KB at > org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at > org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at > org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at > org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at >
[jira] [Comment Edited] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException
[ https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029724#comment-17029724 ] Kazuaki Ishizaki edited comment on SPARK-30711 at 2/4/20 10:19 AM: --- In my environment, both v3.0.0-preview and master branches cause the exception. was (Author: kiszk): In my environment, both v3.0.0-preview and master branches causes the exception. > 64KB JVM bytecode limit - janino.InternalCompilerException > -- > > Key: SPARK-30711 > URL: https://issues.apache.org/jira/browse/SPARK-30711 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 > Environment: Windows 10 > Spark 2.4.4 > scalaVersion 2.11.12 > JVM Oracle 1.8.0_221-b11 >Reporter: Frederik Schreiber >Priority: Major > > Exception > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KB at > org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at > org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at > org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at > org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at >
[jira] [Commented] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit
[ https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028784#comment-17028784 ] Kazuaki Ishizaki commented on SPARK-22510: -- [~schreiber] Thank you for reporting the problem. Could you please share a program that causes this problem? We would first like to know which operators cause this issue. > Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit > > > Key: SPARK-22510 > URL: https://issues.apache.org/jira/browse/SPARK-22510 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: bulk-closed, releasenotes > > Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant > pool entry limit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
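The 64KB limit discussed in this umbrella comes from the JVM itself: the bytecode of a single method (such as the generated {{processNext()}}) may not exceed 65535 bytes, so a query that funnels many expressions into one generated method eventually fails to compile. As a rough illustration only (this is not Spark's actual code generator; names are hypothetical), the growth can be sketched like this:

```python
# Hypothetical sketch: emit one "processNext"-style body with one statement
# per chained expression, mimicking how whole-stage codegen can concentrate
# a large query into a single JVM method.
def generate_process_next(num_exprs):
    stmts = [f"long v{i} = v{i - 1} * 2 + {i};" for i in range(1, num_exprs + 1)]
    return "long v0 = input;\n" + "\n".join(stmts)

JVM_METHOD_LIMIT = 64 * 1024  # max bytecode bytes per method (JVM spec limit)

small = generate_process_next(10)
large = generate_process_next(5000)
# Source length is only a proxy for bytecode size, but the growth is linear,
# so a wide-enough plan will eventually cross the per-method cap.
print(len(small), len(large), len(large) > JVM_METHOD_LIMIT)
```

Which operators are involved determines how fast the generated method grows, which is why the comment above asks for a reproducing program.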
[jira] [Comment Edited] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920594#comment-16920594 ] Kazuaki Ishizaki edited comment on SPARK-28906 at 9/2/19 8:48 AM: -- For the information on the {{git}} command, the {{.git}} directory is deleted after {{git clone}} is executed. As a result, we cannot get information from the {{git}} command. When I tentatively stop deleting the {{.git}} directory, {{spark-version-info.properties}} can include the correct information like: {code} version=2.3.4 user=ishizaki revision=8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210 branch=HEAD date=2019-09-02T02:31:25Z url=https://gitbox.apache.org/repos/asf/spark.git {code} was (Author: kiszk): For the information on the {{git}} command, the {{.git}} directory is deleted after {{git clone}} is executed. When I tentatively stop deleting the {{.git}} directory, {{spark-version-info.properties}} can include the correct information like: {code} version=2.3.4 user=ishizaki revision=8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210 branch=HEAD date=2019-09-02T02:31:25Z url=https://gitbox.apache.org/repos/asf/spark.git {code} > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows a wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. 
> {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
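The {{spark-version-info.properties}} payloads quoted in this thread are plain key=value lines, where {{user}}, {{revision}}, {{branch}}, and {{url}} are derived from {{git}}. A small hypothetical parser (not part of Spark) makes it easy to spot which git-derived fields a broken build left empty:

```python
# Sketch: parse a spark-version-info.properties payload and flag the fields
# that come from `git` (these are the ones that come out empty when the
# build ran without a .git directory). Field names follow the comment above.
GIT_FIELDS = {"user", "revision", "branch", "url"}

def parse_build_info(text):
    info = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip()
    return info

def missing_git_fields(info):
    return sorted(f for f in GIT_FIELDS if not info.get(f))

broken = parse_build_info("""
version=2.3.4
user=
revision=
branch=
date=2019-08-26T08:29:39Z
url=
""")
print(missing_git_fields(broken))  # every git-derived field is empty
```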
[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919893#comment-16919893 ] Kazuaki Ishizaki commented on SPARK-28906: -- For the user name, we have to pass the {{USER}} environment variable to the docker container at the end of {{do-release-docker.sh}}. I created a patch to fix this. For the other information obtained via the {{git}} command, the {{spark-build-info}} script is executed in the wrong directory (i.e. outside the cloned directory). My guess is that the command is executed under the work directory. I have not created a patch yet. > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows a wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
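The two failure modes in this comment can be sketched outside the release tooling, under the assumption (illustrative names only, not the real {{spark-build-info}} script) that the generator simply writes an empty value whenever a lookup fails:

```python
# Sketch of the reported failure mode: if the git metadata lookup runs in the
# wrong directory (outside the clone), or the user name is never passed into
# the container, each lookup fails and the build-info output falls back to
# empty strings. `lookup` stands in for shelling out to `git`.
def build_info(version, date, lookup, user=""):
    def safe(key):
        try:
            return lookup(key)
        except Exception:
            return ""  # observed behavior: the field is simply left empty
    return {"version": version, "user": user,
            "revision": safe("revision"), "branch": safe("branch"),
            "date": date, "url": safe("url")}

def git_outside_clone(_key):
    raise RuntimeError("fatal: not a git repository")

info = build_info("2.3.4", "2019-08-26T08:29:39Z", git_outside_clone)
print(sorted(k for k, v in info.items() if not v))  # the fields left empty
```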
[jira] [Comment Edited] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919420#comment-16919420 ] Kazuaki Ishizaki edited comment on SPARK-28906 at 8/30/19 10:51 AM: In {{jars/spark-core_2.11-2.3.*.jar}}, {{spark-version-info.properties}} exists. This file is different between 2.3.0 and 2.3.4. This file is generated by `build/spark-build-info`. {code} $ cat spark-version-info.properties.230 version=2.3.0 user=sameera revision=a0d7949896e70f427e7f3942ff340c9484ff0aab branch=master date=2018-02-22T19:24:38Z url=g...@github.com:sameeragarwal/spark.git $ cat spark-version-info.properties.234 version=2.3.4 user= revision= branch= date=2019-08-26T08:29:39Z url= {code} was (Author: kiszk): In {{jars/spark-core_2.11-2.3.*.jar}}, {{spark-version-info.properties}} exists. This file is different between 2.3.0 and 2.3.4. {code} $ cat spark-version-info.properties.230 version=2.3.0 user=sameera revision=a0d7949896e70f427e7f3942ff340c9484ff0aab branch=master date=2018-02-22T19:24:38Z url=g...@github.com:sameeragarwal/spark.git $ cat spark-version-info.properties.234 version=2.3.4 user= revision= branch= date=2019-08-26T08:29:39Z url= {code} > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows a wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. 
> {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919330#comment-16919330 ] Kazuaki Ishizaki commented on SPARK-28906: -- I attached output of 2.3.0 and 2.3.4 in one comment as below. Let me see the script, too. ``` $ spark-2.3.0-bin-hadoop2.6/bin/spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 /_/ Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212 Branch master Compiled by user sameera on 2018-02-22T19:24:38Z Revision a0d7949896e70f427e7f3942ff340c9484ff0aab Url g...@github.com:sameeragarwal/spark.git Type --help for more information. $ spark-2.3.4-bin-hadoop2.6/bin/spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.4 /_/ Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212 Branch Compiled by user on 2019-08-26T08:29:39Z Revision Url Type --help for more information. ``` > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows a wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28891) do-release-docker.sh in master does not work for branch-2.3
[ https://issues.apache.org/jira/browse/SPARK-28891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-28891: - Description: According to [~maropu], [do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh] in master branch worked for 2.3.3 release for branch-2.3. After updates in [this PR|https://github.com/apache/spark/pull/23098], {{do-release-docker.sh}} does not work for branch-2.3 now as shown: {code} ... Checked out revision 35358. Copying release tarballs cp: cannot stat 'pyspark-*': No such file or directory {code} was: According to [~maropu], [do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh] in master branch worked for 2.3.3 release for branch-2.3. After updates in [this PR|https://github.com/apache/spark/pull/23098], {{do-release-docker.sh}} does not work for branch-2.3 now. > do-release-docker.sh in master does not work for branch-2.3 > --- > > Key: SPARK-28891 > URL: https://issues.apache.org/jira/browse/SPARK-28891 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.4 >Reporter: Kazuaki Ishizaki >Priority: Major > > According to [~maropu], > [do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh] > in master branch worked for 2.3.3 release for branch-2.3. > After updates in [this PR|https://github.com/apache/spark/pull/23098], > {{do-release-docker.sh}} does not work for branch-2.3 now as shown: > {code} > ... > Checked out revision 35358. > Copying release tarballs > cp: cannot stat 'pyspark-*': No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
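The {{cp: cannot stat 'pyspark-*'}} error above is the usual behavior of an unmatched shell glob: on branch-2.3 the release script looks for PySpark artifacts that were never produced, so the pattern matches no files and the copy step aborts. A minimal Python analogue of that copy step (the function name is illustrative):

```python
# Sketch: a copy step driven by a glob pattern fails exactly when the
# pattern matches no artifacts, mirroring `cp: cannot stat 'pyspark-*'`.
import glob
import os
import tempfile

def copy_release_tarballs(src_dir, pattern):
    matches = glob.glob(os.path.join(src_dir, pattern))
    if not matches:
        raise FileNotFoundError(f"cannot stat '{pattern}': no matching artifacts")
    return matches  # a real script would now cp/shutil.copy these

with tempfile.TemporaryDirectory() as d:
    # Only a Spark tarball exists; no pyspark-* artifact was built.
    open(os.path.join(d, "spark-2.3.4-bin.tgz"), "w").close()
    try:
        copy_release_tarballs(d, "pyspark-*")
    except FileNotFoundError as e:
        print(e)  # same shape of failure as the cp error in the report
```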
[jira] [Created] (SPARK-28891) do-release-docker.sh in master does not work for branch-2.3
Kazuaki Ishizaki created SPARK-28891: Summary: do-release-docker.sh in master does not work for branch-2.3 Key: SPARK-28891 URL: https://issues.apache.org/jira/browse/SPARK-28891 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.3.4 Reporter: Kazuaki Ishizaki According to [~maropu], [do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh] in master branch worked for 2.3.3 release for branch-2.3. After updates in [this PR|https://github.com/apache/spark/pull/23098], {{do-release-docker.sh}} does not work for branch-2.3 now. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun
[ https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910983#comment-16910983 ] Kazuaki Ishizaki commented on SPARK-28699: -- [~dongjoon] Thank you for pointing out my typo. You are right. I should have said {{2.3.4-rc1}}. Actually, although I did the following, it is not reflected in the repository yet! After fixing this, let me restart the release process for {{2.3.4-rc1}}. {code} Release details: BRANCH: branch-2.3 VERSION: 2.3.4 TAG: v2.3.4-rc1 {code} > Cache an indeterminate RDD could lead to incorrect result while stage rerun > --- > > Key: SPARK-28699 > URL: https://issues.apache.org/jira/browse/SPARK-28699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > Labels: correctness > > Related with SPARK-23207 SPARK-23243 > It's another case for the indeterminate stage/RDD rerun while stage rerun > happened. In the CachedRDDBuilder. > We can reproduce this by the following code, thanks to Tyson for reporting > this! > > {code:scala} > import scala.sys.process._ > import org.apache.spark.TaskContext > val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)} > // kill an executor in the stage that performs repartition(239) > val df = res.repartition(113).cache.repartition(239).map { x => > if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && > TaskContext.get.stageAttemptNumber == 0) { > throw new Exception("pkill -f -n java".!!) > } > x > } > val r2 = df.distinct.count() > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun
[ https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910907#comment-16910907 ] Kazuaki Ishizaki commented on SPARK-28699: -- [~smilegator] Thank you for the cc. I will wait for this to be fixed. I was in the middle of releasing RC1. Thus, there is already a {{2.4.4-rc1}} tag in the [branch-2.3|https://github.com/apache/spark/tree/branch-2.3]. Should I remove this tag and release rc1? Or should I leave this tag and release rc2 instead? > Cache an indeterminate RDD could lead to incorrect result while stage rerun > --- > > Key: SPARK-28699 > URL: https://issues.apache.org/jira/browse/SPARK-28699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > Labels: correctness > > Related with SPARK-23207 SPARK-23243 > It's another case for the indeterminate stage/RDD rerun while stage rerun > happened. In the CachedRDDBuilder, we miss tracking the `isOrderSensitive` > characteristic to the newly created MapPartitionsRDD. > We can reproduce this by the following code, thanks to Tyson for reporting > this! > > {code:scala} > import scala.sys.process._ > import org.apache.spark.TaskContext > val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)} > // kill an executor in the stage that performs repartition(239) > val df = res.repartition(113).cache.repartition(239).map { x => > if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && > TaskContext.get.stageAttemptNumber == 0) { > throw new Exception("pkill -f -n java".!!) > } > x > } > val r2 = df.distinct.count() > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27761) Make UDF nondeterministic by default(?)
[ https://issues.apache.org/jira/browse/SPARK-27761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844396#comment-16844396 ] Kazuaki Ishizaki commented on SPARK-27761: -- IMHO, it looks good. It is fine for UDF writers to keep current default behavior. > Make UDF nondeterministic by default(?) > --- > > Key: SPARK-27761 > URL: https://issues.apache.org/jira/browse/SPARK-27761 > Project: Spark > Issue Type: Brainstorming > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue as a followup from a discussion/question on this PR for an > optimization involving deterministic udf: > https://github.com/apache/spark/pull/24593#pullrequestreview-237361795 > "We even should discuss whether all UDFs must be deterministic or > non-deterministic by default." > Basically today in Spark 2.4, Scala UDFs are marked deterministic by default > and it is implicit. To mark a udf as non deterministic, they need to call > this method asNondeterministic(). > The concern's expressed are that users are not aware of this property and its > implications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27752) Updata lz4-java from 1.5.2 to 1.6.0
Kazuaki Ishizaki created SPARK-27752: Summary: Updata lz4-java from 1.5.2 to 1.6.0 Key: SPARK-27752 URL: https://issues.apache.org/jira/browse/SPARK-27752 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki Update lz4-java that is available from https://github.com/lz4/lz4-java. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27752) Updata lz4-java from 1.5.1 to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-27752: - Summary: Updata lz4-java from 1.5.1 to 1.6.0 (was: Updata lz4-java from 1.5.2 to 1.6.0) > Updata lz4-java from 1.5.1 to 1.6.0 > --- > > Key: SPARK-27752 > URL: https://issues.apache.org/jira/browse/SPARK-27752 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Update lz4-java that is available from https://github.com/lz4/lz4-java. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives
[ https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838229#comment-16838229 ] Kazuaki Ishizaki commented on SPARK-27684: -- Interesting idea > Reduce ScalaUDF conversion overheads for primitives > --- > > Key: SPARK-27684 > URL: https://issues.apache.org/jira/browse/SPARK-27684 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > I believe that we can reduce ScalaUDF overheads when operating over primitive > types. > In [ScalaUDF's > doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991] > we have logic to convert UDF function input types from Catalyst internal > types to Scala types (for example, this is used to convert UTF8Strings to > Java Strings). Similarly, we convert UDF return types. > However, UDF input argument conversion is effectively a no-op for primitive > types because {{CatalystTypeConverters.createToScalaConverter()}} returns > {{identity}} in those cases. UDF result conversion is a little tricker > because {{createToCatalystConverter()}} returns [a > function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413] > that handles {{Option[Primitive]}}, but it might be the case that the > Option-boxing is unusable via ScalaUDF (in which case the conversion truly is > an {{identity}} no-op). > These unnecessary no-op conversions could be quite expensive because each > call involves an index into the {{references}} array to get the converters, a > second index into the converters array to get the correct converter for the > nth input argument, and, finally, the converter invocation itself: > {code:java} > Object project_arg_0 = false ? 
null : ((scala.Function1[]) references[1] /* > converters */)[0].apply(project_value_3);{code} > In these cases, I believe that we can reduce lookup / invocation overheads by > modifying the ScalaUDF code generation to eliminate the conversion calls for > primitives and directly assign the unconverted result, e.g. > {code:java} > Object project_arg_0 = false ? null : project_value_3;{code} > To cleanly handle the case where we have a multi-argument UDF accepting a > mixture of primitive and non-primitive types, we might be able to keep the > {{converters}} array the same size (so indexes stay the same) but omit the > invocation of the converters for the primitive arguments (e.g. {{converters}} > is sparse / contains unused entries in case of primitives). > I spotted this optimization while trying to construct some quick benchmarks > to measure UDF invocation overheads. For example: > {code:java} > spark.udf.register("identity", (x: Int) => x) > sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() > // ~ 52 seconds > sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 > * 1000 * 1000)").rdd.count() // ~84 seconds{code} > I'm curious to see whether the optimization suggested here can close this > performance gap. It'd also be a good idea to construct more principled > microbenchmarks covering multi-argument UDFs, projections involving multiple > UDFs over different input and output types, etc. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
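The converter dispatch that the ticket describes can be modeled outside Spark. The toy code below (illustrative only, not Spark's codegen) keeps the converters array the same size so argument indexes stay stable, but never invokes the converter for primitive slots, which is the elision the ticket proposes:

```python
# Toy model of ScalaUDF argument conversion: each argument slot has a
# converter; primitive slots get an identity converter that does nothing.
# The "eliding" path produces identical results while skipping the no-op
# array lookup and call for primitive slots.
identity = lambda x: x

def make_converters(arg_is_primitive):
    # non-primitive slots get a real conversion (str is a stand-in for
    # e.g. UTF8String -> java.lang.String)
    return [identity if p else str for p in arg_is_primitive]

def invoke_naive(converters, args):
    # always index into the array and call the converter
    return [converters[i](a) for i, a in enumerate(args)]

def invoke_eliding(converters, args, arg_is_primitive):
    # sparse use: the array keeps its shape, but primitive slots are
    # assigned directly, with no converter lookup or call
    return [a if arg_is_primitive[i] else converters[i](a)
            for i, a in enumerate(args)]

prims = [True, False, True]
conv = make_converters(prims)
args = [1, 2.5, 3]
print(invoke_naive(conv, args) == invoke_eliding(conv, args, prims))
```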
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819181#comment-16819181 ] Kazuaki Ishizaki commented on SPARK-27396: -- I have one question regarding low-level API. In my understanding, this SPIP proposes code generation API for each operation at low-level for exploiting columnar storage. How does this SPIP support to store the result of the generated code into columnar storage? In particular, for {{genColumnarCode()}}. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. 
> # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. 
The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that wants rows, which currently is almost > everything. The limitations here are mostly implementation specific. The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating java code, so it bypasses scala’s type checking and > just casts the InternalRow to the desired ColumnarBatch. This makes it > difficult for others to implement the same functionality for different > processing because they can only do it through code generation. There really > is no clean separate path in the code generation for columnar vs row based. > Additionally because it is only supported through code generation if for any > reason code generation would fail there is no backup. This is typically fine > for input formats but can
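Goal 2 above, making columnar-to-row transitions transparent, can be illustrated with a toy model. The class below is a hypothetical stand-in, not Spark's ColumnarBatch API: data lives in contiguous per-column arrays, and a row view materializes one record at a time, mirroring how a row-based stage consumes a columnar input.

```java
// Toy columnar batch (hypothetical, not Spark's ColumnarBatch): columns
// are contiguous arrays, and rows are materialized on demand.
public class ToyColumnarBatch {
    final int[] idColumn;
    final int[] valueColumn;

    ToyColumnarBatch(int[] ids, int[] values) {
        this.idColumn = ids;
        this.valueColumn = values;
    }

    int numRows() {
        return idColumn.length;
    }

    // Row-at-a-time view over the columnar storage.
    int[] getRow(int i) {
        return new int[] { idColumn[i], valueColumn[i] };
    }

    public static void main(String[] args) {
        ToyColumnarBatch batch =
                new ToyColumnarBatch(new int[] {1, 2, 3}, new int[] {10, 20, 30});
        for (int i = 0; i < batch.numRows(); i++) {
            int[] row = batch.getRow(i);
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

A columnar operator would instead loop directly over `idColumn` and `valueColumn`; the SPIP's point is that today only generated code gets to make that choice.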
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817152#comment-16817152 ] Kazuaki Ishizaki commented on SPARK-27396: -- Thank you for sharing low-level APIs for Spark implementors. Could you please share the current thought on the high-level APIs that Spark application developers will use from their Spark applications? > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. > # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. 
If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that wants rows, which currently is almost > everything. The limitations here are mostly implementation specific. 
The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating java code, so it bypasses scala’s type checking and > just casts the InternalRow to the desired ColumnarBatch. This makes it > difficult for others to implement the same functionality for different > processing because they can only do it through code generation. There really > is no clean separate path in the code generation for columnar vs row based. > Additionally because it is only supported through code generation if for any > reason code generation would fail there is no backup. This is typically fine > for input formats but can be problematic when we get into more extensive > processing. > # When caching data it can optionally
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815812#comment-16815812 ] Kazuaki Ishizaki commented on SPARK-27396: -- Sorry for being late to comment since I am traveling. Could you please let us know the proposed API changes regarding Q1-4 and Q1-5? > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. > # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. 
If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that wants rows, which currently is almost > everything. The limitations here are mostly implementation specific. 
The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating java code, so it bypasses scala’s type checking and > just casts the InternalRow to the desired ColumnarBatch. This makes it > difficult for others to implement the same functionality for different > processing because they can only do it through code generation. There really > is no clean separate path in the code generation for columnar vs row based. > Additionally because it is only supported through code generation if for any > reason code generation would fail there is no backup. This is typically fine > for input formats but can be problematic when we get into more extensive > processing. > # When caching data it can optionally be cached in a columnar format if the > input is also columnar.
[jira] [Created] (SPARK-27397) Take care of OpenJ9 in JVM-dependent parts
Kazuaki Ishizaki created SPARK-27397: Summary: Take care of OpenJ9 in JVM-dependent parts Key: SPARK-27397 URL: https://issues.apache.org/jira/browse/SPARK-27397 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki Spark includes multiple pieces of JVM-dependent code, such as {{SizeEstimator}}. Spark currently handles IBM JDK and OpenJDK. OpenJ9 has recently been released, but it is not yet taken into account. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
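A JVM check of the kind {{SizeEstimator}} needs can be sketched with standard system properties. The VM name strings matched below ("OpenJ9", "HotSpot") are assumptions about what the respective VMs typically report via {{java.vm.name}}, not strings taken from Spark's code.

```java
// Sketch of VM detection via the standard java.vm.name system property.
// The matched substrings are assumptions about typical VM self-reporting.
public class JvmDetect {
    static String classify(String vmName) {
        if (vmName == null) return "unknown";
        if (vmName.contains("OpenJ9") || vmName.contains("J9")) return "OpenJ9/J9";
        if (vmName.contains("HotSpot") || vmName.contains("OpenJDK")) return "HotSpot";
        return "unknown";
    }

    public static void main(String[] args) {
        // On a live JVM, the name comes from the java.vm.name property.
        System.out.println(classify(System.getProperty("java.vm.name")));
    }
}
```

Code that sizes objects (as {{SizeEstimator}} does) would branch on such a classification, since OpenJ9's object header and compressed-reference layout can differ from HotSpot's.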
[jira] [Created] (SPARK-26508) Address warning messages in Java by lgtm.com
Kazuaki Ishizaki created SPARK-26508: Summary: Address warning messages in Java by lgtm.com Key: SPARK-26508 URL: https://issues.apache.org/jira/browse/SPARK-26508 Project: Spark Issue Type: Improvement Components: Examples, Spark Core, SQL Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki [lgtm.com|http://lgtm.com] provides automated code review of Java/Python/JavaScript files for OSS projects. [Here|https://lgtm.com/projects/g/apache/spark/alerts/?mode=list=warning] are warning messages regarding Apache Spark project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26463) Use ConfigEntry for hardcoded configs for scheduler categories.
[ https://issues.apache.org/jira/browse/SPARK-26463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730882#comment-16730882 ] Kazuaki Ishizaki commented on SPARK-26463: -- I will work on this > Use ConfigEntry for hardcoded configs for scheduler categories. > --- > > Key: SPARK-26463 > URL: https://issues.apache.org/jira/browse/SPARK-26463 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Make the following hardcoded configs use {{ConfigEntry}}. > {code} > spark.dynamicAllocation > spark.scheduler > spark.rpc > spark.task > spark.speculation > spark.cleaner > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
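The idea behind {{ConfigEntry}}, replacing hardcoded string keys with typed entries that carry a default, can be sketched as follows. The class below only models the concept; it is not Spark's internal ConfigEntry/ConfigBuilder API, and the key and default used in `main` are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Toy typed config entry (not Spark's internal API): a key plus a typed
// default, instead of a hardcoded string and ad-hoc parsing at each use site.
public class ConfigEntrySketch<T> {
    final String key;
    final T defaultValue;

    ConfigEntrySketch(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    @SuppressWarnings("unchecked")
    T get(Map<String, Object> conf) {
        return (T) conf.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        // Hypothetical entry for one of the scheduler-category keys above.
        ConfigEntrySketch<Integer> maxFailures =
                new ConfigEntrySketch<>("spark.task.maxFailures", 4);
        Map<String, Object> conf = new HashMap<>();
        System.out.println(maxFailures.get(conf));   // default applies, prints 4
        conf.put("spark.task.maxFailures", 8);
        System.out.println(maxFailures.get(conf));   // explicit value wins, prints 8
    }
}
```

Centralizing each key in one typed entry is what makes defaults, documentation, and deprecation handling uniform across the `spark.scheduler`, `spark.rpc`, `spark.task`, and similar namespaces listed above.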
[jira] [Commented] (SPARK-26442) Use ConfigEntry for hardcoded configs.
[ https://issues.apache.org/jira/browse/SPARK-26442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730883#comment-16730883 ] Kazuaki Ishizaki commented on SPARK-26442: -- Thank you for updaing them. > Use ConfigEntry for hardcoded configs. > -- > > Key: SPARK-26442 > URL: https://issues.apache.org/jira/browse/SPARK-26442 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > This umbrella JIRA is to make hardcoded configs to use {{ConfigEntry}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26442) Use ConfigEntry for hardcoded configs.
[ https://issues.apache.org/jira/browse/SPARK-26442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730883#comment-16730883 ] Kazuaki Ishizaki edited comment on SPARK-26442 at 12/30/18 4:18 AM: Thank you for updating them. was (Author: kiszk): Thank you for updaing them. > Use ConfigEntry for hardcoded configs. > -- > > Key: SPARK-26442 > URL: https://issues.apache.org/jira/browse/SPARK-26442 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > This umbrella JIRA is to make hardcoded configs to use {{ConfigEntry}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26477) Use ConfigEntry for hardcoded configs for unsafe category.
[ https://issues.apache.org/jira/browse/SPARK-26477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730340#comment-16730340 ] Kazuaki Ishizaki commented on SPARK-26477: -- I will work on this > Use ConfigEntry for hardcoded configs for unsafe category. > -- > > Key: SPARK-26477 > URL: https://issues.apache.org/jira/browse/SPARK-26477 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-24498. -- Resolution: Won't Do > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703556#comment-16703556 ] Kazuaki Ishizaki commented on SPARK-24498: -- I see. Let us close it for now. We may need a strategy to choose the better Java bytecode compiler at runtime. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
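For reference, the JDK compiler discussed here is reachable at runtime through the standard `javax.tools` API. The sketch below compiles a trivial class from a temp file; it only demonstrates that the API is available, while a real codegen pipeline (Spark's, or Janino's alternative) would use in-memory file managers and a class loader. The class name and source string are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Minimal use of the JDK's built-in compiler via javax.tools.
public class JdkCompileSketch {
    static int compile(String className, String source) throws IOException {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            return -1;  // running on a JRE that ships without the compiler
        }
        Path dir = Files.createTempDirectory("codegen");
        Path file = dir.resolve(className + ".java");
        Files.write(file, source.getBytes("UTF-8"));
        // run() returns 0 on successful compilation.
        return compiler.run(null, null, null, file.toString());
    }

    public static void main(String[] args) throws IOException {
        System.out.println(compile("Generated", "public class Generated {}"));
    }
}
```

The null check matters: a plain JRE returns null from `getSystemJavaCompiler()`, which is one practical argument for keeping Janino (bundled as a library) as the default.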
[jira] [Created] (SPARK-26212) Upgrade maven from 3.5.4 to 3.6.0
Kazuaki Ishizaki created SPARK-26212: Summary: Upgrade maven from 3.5.4 to 3.6.0 Key: SPARK-26212 URL: https://issues.apache.org/jira/browse/SPARK-26212 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki Since Maven 3.6.0 was released in October 2018, it would be good to use the latest version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description
[ https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690437#comment-16690437 ] Kazuaki Ishizaki commented on SPARK-24255: -- I do not know an existing library to parse output of {{java -version}}. You may want to know the difference between OpenJDK and Oracle JDK, as shown [here|https://stackoverflow.com/questions/36445502/bash-command-to-check-if-oracle-or-openjdk-java-version-is-installed-on-linux] and [there|https://qiita.com/mao172/items/42aa841280dc5a4e9924]. Output of OpenJDK 12-ea. {code} $ ../OpenJDK-12/java -version openjdk version "12-ea" 2019-03-19 OpenJDK Runtime Environment (build 12-ea+20) OpenJDK 64-Bit Server VM (build 12-ea+20, mixed mode, sharing) $ ../OpenJDK-12/java Version jave.specification.version=12 jave.version=12-ea jave.version.split(".")[0]=12-ea {code} > Require Java 8 in SparkR description > > > Key: SPARK-24255 > URL: https://issues.apache.org/jira/browse/SPARK-24255 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > CRAN checks require that the Java version be set both in package description > and checked during runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
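A runtime version check of the kind CRAN requires can be sketched by parsing the `java.version` property. The handling below of the legacy "1.8.0_181"-style scheme versus the newer "12-ea"/"11.0.2" forms is a heuristic covering the common formats shown in the comment above, not an exhaustive parser, and vendor strings can vary.

```java
// Sketch of extracting a major Java version from java.version-style strings.
// Handles the legacy "1.x" scheme and newer "N", "N-ea", "N.0.x" forms.
public class JavaVersionSketch {
    static int majorVersion(String version) {
        String v = version;
        if (v.startsWith("1.")) {
            v = v.substring(2);             // "1.8.0_181" -> "8.0_181"
        }
        int end = 0;
        while (end < v.length() && Character.isDigit(v.charAt(end))) {
            end++;                          // stop at ".", "-ea", or "_"
        }
        return Integer.parseInt(v.substring(0, end));
    }

    public static void main(String[] args) {
        System.out.println(majorVersion("1.8.0_181"));  // prints 8
        System.out.println(majorVersion("12-ea"));      // prints 12
        System.out.println(majorVersion("11.0.2"));     // prints 11
    }
}
```

A check such as `majorVersion(System.getProperty("java.version")) >= 8` would then enforce the Java 8 requirement at runtime, which is what the SparkR description change is about.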
[jira] [Commented] (SPARK-25776) The disk write buffer size must be greater than 12.
[ https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674442#comment-16674442 ] Kazuaki Ishizaki commented on SPARK-25776: -- Issue resolved by pull request 22754 https://github.com/apache/spark/pull/22754 > The disk write buffer size must be greater than 12. > --- > > Key: SPARK-25776 > URL: https://issues.apache.org/jira/browse/SPARK-25776 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 3.0.0 > > > In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a > record to a spill file with {{ {color:#205081}void write(Object baseObject, > long baseOffset, int recordLength, long keyPrefix{color})}}, > {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} > {color}will be written to the disk write buffer first, and these will take 12 > bytes, so the disk write buffer size must be greater than 12. > If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this > exception info: > _java.lang.ArrayIndexOutOfBoundsException: 10_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer > (UnsafeSorterSpillWriter.java:91)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ > _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25776) The disk write buffer size must be greater than 12.
[ https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25776: - Fix Version/s: 3.0.0 > The disk write buffer size must be greater than 12. > --- > > Key: SPARK-25776 > URL: https://issues.apache.org/jira/browse/SPARK-25776 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 3.0.0 > > > In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a > record to a spill file with {{ {color:#205081}void write(Object baseObject, > long baseOffset, int recordLength, long keyPrefix{color})}}, > {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} > {color}will be written to the disk write buffer first, and these will take 12 > bytes, so the disk write buffer size must be greater than 12. > If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this > exception info: > _java.lang.ArrayIndexOutOfBoundsException: 10_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer > (UnsafeSorterSpillWriter.java:91)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ > _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25776) The disk write buffer size must be greater than 12.
[ https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-25776. -- Resolution: Fixed Assignee: liuxian > The disk write buffer size must be greater than 12. > --- > > Key: SPARK-25776 > URL: https://issues.apache.org/jira/browse/SPARK-25776 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > > In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a > record to a spill file with {{ {color:#205081}void write(Object baseObject, > long baseOffset, int recordLength, long keyPrefix{color})}}, > {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} > {color}will be written to the disk write buffer first, and these will take 12 > bytes, so the disk write buffer size must be greater than 12. > If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this > exception info: > _java.lang.ArrayIndexOutOfBoundsException: 10_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer > (UnsafeSorterSpillWriter.java:91)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ > _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
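The 12-byte minimum in this issue comes from the record header: a 4-byte record length plus an 8-byte key prefix must fit in the buffer before any record data. The sketch below models that arithmetic with a plain byte buffer; it is an illustration, not UnsafeSorterSpillWriter itself.

```java
import java.nio.ByteBuffer;

// Why the disk write buffer must hold at least 12 bytes: each spilled
// record is preceded by a 4-byte record length and an 8-byte key prefix.
public class SpillHeaderSketch {
    static boolean headerFits(int bufferSize) {
        return bufferSize >= Integer.BYTES + Long.BYTES;  // 4 + 8 = 12
    }

    static void writeHeader(byte[] buffer, int recordLength, long keyPrefix) {
        // Throws (BufferOverflowException here, ArrayIndexOutOfBoundsException
        // in the original report) when the buffer is smaller than 12 bytes.
        ByteBuffer.wrap(buffer).putInt(recordLength).putLong(keyPrefix);
    }

    public static void main(String[] args) {
        writeHeader(new byte[12], 256, 0L);  // the header fits exactly
        System.out.println(headerFits(12) + " " + headerFits(10));  // prints "true false"
    }
}
```

With a 10-byte buffer the 8-byte key prefix cannot follow the 4-byte length, which is exactly the `ArrayIndexOutOfBoundsException: 10` shown in the stack trace above.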
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663394#comment-16663394 ] Kazuaki Ishizaki commented on SPARK-25829: -- I am curious about behavior in other systems such as Presto. Here are test cases for [array|https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/type/TestArrayOperators.java] and [map|https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/type/TestMapOperators.java]. > Duplicated map keys are not handled consistently > > > Key: SPARK-25829 > URL: https://issues.apache.org/jira/browse/SPARK-25829 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. > e.g. > {code} > scala> sql("SELECT map(1,2,1,3)[1]").show > +--+ > |map(1, 2, 1, 3)[1]| > +--+ > | 2| > +--+ > {code} > However, this handling is not applied consistently. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663368#comment-16663368 ] Kazuaki Ishizaki commented on SPARK-25829: -- cc [~ueshin] > Duplicated map keys are not handled consistently > > > Key: SPARK-25829 > URL: https://issues.apache.org/jira/browse/SPARK-25829 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. > e.g. > {code} > scala> sql("SELECT map(1,2,1,3)[1]").show > +--+ > |map(1, 2, 1, 3)[1]| > +--+ > | 2| > +--+ > {code} > However, this handling is not applied consistently. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25824) Remove duplicated map entries in `showString`
[ https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663361#comment-16663361 ] Kazuaki Ishizaki commented on SPARK-25824: -- cc [~ueshin] During the implementation of functions regarding array/map in Spark 2.4, the community discussed how the duplicated key should be treated. IIUC, the current SparkSQL does not define the behavior. > Remove duplicated map entries in `showString` > - > > Key: SPARK-25824 > URL: https://issues.apache.org/jira/browse/SPARK-25824 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `showString` doesn't eliminate the duplication. So, it looks different from > the result of `collect` and select from saved rows. > *Spark 2.2.2* > {code} > spark-sql> select map(1,2,1,3); > {1:3} > scala> sql("SELECT map(1,2,1,3)").collect > res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) > scala> sql("SELECT map(1,2,1,3)").show > +---+ > |map(1, 2, 1, 3)| > +---+ > |Map(1 -> 3)| > +---+ > {code} > *Spark 2.3.0 ~ 2.4.0-rc4* > {code} > spark-sql> select map(1,2,1,3); > {1:3} > scala> sql("SELECT map(1,2,1,3)").collect > res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) > scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a") > scala> sql("SELECT * FROM m").show > ++ > | a| > ++ > |[1 -> 3]| > ++ > scala> sql("SELECT map(1,2,1,3)").show > ++ > | map(1, 2, 1, 3)| > ++ > |[1 -> 2, 1 -> 3]| > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
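The `showString` discrepancy above comes down to whether duplicate keys are normalized before rendering: `collect` returns a deduplicated map (`Map(1 -> 3)`), while `show` printed the raw entry sequence (`[1 -> 2, 1 -> 3]`). A minimal sketch of the two behaviors — illustrative only, not Spark's actual `showString` implementation:

```java
// Two ways to render map entries: deduplicated (like collect(), where the
// later duplicate overwrites) versus raw (like the show() output quoted
// above, where both entries survive).
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class MapRender {
    // Deduplicated rendering: put() lets the later entry overwrite.
    public static String renderNormalized(int[][] entries) {
        Map<Integer, Integer> m = new LinkedHashMap<>();
        for (int[] e : entries) m.put(e[0], e[1]);
        StringJoiner j = new StringJoiner(", ", "[", "]");
        m.forEach((k, v) -> j.add(k + " -> " + v));
        return j.toString();
    }

    // Raw rendering: duplicates are printed as-is.
    public static String renderRaw(int[][] entries) {
        StringJoiner j = new StringJoiner(", ", "[", "]");
        for (int[] e : entries) j.add(e[0] + " -> " + e[1]);
        return j.toString();
    }
}
```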
[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649428#comment-16649428 ] Kazuaki Ishizaki commented on SPARK-25728: -- cc: [~viirya], [~cloud_fan], [~mgaido], [~hyukjin.kwon], [~smilegator], [~rednaxelafx] > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > This JIRA entry is to start a discussion about adding structure intermediate > representation for generating Java code from a program using DataFrame or > Dataset API, in addition to the current String-based representation. > This addition is based on the discussions in [a > thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196]. > Please feel free to comment on this JIRA entry or [Google > Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing], > too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25728: - Comment: was deleted (was: This JIRA entry is to start a discussion about adding structure intermediate representation for generating Java code from a program using DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions in [a thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196]. Please feel free to comment on this JIRA entry or [Google Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing], too. ) > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25728: - Description: This JIRA entry is to start a discussion about adding structure intermediate representation for generating Java code from a program using DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions in [a thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196]. Please feel free to comment on this JIRA entry or [Google Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing], too. > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > This JIRA entry is to start a discussion about adding structure intermediate > representation for generating Java code from a program using DataFrame or > Dataset API, in addition to the current String-based representation. > This addition is based on the discussions in [a > thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196]. > Please feel free to comment on this JIRA entry or [Google > Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing], > too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649422#comment-16649422 ] Kazuaki Ishizaki commented on SPARK-25728: -- This JIRA entry is to start a discussion about adding structure intermediate representation for generating Java code from a program using DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions in [a thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196]. Please feel free to comment on this JIRA entry or [Google Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing], too. > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25728: - External issue URL: (was: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing) > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25728: - External issue URL: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing External issue ID: (was: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing) > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25728: - External issue ID: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
Kazuaki Ishizaki created SPARK-25728: Summary: SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code Key: SPARK-25728 URL: https://issues.apache.org/jira/browse/SPARK-25728 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs
[ https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-25497. -- Resolution: Fixed Fix Version/s: 3.0.0 > limit operation within whole stage codegen should not consume all the inputs > > > Key: SPARK-25497 > URL: https://issues.apache.org/jira/browse/SPARK-25497 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > This issue was discovered during https://github.com/apache/spark/pull/21738 . > It turns out that limit is not whole-stage-codegened correctly and always > consume all the inputs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs
[ https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned SPARK-25497: Assignee: Wenchen Fan > limit operation within whole stage codegen should not consume all the inputs > > > Key: SPARK-25497 > URL: https://issues.apache.org/jira/browse/SPARK-25497 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > This issue was discovered during https://github.com/apache/spark/pull/21738 . > It turns out that limit is not whole-stage-codegened correctly and always > consume all the inputs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344 ] Kazuaki Ishizaki edited comment on SPARK-25538 at 10/1/18 5:21 PM: --- This test case does not print {{63}} using master branch. {code} test("test2") { val df = spark.read.parquet("file:///SPARK-25538-repro") val c1 = df.distinct.count val c2 = df.sort("col_0").distinct.count val c3 = df.withColumnRenamed("col_0", "new").distinct.count val c0 = df.count print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n") } c1=64, c2=73, c3=64, c0=123 {code} was (Author: kiszk): This test case does not print {{63}}. {code} test("test2") { val df = spark.read.parquet("file:///SPARK-25538-repro") val c1 = df.distinct.count val c2 = df.sort("col_0").distinct.count val c3 = df.withColumnRenamed("col_0", "new").distinct.count val c0 = df.count print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n") } c1=64, c2=73, c3=64, c0=123 {code} > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Blocker > Labels: correctness > Attachments: SPARK-25538-repro.tgz > > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. 
I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. > Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344 ] Kazuaki Ishizaki commented on SPARK-25538: -- This test case does not print {{63}}. {code} test("test2") { val df = spark.read.parquet("file:///SPARK-25538-repro") val c1 = df.distinct.count val c2 = df.sort("col_0").distinct.count val c3 = df.withColumnRenamed("col_0", "new").distinct.count val c0 = df.count print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n") } c1=64, c2=73, c3=64, c0=123 {code} > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Blocker > Labels: correctness > Attachments: SPARK-25538-repro.tgz > > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. 
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633568#comment-16633568 ] Kazuaki Ishizaki commented on SPARK-25538: -- Thank you. I will check it tonight in Japan. > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > Attachments: SPARK-25538-repro.tgz > > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631605#comment-16631605 ] Kazuaki Ishizaki commented on SPARK-25538: -- Thanks for uploading the schema. Having looked at it, I am still not sure about the cause of this problem. I would appreciate it if you could provide input data that reproduces the problem. > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629281#comment-16629281 ] Kazuaki Ishizaki commented on SPARK-25538: -- Hi [~Steven Rand], would it be possible to share the schema of this DataFrame? > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25487) Refactor PrimitiveArrayBenchmark
[ https://issues.apache.org/jira/browse/SPARK-25487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-25487. -- Resolution: Fixed Assignee: Chenxiao Mao Fix Version/s: 2.5.0 Issue resolved by pull request 22497 https://github.com/apache/spark/pull/22497 > Refactor PrimitiveArrayBenchmark > > > Key: SPARK-25487 > URL: https://issues.apache.org/jira/browse/SPARK-25487 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 2.5.0 > > > Refactor PrimitiveArrayBenchmark to use main method and print the output as a > separate file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code
[ https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621416#comment-16621416 ] Kazuaki Ishizaki commented on SPARK-25432: -- nit: description seems to be in {{environment}} now. > Consider if using standard getOrCreate from PySpark into JVM SparkSession > would simplify code > - > > Key: SPARK-25432 > URL: https://issues.apache.org/jira/browse/SPARK-25432 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 > Environment: As we saw in > [https://github.com/apache/spark/pull/22295/files] the logic can get a bit > out of sync. It _might_ make sense to try and simplify this so there's less > duplicated logic in Python & Scala around session set up. >Reporter: holdenk >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance
[ https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25437: - Comment: was deleted (was: Is such a feature for major release, not for maintenance release?) > Using OpenHashMap replace HashMap improve Encoder Performance > - > > Key: SPARK-25437 > URL: https://issues.apache.org/jira/browse/SPARK-25437 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance
[ https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617102#comment-16617102 ] Kazuaki Ishizaki commented on SPARK-25437: -- Is such a feature intended for a major release rather than for a maintenance release? > Using OpenHashMap replace HashMap improve Encoder Performance > - > > Key: SPARK-25437 > URL: https://issues.apache.org/jira/browse/SPARK-25437 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
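For context on the proposal in SPARK-25437: Spark's `org.apache.spark.util.collection.OpenHashMap` stores keys and values in flat arrays with open addressing, which avoids the per-entry node objects and `Integer` boxing of `java.util.HashMap`. A minimal linear-probing sketch of the idea (illustrative only — not Spark's implementation; no resizing or deletion, and keys are assumed non-negative):

```java
// Minimal open-addressing (linear probing) map for non-negative int keys.
// All state lives in three flat arrays, so a put() allocates nothing,
// unlike java.util.HashMap which allocates a Node and boxes the key/value.
public class IntIntOpenMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;
    private final int capacity;

    public IntIntOpenMap(int capacity) {
        this.capacity = capacity;
        this.keys = new int[capacity];
        this.values = new int[capacity];
        this.used = new boolean[capacity];
    }

    // Find the slot holding `key`, or the first free slot on its probe path.
    private int slot(int key) {
        int i = key % capacity;
        while (used[i] && keys[i] != key) i = (i + 1) % capacity;
        return i;
    }

    public void put(int key, int value) {
        int i = slot(key);
        keys[i] = key;
        values[i] = value;
        used[i] = true;
    }

    public int getOrDefault(int key, int defaultValue) {
        int i = slot(key);
        return (used[i] && keys[i] == key) ? values[i] : defaultValue;
    }
}
```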
[jira] [Created] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method
Kazuaki Ishizaki created SPARK-25444: Summary: Refactor GenArrayData.genCodeToCreateArrayData() method Key: SPARK-25444 URL: https://issues.apache.org/jira/browse/SPARK-25444 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.5.0 Reporter: Kazuaki Ishizaki {{GenArrayData.genCodeToCreateArrayData()}} generates Java code that allocates a temporary Java array in order to build an {{ArrayData}}. This temporary allocation can be eliminated by using {{ArrayData.createArrayData}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
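The allocation pattern SPARK-25444 removes can be shown without any Spark internals (`ArrayData.createArrayData` itself is a Spark-internal API; this sketch only illustrates the shape of the change): instead of filling a scratch buffer and then copying it into the result, allocate the result's backing array once and write elements in place.

```java
// Before/after sketch of eliminating a temporary array. The computed
// element values (i * 2) are arbitrary stand-ins for generated code.
import java.util.Arrays;

public class BuildArray {
    // Before: temporary buffer plus a copy into the "result".
    public static int[] buildViaTemp(int n) {
        int[] tmp = new int[n];
        for (int i = 0; i < n; i++) tmp[i] = i * 2;
        return Arrays.copyOf(tmp, n); // extra allocation and copy
    }

    // After: single allocation, filled directly.
    public static int[] buildDirect(int n) {
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = i * 2;
        return out;
    }
}
```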
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611717#comment-16611717 ] Kazuaki Ishizaki commented on SPARK-20184: -- In {{branch-2.4}}, we still see the performance degradation compared to w/o codegen {code:java} OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz SPARK-20184: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative codegen = T 2915 / 3204 0.0 2915001883.0 1.0X codegen = F 1178 / 1368 0.0 1178020462.0 2.5X {code} > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang >Priority: Major > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
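The `-XX:-DontCompileHugeMethods` observation in SPARK-20184 reflects a HotSpot behavior: the JIT skips compiling methods above a bytecode-size cutoff, so one huge generated method runs interpreted. Spark's general mitigation is to split generated code into smaller methods the JIT can compile individually; the shape of that fix, stripped of codegen, looks like this (illustrative only):

```java
// One big method body versus the same work split into smaller helper
// methods. The computation is identical; only the method boundaries change,
// which is what keeps each piece under the JIT's size cutoff.
public class SplitSum {
    private static long chunkSum(int[] xs, int from, int until) {
        long s = 0;
        for (int i = from; i < until; i++) s += xs[i];
        return s;
    }

    public static long sumSplit(int[] xs) {
        int mid = xs.length / 2;
        return chunkSum(xs, 0, mid) + chunkSum(xs, mid, xs.length);
    }
}
```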
[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches
[ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611633#comment-16611633 ] Kazuaki Ishizaki commented on SPARK-16196: -- [~cloud_fan] The PR in this Jira entry proposes two fixes: # Read data in a table cache directly from columnar storage # Generate code to build a table cache We have already implemented 1., but we have not yet implemented 2. Let us address 2. in the next release. > Optimize in-memory scan performance using ColumnarBatches > - > > Key: SPARK-16196 > URL: https://issues.apache.org/jira/browse/SPARK-16196 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Major > > A simple benchmark such as the following reveals inefficiencies in the > existing in-memory scan implementation: > {code} > spark.range(N) > .selectExpr("id", "floor(rand() * 1) as k") > .createOrReplaceTempView("test") > val ds = spark.sql("select count(k), count(id) from test").cache() > ds.collect() > ds.collect() > {code} > There are many reasons why caching is slow. The biggest is that compression > takes a long time. The second is that there are a lot of virtual function > calls in this hot code path since the rows are processed using iterators. > Further, the rows are converted to and from ByteBuffers, which are slow to > read in general. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611502#comment-16611502 ] Kazuaki Ishizaki commented on SPARK-20184: -- Although I created another JIRA https://issues.apache.org/jira/browse/SPARK-20479, there is no PR. Let me check the performance in 2.4 branch. > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang >Priority: Major > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches
[ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611494#comment-16611494 ] Kazuaki Ishizaki commented on SPARK-16196: -- I see. I will check this. > Optimize in-memory scan performance using ColumnarBatches > - > > Key: SPARK-16196 > URL: https://issues.apache.org/jira/browse/SPARK-16196 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Major > > A simple benchmark such as the following reveals inefficiencies in the > existing in-memory scan implementation: > {code} > spark.range(N) > .selectExpr("id", "floor(rand() * 1) as k") > .createOrReplaceTempView("test") > val ds = spark.sql("select count(k), count(id) from test").cache() > ds.collect() > ds.collect() > {code} > There are many reasons why caching is slow. The biggest is that compression > takes a long time. The second is that there are a lot of virtual function > calls in this hot code path since the rows are processed using iterators. > Further, the rows are converted to and from ByteBuffers, which are slow to > read in general. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
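The issue description above blames virtual function calls and boxing in the row-iterator path, which a columnar batch avoids by iterating a primitive array directly. The sketch below is not Spark code; it is a minimal illustration of that cost difference using plain Java collections.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (not Spark's InMemoryTableScan): the iterator path pays a
// virtual next()/hasNext() call plus Integer unboxing per row, while the
// columnar path is a tight counted loop over a primitive array.
public class ColumnarScan {
    static long sumViaIterator(List<Integer> rows) {
        long sum = 0;
        for (Integer v : rows) {   // virtual call + unboxing per element
            sum += v;
        }
        return sum;
    }

    static long sumColumnar(int[] column) {
        long sum = 0;
        for (int i = 0; i < column.length; i++) {  // JIT-friendly primitive loop
            sum += column[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000;
        List<Integer> rows = new ArrayList<>();
        int[] column = new int[n];
        for (int i = 0; i < n; i++) { rows.add(i); column[i] = i; }
        System.out.println(sumViaIterator(rows) == sumColumnar(column)); // true
    }
}
```

Both loops compute the same sum; the difference is purely in per-element overhead, which is what the proposed columnar read path removes.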
[jira] [Created] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result
Kazuaki Ishizaki created SPARK-25388: Summary: checkEvaluation may miss incorrect nullable of DataType in the result Key: SPARK-25388 URL: https://issues.apache.org/jira/browse/SPARK-25388 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki The current {{checkEvaluation}} may miss an incorrect nullable flag of the result's {{DataType}} in {{checkEvaluationWithUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
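The bug class described above is that a check comparing only evaluated values can pass even when the result reports the wrong nullability. The sketch below is hypothetical (it is not Spark's {{checkEvaluation}}; the {{DataType}} stand-in is invented for illustration) and just shows why the schema flag must be asserted alongside the value.

```java
// Hypothetical sketch: a value-only comparison misses a nullable mismatch,
// so a stricter check must compare the declared nullability as well.
public class NullableCheck {
    static final class DataType {                 // invented stand-in, not Spark's
        final String name; final boolean nullable;
        DataType(String name, boolean nullable) { this.name = name; this.nullable = nullable; }
    }

    // Value-only comparison: lets the nullable bug slip through.
    static boolean valuesMatch(Object actual, Object expected) {
        return actual == null ? expected == null : actual.equals(expected);
    }

    // Stricter check: value AND declared nullability must both agree.
    static boolean matches(Object actual, DataType actualType,
                           Object expected, DataType expectedType) {
        return valuesMatch(actual, expected)
            && actualType.nullable == expectedType.nullable;
    }

    public static void main(String[] args) {
        DataType wrong = new DataType("int", true);   // incorrectly marked nullable
        DataType right = new DataType("int", false);
        System.out.println(valuesMatch(42, 42));             // true: bug undetected
        System.out.println(matches(42, wrong, 42, right));   // false: mismatch caught
    }
}
```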
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604120#comment-16604120 ] Kazuaki Ishizaki commented on SPARK-25317: -- While investigating this issue, I realized that the Java bytecode size of a method can change performance. I guess that this issue is related to JIT method inlining. However, I have not found the root cause yet. [~mgaido] Would it be possible for you to submit a PR to fix this issue? > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > 
Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method
Kazuaki Ishizaki created SPARK-25338: Summary: Several tests miss calling super.afterAll() in their afterAll() method Key: SPARK-25338 URL: https://issues.apache.org/jira/browse/SPARK-25338 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki The following tests under {{external}} may not call {{super.afterAll()}} in their {{afterAll()}} method. {code} external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
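The failure mode listed above is a suite-level cleanup that never runs because an overriding {{afterAll()}} forgets to call {{super.afterAll()}}. The sketch below illustrates the pattern in plain Java (the actual suites use ScalaTest's {{BeforeAndAfterAll}}); the class names here are invented.

```java
// Plain-Java illustration of the super.afterAll() pattern: if the override
// skips the super call, the base class cleanup silently never executes.
public class AfterAllPattern {
    static class BaseSuite {
        boolean baseCleaned = false;
        void afterAll() { baseCleaned = true; }   // e.g. stops a shared resource
    }

    static class GoodSuite extends BaseSuite {
        @Override void afterAll() {
            // suite-specific cleanup would go here ...
            super.afterAll();                     // ... then let the base clean up
        }
    }

    static class LeakySuite extends BaseSuite {
        @Override void afterAll() {
            // forgot super.afterAll(): base resources leak
        }
    }

    public static void main(String[] args) {
        GoodSuite good = new GoodSuite(); good.afterAll();
        LeakySuite leaky = new LeakySuite(); leaky.afterAll();
        System.out.println(good.baseCleaned + " " + leaky.baseCleaned); // true false
    }
}
```

In ScalaTest the usual defensive form is to put the suite's own cleanup in a try block and call super.afterAll() in finally.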
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602567#comment-16602567 ] Kazuaki Ishizaki commented on SPARK-25317: -- I confirmed this performance difference even after adding a warm-up. Let me investigate further. > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > 
Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602506#comment-16602506 ] Kazuaki Ishizaki commented on SPARK-25317: -- Let me run this on 2.3 and master. One question: this benchmark does not have a warm-up loop, so it may also include execution time in the interpreter. Is this behavior intentional? > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > 
Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
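The warm-up question raised in this thread can be sketched in plain Java (no Spark): without warm-up iterations, the first timed runs measure the interpreter and JIT compilation rather than steady-state compiled code. The {{work}} function below is an invented stand-in for {{hashUTF8String}}, not the real Murmur3 implementation.

```java
// Sketch of adding a warm-up phase to a microbenchmark so the timed loop
// measures JIT-compiled steady state, not interpreter + compilation time.
public class WarmupBench {
    static long work(byte[] data) {
        long h = 0;
        for (byte b : data) h = h * 31 + b;   // stand-in for hashUTF8String
        return h;
    }

    public static void main(String[] args) {
        byte[] data = new byte[10_001];
        java.util.Arrays.fill(data, (byte) 'b');

        long sink = 0;                         // consume results so the JIT
        for (int i = 0; i < 10_000; i++) {     // cannot eliminate the calls
            sink += work(data);                // warm-up: trigger compilation
        }

        int numIter = 10;
        long start = System.nanoTime();
        for (int i = 0; i < numIter; i++) sink += work(data);
        long perIterUs = (System.nanoTime() - start) / 1_000 / numIter;
        System.out.println("duration " + perIterUs + " us, sink=" + sink);
    }
}
```

Note that the original snippet also discards the hash results, so a sufficiently aggressive JIT could in principle eliminate the calls entirely; accumulating into a sink, or using a harness such as JMH, avoids both pitfalls.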
[jira] [Updated] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25317: - Description: eThere is a performance regression when calculating hash code for UTF8String: {code:java} test("hashing") { import org.apache.spark.unsafe.hash.Murmur3_x86_32 import org.apache.spark.unsafe.types.UTF8String val hasher = new Murmur3_x86_32(0) val str = UTF8String.fromString("b" * 10001) val numIter = 10 val start = System.nanoTime for (i <- 0 until numIter) { Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) } val duration = (System.nanoTime() - start) / 1000 / numIter println(s"duration $duration us") } {code} To run this test in 2.3, we need to add {code:java} public static int hashUTF8String(UTF8String str, int seed) { return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed); } {code} to 
`Murmur3_x86_32` In my laptop, the result for master vs 2.3 is: 120 us vs 40 us was: There is a performance regression when calculating hash code for UTF8String: {code} test("hashing") { import org.apache.spark.unsafe.hash.Murmur3_x86_32 import org.apache.spark.unsafe.types.UTF8String val hasher = new Murmur3_x86_32(0) val str = UTF8String.fromString("b" * 10001) val numIter = 10 val start = System.nanoTime for (i <- 0 until numIter) { Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) } val duration = (System.nanoTime() - start) / 1000 / numIter println(s"duration $duration us") } {code} To run this test in 2.3, we need to add {code} public static int hashUTF8String(UTF8String str, int seed) { return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed); } {code} to `Murmur3_x86_32` In my laptop, the result for master vs 2.3 is: 120 us vs 40 us > MemoryBlock performance 
regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import
[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25310: - Description: Invoking {{ArraysOverlap}} function with non-nullable array type throws the following error in the code generation phase. {code:java} Code generation of arrays_overlap([1,2,3], [4,5,3]) failed: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305) at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260) {code} > ArraysOverlap may throw a CompileException > -- > > Key: SPARK-25310 > URL: https://issues.apache.org/jira/browse/SPARK-25310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Invoking {{ArraysOverlap}} function with non-nullable array type throws the > following error in the code generation phase. > {code:java} > Code generation of arrays_overlap([1,2,3], [4,5,3]) failed: > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410) > at > 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) > at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305) > at >
[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25310: - Summary: ArraysOverlap may throw a CompileException (was: ArraysOverlap throws an Exception) > ArraysOverlap may throw a CompileException > -- > > Key: SPARK-25310 > URL: https://issues.apache.org/jira/browse/SPARK-25310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25310) ArraysOverlap throws an Exception
Kazuaki Ishizaki created SPARK-25310: Summary: ArraysOverlap throws an Exception Key: SPARK-25310 URL: https://issues.apache.org/jira/browse/SPARK-25310 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25178) Directly ship the StructType objects of the keySchema / valueSchema for xxxHashMapGenerator
[ https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25178: - Summary: Directly ship the StructType objects of the keySchema / valueSchema for xxxHashMapGenerator (was: Use dummy name for xxxHashMapGenerator key/value schema field) > Directly ship the StructType objects of the keySchema / valueSchema for > xxxHashMapGenerator > --- > > Key: SPARK-25178 > URL: https://issues.apache.org/jira/browse/SPARK-25178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > Following SPARK-18952 and SPARK-22273, this ticket proposes to change the > generated field name of the keySchema / valueSchema to a dummy name instead > of using {{key.name}}. > In previous discussion from SPARK-18952's PR [1], it was already suggested > that the field names were being used, so it's not worth capturing the strings > as reference objects here. Josh suggested merging the original fix as-is due > to backportability / pickability concerns. Now that we're coming up to a new > release, this can be revisited. > [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field
[ https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587897#comment-16587897 ] Kazuaki Ishizaki commented on SPARK-25178: -- [~rednaxelafx] Thank you for opening a JIRA entry :) [~smilegator] I can take this. > Use dummy name for xxxHashMapGenerator key/value schema field > - > > Key: SPARK-25178 > URL: https://issues.apache.org/jira/browse/SPARK-25178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > Following SPARK-18952 and SPARK-22273, this ticket proposes to change the > generated field name of the keySchema / valueSchema to a dummy name instead > of using {{key.name}}. > In previous discussion from SPARK-18952's PR [1], it was already suggested > that the field names were being used, so it's not worth capturing the strings > as reference objects here. Josh suggested merging the original fix as-is due > to backportability / pickability concerns. Now that we're coming up to a new > release, this can be revisited. > [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt
[ https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25036: - Description: When compiling with sbt, the following errors occur: There are -two- three types: 1. {{ExprValue.isNull}} is compared with unexpected type. 2. {{match may not be exhaustive}} is detected at {{match}} 3. discarding unmoored doc comment The first one is more serious since it may also generate incorrect code in Spark 2.3. {code:java} [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: match may not be exhaustive. [error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _) [error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): Boolean = (r1, r2) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: match may not be exhaustive. [error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _) [error] [warn] (r1, r2) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: match may not be exhaustive. [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, ArrayData()), (_, _) [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: match may not be exhaustive. 
[error] It would fail on the following inputs: NewFunctionSpec(_, None, Some(_)), NewFunctionSpec(_, Some(_), None) [error] [warn] newFunction match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely always compare unequal [error] [warn] if (eval.isNull != "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (eval.isNull == "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (eval.isNull == "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: match may not be exhaustive. 
[error] It would fail on the following input: Schema((x: org.apache.spark.sql.types.DataType forSome x not in org.apache.spark.sql.types.StructType), _) [error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (inputs.map(_.isNull).forall(_ == "false")) { [error] [warn] {code} {code:java} [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410: discarding unmoored doc comment [error] [warn] /** [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:441: discarding unmoored doc comment [error] [warn] /** [error] [warn] ... [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:440: discarding unmoored doc comment [error] [warn] /** [error] [warn] {code} was: When compiling with sbt, the following errors occur: There are two types: 1. {{ExprValue.isNull}} is compared with unexpected type. 1. {{match may not be exhaustive}} is detected at {{match}} The first one is more serious since it may also generate incorrect code in Spark 2.3. {code} [error] [warn]
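The "unrelated types" warnings above (e.g. on {{eval.isNull != "true"}}) flag comparisons between an {{ExprValue}} object and a {{String}} literal: two unrelated reference types can never compare equal, so the guarded branch condition is effectively constant. The sketch below is not Spark code; the {{ExprValue}} stand-in is invented to show the bug class.

```java
// Sketch of the bug class: comparing an object of an unrelated type against
// a String literal never succeeds, so `!= "true"` is always true and the
// guard is meaningless. The intended check is on the underlying code string.
public class UnrelatedCompare {
    static final class ExprValue {            // hypothetical stand-in
        final String code;
        ExprValue(String code) { this.code = code; }
        // equals() not overridden: an ExprValue can never equal a String
    }

    public static void main(String[] args) {
        ExprValue isNull = new ExprValue("true");
        System.out.println(!isNull.equals("true"));      // true: wrong check, always passes
        System.out.println(isNull.code.equals("true"));  // true: the intended check
    }
}
```

This is why the warning is flagged as the serious case: under Scala 2.12 the comparison silently degrades to a constant, which can make codegen take the wrong branch.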
[jira] [Comment Edited] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt
[ https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575137#comment-16575137 ] Kazuaki Ishizaki edited comment on SPARK-25036 at 8/9/18 5:05 PM: -- Another type of compilation error is found. Added the log to the description was (Author: kiszk): Another type of compilation error is found > Scala 2.12 issues: Compilation error with sbt > - > > Key: SPARK-25036 > URL: https://issues.apache.org/jira/browse/SPARK-25036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > > When compiling with sbt, the following errors occur: > There are two types: > 1. {{ExprValue.isNull}} is compared with unexpected type. > 1. {{match may not be exhaustive}} is detected at {{match}} > The first one is more serious since it may also generate incorrect code in > Spark 2.3. > {code} > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: > match may not be exhaustive. > [error] It would fail on the following inputs: (NumericValueInterval(_, _), > _), (_, NumericValueInterval(_, _)), (_, _) > [error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): > Boolean = (r1, r2) match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: > match may not be exhaustive. > [error] It would fail on the following inputs: (NumericValueInterval(_, _), > _), (_, NumericValueInterval(_, _)), (_, _) > [error] [warn] (r1, r2) match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: > match may not be exhaustive. 
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, > ArrayData()), (_, _) > [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) > match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: > match may not be exhaustive. > [error] It would fail on the following inputs: NewFunctionSpec(_, None, > Some(_)), NewFunctionSpec(_, Some(_), None) > [error] [warn] newFunction match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely always compare unequal > [error] [warn] if (eval.isNull != "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (eval.isNull == "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (eval.isNull == "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: > match may not be exhaustive. 
> [error] It would fail on the following input: Schema((x: > org.apache.spark.sql.types.DataType forSome x not in > org.apache.spark.sql.types.StructType), _) > [error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] > match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (inputs.map(_.isNull).forall(_ == "false")) { > [error] [warn] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Reopened] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt
[ https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reopened SPARK-25036: -- Another type of compilation error is found > Scala 2.12 issues: Compilation error with sbt > - > > Key: SPARK-25036 > URL: https://issues.apache.org/jira/browse/SPARK-25036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > > When compiling with sbt, the following errors occur: > There are two types: > 1. {{ExprValue.isNull}} is compared with unexpected type. > 1. {{match may not be exhaustive}} is detected at {{match}} > The first one is more serious since it may also generate incorrect code in > Spark 2.3. > {code} > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: > match may not be exhaustive. > [error] It would fail on the following inputs: (NumericValueInterval(_, _), > _), (_, NumericValueInterval(_, _)), (_, _) > [error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): > Boolean = (r1, r2) match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: > match may not be exhaustive. > [error] It would fail on the following inputs: (NumericValueInterval(_, _), > _), (_, NumericValueInterval(_, _)), (_, _) > [error] [warn] (r1, r2) match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: > match may not be exhaustive. 
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, > ArrayData()), (_, _) > [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) > match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: > match may not be exhaustive. > [error] It would fail on the following inputs: NewFunctionSpec(_, None, > Some(_)), NewFunctionSpec(_, Some(_), None) > [error] [warn] newFunction match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely always compare unequal > [error] [warn] if (eval.isNull != "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (eval.isNull == "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (eval.isNull == "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: > match may not be exhaustive. 
> [error] It would fail on the following input: Schema((x: > org.apache.spark.sql.types.DataType forSome x not in > org.apache.spark.sql.types.StructType), _) > [error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] > match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (inputs.map(_.isNull).forall(_ == "false")) { > [error] [warn] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
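The "match may not be exhaustive" warnings quoted above are Scala's exhaustiveness checker listing concrete inputs the pattern match does not cover. A minimal sketch of the pattern and the usual fix, using a hypothetical sealed hierarchy standing in for Spark's `ValueInterval` (these are not the real Spark classes):

```scala
// Hypothetical stand-ins for Spark's ValueInterval hierarchy, sealed so the
// compiler can enumerate the cases.
sealed trait ValueInterval
case class NumericValueInterval(min: Double, max: Double) extends ValueInterval
case object DefaultValueInterval extends ValueInterval

object ExhaustiveMatchSketch {
  // Without the final wildcard case, the compiler warns that the match may
  // not be exhaustive: inputs like (DefaultValueInterval, _) are uncovered.
  def isIntersected(r1: ValueInterval, r2: ValueInterval): Boolean = (r1, r2) match {
    case (NumericValueInterval(min1, max1), NumericValueInterval(min2, max2)) =>
      min1 <= max2 && min2 <= max1
    case _ => false // catch-all defines behavior for the remaining inputs
  }

  def main(args: Array[String]): Unit = {
    assert(isIntersected(NumericValueInterval(0, 5), NumericValueInterval(3, 9)))
    assert(!isIntersected(NumericValueInterval(0, 1), DefaultValueInterval))
    println("ok")
  }
}
```

The warning is only a warning under `-Xlint`-style settings, but Spark's sbt build promotes it, which is why it surfaces as `[error] [warn]` in the log.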
[jira] [Commented] (SPARK-25059) Exception while executing an action on DataFrame that read Json
[ https://issues.apache.org/jira/browse/SPARK-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575129#comment-16575129 ] Kazuaki Ishizaki commented on SPARK-25059: -- Thank you for reporting the issue. Could you please try this using Spark 2.3? This is because the community extensively investigated and fixed these issues in Spark 2.3 > Exception while executing an action on DataFrame that read Json > --- > > Key: SPARK-25059 > URL: https://issues.apache.org/jira/browse/SPARK-25059 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 > Environment: AWS EMR 5.8.0 > Spark 2.2.0 > >Reporter: Kunal Goswami >Priority: Major > Labels: Spark-SQL > > When I try to read ~9600 Json files using > {noformat} > val test = spark.read.option("header", true).option("inferSchema", > true).json(paths: _*) {noformat} > > Any action on the above created data frame results in: > {noformat} > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "apply2_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" > of class "org.apache.spark.sql.catalyst.expressions.Generat[73/1850] > pecificUnsafeProjection" grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949) > at org.codehaus.janino.CodeContext.write(CodeContext.java:839) > at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:11081) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4546) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436) > at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370) > at org.codehaus.janino.Java$Block.accept(Java.java:2471) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220) > at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436) > at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370) > at org.codehaus.janino.Java$Block.accept(Java.java:2471) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220) > at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378) > at >
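The "grows beyond 64 KB" failure above stems from a JVM class-file limit: a single method's bytecode may not exceed 65535 bytes, and a projection generated for a very wide schema (here, ~9600 inferred JSON columns) can blow past it. The general mitigation in code generators is to split the statement stream into helper methods. A rough sketch of that idea only; Spark's real mechanism is `CodegenContext.splitExpressions`, and the names and budget below are invented:

```scala
// Illustrative only: pack generated statements into helper-method bodies so
// no single method exceeds a size budget (a crude proxy for bytecode size).
object CodeSplitSketch {
  def splitIntoMethods(statements: Seq[String], budget: Int): Seq[String] = {
    val grouped = scala.collection.mutable.ArrayBuffer.empty[Vector[String]]
    var current = Vector.empty[String]
    var size = 0
    for (s <- statements) {
      if (size + s.length > budget && current.nonEmpty) {
        grouped += current; current = Vector.empty; size = 0
      }
      current :+= s
      size += s.length
    }
    if (current.nonEmpty) grouped += current
    // Each helper stays under the budget; the top-level method calls them in order.
    grouped.zipWithIndex.map { case (body, i) =>
      s"private void apply_$i(InternalRow i) {\n${body.mkString("\n")}\n}"
    }.toSeq
  }

  def main(args: Array[String]): Unit = {
    val methods = splitIntoMethods(Seq("a = 1;", "b = 2;", "c = 3;"), 8)
    assert(methods.length == 2)
    println("ok")
  }
}
```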
[jira] [Updated] (SPARK-25041) genjavadoc-plugin_0.10 is not found with sbt in scala-2.12
[ https://issues.apache.org/jira/browse/SPARK-25041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25041: - Summary: genjavadoc-plugin_0.10 is not found with sbt in scala-2.12 (was: genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12) > genjavadoc-plugin_0.10 is not found with sbt in scala-2.12 > -- > > Key: SPARK-25041 > URL: https://issues.apache.org/jira/browse/SPARK-25041 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > When the master is build with sbt in scala-2.12, the following error occurs: > {code} > [warn]module not found: > com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10 > [warn] public: tried > [warn] > https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom > [warn] Maven2 Local: tried > [warn] > file:/gsa/jpngsa/home/i/s/ishizaki/.m2/repository/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom > [warn] local: tried > [warn] > /gsa/jpngsa/home/i/s/ishizaki/.ivy2/local/com.typesafe.genjavadoc/genjavadoc-plugin_2.12.6/0.10/ivys/ivy.xml > [info] Resolving jline#jline;2.14.3 ... 
> [warn]:: > [warn]:: UNRESOLVED DEPENDENCIES :: > [warn]:: > [warn]:: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not > found > [warn]:: > [warn] > [warn]Note: Unresolved dependencies path: > [warn]com.typesafe.genjavadoc:genjavadoc-plugin_2.12.6:0.10 > (/home/ishizaki/Spark/PR/scala212/spark/project/SparkBuild.scala#L118) > [warn] +- org.apache.spark:spark-tags_2.12:2.4.0-SNAPSHOT > sbt.ResolveException: unresolved dependency: > com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320) > at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191) > at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168) > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156) > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156) > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133) > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57) > at sbt.IvySbt$$anon$4.call(Ivy.scala:65) > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93) > at > xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78) > at > xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97) > at xsbt.boot.Using$.withResource(Using.scala:10) > at xsbt.boot.Using$.apply(Using.scala:9) > at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58) > at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48) > at xsbt.boot.Locks$.apply0(Locks.scala:31) > at xsbt.boot.Locks$.apply(Locks.scala:28) > at sbt.IvySbt.withDefaultLogger(Ivy.scala:65) > at sbt.IvySbt.withIvy(Ivy.scala:128) > at sbt.IvySbt.withIvy(Ivy.scala:125) > at sbt.IvySbt$Module.withModule(Ivy.scala:156) > at sbt.IvyActions$.updateEither(IvyActions.scala:168) > at > sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555) > at > sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551) > at > 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586) > at > sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584) > at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37) > at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589) > at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583) > at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60) > at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606) > at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533) > at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485) > at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) > at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40) > at sbt.std.Transform$$anon$4.work(System.scala:63) > at > sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228) > at > sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228) > at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17) > at sbt.Execute.work(Execute.scala:237) >
[jira] [Created] (SPARK-25041) genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12
Kazuaki Ishizaki created SPARK-25041: Summary: genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12 Key: SPARK-25041 URL: https://issues.apache.org/jira/browse/SPARK-25041 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki When the master is build with sbt in scala-2.12, the following error occurs: {code} [warn] module not found: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10 [warn] public: tried [warn] https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom [warn] Maven2 Local: tried [warn] file:/gsa/jpngsa/home/i/s/ishizaki/.m2/repository/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom [warn] local: tried [warn] /gsa/jpngsa/home/i/s/ishizaki/.ivy2/local/com.typesafe.genjavadoc/genjavadoc-plugin_2.12.6/0.10/ivys/ivy.xml [info] Resolving jline#jline;2.14.3 ... [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] :: [warn] :: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found [warn] :: [warn] [warn] Note: Unresolved dependencies path: [warn] com.typesafe.genjavadoc:genjavadoc-plugin_2.12.6:0.10 (/home/ishizaki/Spark/PR/scala212/spark/project/SparkBuild.scala#L118) [warn]+- org.apache.spark:spark-tags_2.12:2.4.0-SNAPSHOT sbt.ResolveException: unresolved dependency: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320) at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191) at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156) at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133) at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57) at sbt.IvySbt$$anon$4.call(Ivy.scala:65) at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93) at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78) at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97) at xsbt.boot.Using$.withResource(Using.scala:10) at xsbt.boot.Using$.apply(Using.scala:9) at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58) at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48) at xsbt.boot.Locks$.apply0(Locks.scala:31) at xsbt.boot.Locks$.apply(Locks.scala:28) at sbt.IvySbt.withDefaultLogger(Ivy.scala:65) at sbt.IvySbt.withIvy(Ivy.scala:128) at sbt.IvySbt.withIvy(Ivy.scala:125) at sbt.IvySbt$Module.withModule(Ivy.scala:156) at sbt.IvyActions$.updateEither(IvyActions.scala:168) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584) at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583) at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60) at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485) at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40) at sbt.std.Transform$$anon$4.work(System.scala:63) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228) at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17) at sbt.Execute.work(Execute.scala:237) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228) at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159) at sbt.CompletionService$$anon$2.call(CompletionService.scala:28) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at
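The unresolved artifact `genjavadoc-plugin_2.12.6;0.10` reflects how genjavadoc is published: it is a compiler plugin cross-built against the *full* Scala version, so the artifact suffix is `_2.12.6` rather than `_2.12`, and version 0.10 had simply not been published for that Scala patch release. A hedged sbt fragment showing the shape of the dependency (the version number is illustrative; it must be one actually published for the Scala version in use, and this is not the verified Spark fix):

```scala
// Hypothetical SparkBuild.scala fragment: compiler plugins cross-built with
// CrossVersion.full get the full scalaVersion (e.g. _2.12.6) as their suffix.
libraryDependencies += compilerPlugin(
  "com.typesafe.genjavadoc" % "genjavadoc-plugin" % "0.11" cross CrossVersion.full)
```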
[jira] [Created] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt
Kazuaki Ishizaki created SPARK-25036: Summary: Scala 2.12 issues: Compilation error with sbt Key: SPARK-25036 URL: https://issues.apache.org/jira/browse/SPARK-25036 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.4.0 Reporter: Kazuaki Ishizaki When compiling with sbt, the following errors occur: There are two types: 1. {{ExprValue.isNull}} is compared with unexpected type. 1. {{match may not be exhaustive}} is detected at {{match}} The first one is more serious since it may also generate incorrect code in Spark 2.3. {code} [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: match may not be exhaustive. [error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _) [error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): Boolean = (r1, r2) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: match may not be exhaustive. [error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _) [error] [warn] (r1, r2) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: match may not be exhaustive. [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, ArrayData()), (_, _) [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: match may not be exhaustive. 
[error] It would fail on the following inputs: NewFunctionSpec(_, None, Some(_)), NewFunctionSpec(_, Some(_), None) [error] [warn] newFunction match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely always compare unequal [error] [warn] if (eval.isNull != "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (eval.isNull == "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (eval.isNull == "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: match may not be exhaustive. 
[error] It would fail on the following input: Schema((x: org.apache.spark.sql.types.DataType forSome x not in org.apache.spark.sql.types.StructType), _) [error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (inputs.map(_.isNull).forall(_ == "false")) { [error] [warn] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
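The "ExprValue and String are unrelated" warnings arise because `isNull` changed from a plain `String` of generated code to an `ExprValue` object, so comparisons like `eval.isNull == "true"` can never be true regardless of the code they carry. A minimal sketch with a hypothetical stand-in for the codegen class (not Spark's actual `ExprValue`), comparing typed values instead of mixing types:

```scala
// Hypothetical stand-in for codegen's ExprValue; toString yields the
// generated-code snippet, but equality is by the typed value.
final case class ExprValue(code: String) {
  override def toString: String = code
}

object IsNullCheckSketch {
  val TrueLiteral = ExprValue("true")

  // Comparing ExprValue against the raw String "true" would be flagged as
  // unrelated types; comparing against a typed literal is well-defined.
  def isAlwaysNull(isNull: ExprValue): Boolean = isNull == TrueLiteral

  def main(args: Array[String]): Unit = {
    assert(isAlwaysNull(ExprValue("true")))
    assert(!isAlwaysNull(ExprValue("isNull_0")))
    println("ok")
  }
}
```

This is why the first problem type is called out as the serious one in the report: a comparison that silently always (or never) holds can make the code generator take the wrong branch and emit incorrect code.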
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570440#comment-16570440 ] Kazuaki Ishizaki commented on SPARK-25029: -- [~srowen][~skonto] Thank you for your investigations while I am creating scala-2.12 environment ( I still get compilation errors with scala-2.12 using sbt.) I got the situation... It is related to {{default}} method. We may have to update a method lookup algorithm to consider {{default}} in janino. > Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods > ..." errors > --- > > Key: SPARK-25029 > URL: https://issues.apache.org/jira/browse/SPARK-25029 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > We actually still have some test failures in the Scala 2.12 build. There seem > to be two types. First are that some tests fail with "TaskNotSerializable" > because some code construct now captures a reference to scalatest's > AssertionHelper. Example: > {code:java} > - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode > *** FAILED *** java.io.NotSerializableException: > org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not > serializable (class: org.scalatest.Assertions$AssertionsHelper, value: > org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code} > These seem generally easy to fix by tweaking the test code. It's not clear if > something about closure cleaning in 2.12 could be improved to detect this > situation automatically; given that yet only a handful of tests fail for this > reason, it's unlikely to be a systemic problem. > > The other error is curioser. 
Janino fails to compile generate code in many > cases with errors like: > {code:java} > - encode/decode for seq of string: List(abc, xyz) *** FAILED *** > java.lang.RuntimeException: Error while encoding: > org.codehaus.janino.InternalCompilerException: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Two non-abstract methods "public int scala.collection.TraversableOnce.size()" > have the same parameter types, declaring type and return type{code} > > I include the full generated code that failed in one case below. There is no > {{size()}} in the generated code. It's got to be down to some difference in > Scala 2.12, potentially even a Janino problem. > > {code:java} > Caused by: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Two non-abstract methods "public int > scala.collection.TraversableOnce.size()" have the same parameter types, > declaring type and return type > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342) > ... 
30 more > Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract > methods "public int scala.collection.TraversableOnce.size()" have the same > parameter types, declaring type and return type > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112) > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737) > at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070) > at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391) > at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094) > at
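The `default`-method angle mentioned in the comment: Scala 2.12 compiles concrete trait methods to Java interface default methods, so a method such as `TraversableOnce.size()` becomes inheritable along multiple interface paths, and Janino's method-lookup of the time could count the same default method twice. An illustrative Scala sketch of the trait encoding that triggers this (it demonstrates the encoding, not Janino's internals; the type names are invented):

```scala
// In Scala 2.12, the concrete method in this trait compiles to a Java
// interface *default* method. A class mixing in both Sized and a subtrait
// inherits size() along two interface paths; javac and scalac resolve this
// normally, but the Janino version in use reported it as two methods.
trait Sized { def size: Int = 0 }
trait Growable extends Sized
class Buffer(n: Int) extends Growable { override def size: Int = n }

object DefaultMethodSketch {
  def main(args: Array[String]): Unit = {
    assert(new Buffer(3).size == 3)
    assert((new Growable {}).size == 0) // default implementation from the trait
    println("ok")
  }
}
```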
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569621#comment-16569621 ] Kazuaki Ishizaki commented on SPARK-25029: -- [~srowen] I see. The following parts have the method. I will try to see it. My first feeling is that the problem may be in the scala collection library or catalyst Java code generator. {code} ... /* 146 */ final int length_1 = MapObjects_loopValue140.size(); ... /* 315 */ final int length_0 = MapObjects_loopValue140.size(); ... {code} > Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods > ..." errors > --- > > Key: SPARK-25029 > URL: https://issues.apache.org/jira/browse/SPARK-25029 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > We actually still have some test failures in the Scala 2.12 build. There seem > to be two types. First are that some tests fail with "TaskNotSerializable" > because some code construct now captures a reference to scalatest's > AssertionHelper. Example: > {code:java} > - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode > *** FAILED *** java.io.NotSerializableException: > org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not > serializable (class: org.scalatest.Assertions$AssertionsHelper, value: > org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code} > These seem generally easy to fix by tweaking the test code. It's not clear if > something about closure cleaning in 2.12 could be improved to detect this > situation automatically; given that yet only a handful of tests fail for this > reason, it's unlikely to be a systemic problem. > > The other error is curioser. 
Janino fails to compile generated code in many > cases with errors like: > {code:java} > - encode/decode for seq of string: List(abc, xyz) *** FAILED *** > java.lang.RuntimeException: Error while encoding: > org.codehaus.janino.InternalCompilerException: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Two non-abstract methods "public int scala.collection.TraversableOnce.size()" > have the same parameter types, declaring type and return type{code} > > I include the full generated code that failed in one case below. There is no > {{size()}} in the generated code. It's got to be down to some difference in > Scala 2.12, potentially even a Janino problem. > > {code:java} > Caused by: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Two non-abstract methods "public int > scala.collection.TraversableOnce.size()" have the same parameter types, > declaring type and return type > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342) > ... 
30 more > Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract > methods "public int scala.collection.TraversableOnce.size()" have the same > parameter types, declaring type and return type > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112) > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737) > at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070) > at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391) > at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212) > at >
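The Janino error above is about one class inheriting two concrete size() methods with identical signatures. As an aside (this is not Spark or Janino code, and the interface/class names are made up for illustration), plain javac rejects the analogous conflict when a class inherits two unrelated default methods; the sketch below compiles such a source with the system compiler, so it requires a JDK rather than a bare JRE:

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.nio.file.Files;
import java.nio.file.Path;

public class DuplicateMethodDemo {

    // Write `source` to <className>.java in a temp dir and compile it with the
    // system javac. Returns true if compilation FAILS.
    static boolean failsToCompile(String source, String className) throws Exception {
        Path dir = Files.createTempDirectory("dupdemo");
        Path src = dir.resolve(className + ".java");
        Files.write(src, source.getBytes());
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        int rc = javac.run(null, null, null, src.toString(), "-d", dir.toString());
        return rc != 0;
    }

    public static void main(String[] args) throws Exception {
        // Two unrelated concrete size() methods with the same signature --
        // the same shape as the "Two non-abstract methods ... size()" report.
        String conflicting =
            "interface A { default int size() { return 1; } }\n" +
            "interface B { default int size() { return 2; } }\n" +
            "public class C implements A, B {}\n";
        System.out.println(failsToCompile(conflicting, "C")); // prints "true"
    }
}
```

javac reports "inherits unrelated defaults for size()" here, which is the closest javac analogue of the Janino message quoted above.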
[jira] [Created] (SPARK-24962) refactor CodeGenerator.createUnsafeArray
Kazuaki Ishizaki created SPARK-24962: Summary: refactor CodeGenerator.createUnsafeArray Key: SPARK-24962 URL: https://issues.apache.org/jira/browse/SPARK-24962 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki {{CodeGenerator.createUnsafeArray()}} generates code for allocating {{UnsafeArrayData}}. This method could be extended to generate code for allocating either {{UnsafeArrayData}} or {{GenericArrayData}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
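To make the proposal concrete, here is a minimal sketch of the shape of refactoring the issue suggests: one helper that emits allocation code for either array representation. All names and the emitted snippets are hypothetical illustrations, not Spark's actual CodeGenerator API.

```java
public class CreateArraySketch {
    enum ArrayKind { UNSAFE, GENERIC }

    // Emit a Java source snippet (as a string) that allocates an array holder
    // of the requested kind. A code generator would splice this into the
    // generated class body.
    static String createArrayData(ArrayKind kind, String varName, String numElements) {
        if (kind == ArrayKind.UNSAFE) {
            return "UnsafeArrayData " + varName +
                   " = UnsafeArrayData.allocate(" + numElements + ");";
        }
        return "GenericArrayData " + varName +
               " = new GenericArrayData(new Object[" + numElements + "]);";
    }

    public static void main(String[] args) {
        System.out.println(createArrayData(ArrayKind.UNSAFE, "arr0", "n"));
        System.out.println(createArrayData(ArrayKind.GENERIC, "arr1", "n"));
    }
}
```

The point of the refactoring is that callers choose the representation through one parameter instead of duplicating two nearly identical code-emission paths.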
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560600#comment-16560600 ] Kazuaki Ishizaki commented on SPARK-24895: -- [~ericfchang] Thank you very much for your suggestion. As a first step, I created [a PR|https://github.com/apache/spark/pull/21905] to upgrade Maven. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven > repo have mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
> > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
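The failure mode above boils down to a consistency invariant on maven-metadata.xml: every snapshotVersion value should embed the timestamp from the snapshot element, and here the tests/sources entries kept a stale 232410 while jar/pom got 232411. A small sketch of that check using plain string matching (this is illustrative code, not part of Maven or Spark):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SnapshotMetadataCheck {

    // Return the snapshot-version values whose embedded timestamp does not
    // match the canonical <snapshot> timestamp (e.g. "20180723.232411").
    static List<String> mismatched(String timestamp, List<String> values) {
        return values.stream()
                     .filter(v -> !v.contains(timestamp))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
            "2.4.0-20180723.232411-177",   // jar: matches the snapshot timestamp
            "2.4.0-20180723.232411-177",   // pom: matches
            "2.4.0-20180723.232410-177",   // tests jar: stale timestamp
            "2.4.0-20180723.232410-177");  // sources jar: stale timestamp
        System.out.println(mismatched("20180723.232411", values));
    }
}
```

Running a check like this against the metadata quoted above would flag exactly the 232410 entries that broke dependency resolution.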
[jira] [Created] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4
Kazuaki Ishizaki created SPARK-24956: Summary: Upgrade maven from 3.3.9 to 3.5.4 Key: SPARK-24956 URL: https://issues.apache.org/jira/browse/SPARK-24956 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki Maven 3.3.9 is pretty old; it would be good to upgrade to the latest release. As suggested in SPARK-24895, the current Maven version has problems with some plugins. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559987#comment-16559987 ] Kazuaki Ishizaki commented on SPARK-24895: -- I see. Thank you very much. As a first step, I will make a PR to upgrade Maven. BTW, I have no idea how to make sure the Maven central repo works well for now. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559974#comment-16559974 ] Kazuaki Ishizaki commented on SPARK-24895: -- [~yhuai] Thank you. BTW, how can I re-enable spotbugs without this problem? Do you have any suggestions? cc: [~hyukjin.kwon] > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time
[ https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559972#comment-16559972 ] Kazuaki Ishizaki commented on SPARK-24925: -- Do we need a new test case, or is there an existing test case that covers this PR? > input bytesRead metrics fluctuate from time to time > --- > > Key: SPARK-24925 > URL: https://issues.apache.org/jira/browse/SPARK-24925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > Attachments: bytesRead.gif > > > input bytesRead metrics fluctuate from time to time; it is worse when > pushdown is enabled. > Query > {code:java} > CREATE TABLE dev AS > SELECT > ... > FROM lstg_item cold, lstg_item_vrtn v > WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > ... > {code} > Issue > See attached bytesRead.gif: input bytesRead shows 48GB, 52GB, 51GB, 50GB, > 54GB, 53GB ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24841) Memory leak in converting spark dataframe to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-24841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552231#comment-16552231 ] Kazuaki Ishizaki commented on SPARK-24841: -- Thank you for reporting an issue with heap profiling. Would it be possible to post a standalone program that can reproduce this problem? > Memory leak in converting spark dataframe to pandas dataframe > - > > Key: SPARK-24841 > URL: https://issues.apache.org/jira/browse/SPARK-24841 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Running PySpark in standalone mode >Reporter: Piyush Seth >Priority: Minor > > I am running a continuously running application using PySpark. In one of the > operations I have to convert a PySpark data frame to a Pandas data frame using > the toPandas API on the PySpark driver. After running for a while I get a > "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. > I ran this in a loop and could see that the heap memory is > increasing continuously. 
When I ran jmap for the first time I had the > following top rows: > num #instances #bytes class name > -- > 1: 1757 411477568 [J > {color:#FF} *2: 124188 266323152 [C*{color} > 3: 167219 46821320 org.apache.spark.status.TaskDataWrapper > 4: 69683 27159536 [B > 5: 359278 8622672 java.lang.Long > 6: 221808 7097856 > java.util.concurrent.ConcurrentHashMap$Node > 7: 283771 6810504 scala.collection.immutable.$colon$colon > After running several iterations I had the following > num #instances #bytes class name > -- > {color:#FF} *1: 110760 3439887928 [C*{color} > 2: 698 411429088 [J > 3: 238096 6880 org.apache.spark.status.TaskDataWrapper > 4: 68819 24050520 [B > 5: 498308 11959392 java.lang.Long > 6: 292741 9367712 > java.util.concurrent.ConcurrentHashMap$Node > 7: 282878 6789072 scala.collection.immutable.$colon$colon -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
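One way to act on histograms like the two quoted above is to diff them mechanically and flag classes whose footprint grew sharply (here [C, the char[] class). A sketch of that idea, assuming the four-column `rank: instances bytes class` layout of `jmap -histo` output; this is a diagnostic aid written for this report, not part of Spark or PySpark:

```java
import java.util.HashMap;
import java.util.Map;

public class HistoDiff {

    // Parse "  1:  1757  411477568  [J"-style histogram lines into a
    // class-name -> bytes map; lines that do not match the shape are skipped.
    static Map<String, Long> parse(String histo) {
        Map<String, Long> bytesByClass = new HashMap<>();
        for (String line : histo.split("\n")) {
            String[] f = line.trim().split("\\s+");
            if (f.length == 4 && f[0].endsWith(":")) {
                bytesByClass.put(f[3], Long.parseLong(f[2]));
            }
        }
        return bytesByClass;
    }

    // Report classes whose byte footprint grew by at least `factor`
    // between the two snapshots.
    static Map<String, Long> grown(Map<String, Long> before, Map<String, Long> after, double factor) {
        Map<String, Long> out = new HashMap<>();
        after.forEach((cls, bytes) -> {
            Long prev = before.get(cls);
            if (prev != null && bytes >= prev * factor) out.put(cls, bytes);
        });
        return out;
    }

    public static void main(String[] args) {
        // Abbreviated versions of the two snapshots quoted in the report.
        Map<String, Long> before = parse("1: 1757 411477568 [J\n2: 124188 266323152 [C");
        Map<String, Long> after  = parse("1: 110760 3439887928 [C\n2: 698 411429088 [J");
        System.out.println(grown(before, after, 10.0)); // [C grew ~13x; [J did not
    }
}
```

Against the full snapshots in the report this would single out [C (char[]), which matches the reporter's red-highlighted rows.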