[jira] [Commented] (SPARK-44212) Upgrade netty dependencies to 4.1.94.Final due to CVE-2023-34462

2023-06-27 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737755#comment-17737755
 ] 

Kazuaki Ishizaki commented on SPARK-44212:
--

The upgrade of Netty is being discussed in 
[https://github.com/apache/spark/pull/41681#pullrequestreview-1496876723].
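
Until the upgrade lands in a Spark release, downstream builds can pin the 
transitive Netty version themselves. A minimal sbt sketch (the module list is 
an assumption; override whichever io.netty artifacts actually appear on your 
dependency tree):

{code:scala}
// build.sbt -- hypothetical override, not the Spark project's own change.
// Pin every io.netty module that shows up in your dependency tree.
dependencyOverrides ++= Seq(
  "io.netty" % "netty-handler"     % "4.1.94.Final",
  "io.netty" % "netty-codec-http2" % "4.1.94.Final"
)
{code}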

> Upgrade netty dependencies to 4.1.94.Final due to CVE-2023-34462
> 
>
> Key: SPARK-44212
> URL: https://issues.apache.org/jira/browse/SPARK-44212
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Raúl Cumplido
>Priority: Major
>
> Hi,
> On the Apache Arrow project we have noticed that our nightly integration 
> tests with Spark started failing recently. After some investigation I noticed 
> that we define a different version of the Java Netty dependencies: we 
> upgraded to 4.1.94.Final because of the CVE in the title: 
> [https://github.com/advisories/GHSA-6mjq-h674-j845]
> Our PR upgrading the version: [https://github.com/apache/arrow/issues/36209]
> I have opened an issue on the Apache Arrow repository to try to fix 
> something else on our side, but I was wondering if you would want to update 
> the version to resolve the CVE.
>  
> Thanks
> Raúl



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide

2020-11-29 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created SPARK-33590:


 Summary: Missing submenus for Performance Tuning in Spark SQL 
Guide 
 Key: SPARK-33590
 URL: https://issues.apache.org/jira/browse/SPARK-33590
 Project: Spark
  Issue Type: Bug
  Components: docs
Affects Versions: 3.0.1, 3.0.0
Reporter: Kazuaki Ishizaki
 Attachments: image-2020-11-30-00-04-07-969.png

Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query Execution} 
are missing.

!image-2020-11-30-00-03-04-814.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide

2020-11-29 Thread Kazuaki Ishizaki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-33590:
-
Attachment: image-2020-11-30-00-04-07-969.png

> Missing submenus for Performance Tuning in Spark SQL Guide 
> ---
>
> Key: SPARK-33590
> URL: https://issues.apache.org/jira/browse/SPARK-33590
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Kazuaki Ishizaki
>Priority: Minor
> Attachments: image-2020-11-30-00-04-07-969.png
>
>
> Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query 
> Execution} are missing.
> !image-2020-11-30-00-03-04-814.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33590) Missing submenus for Performance Tuning in Spark SQL Guide

2020-11-29 Thread Kazuaki Ishizaki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-33590:
-
Description: 
Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query Execution} 
are missing.

!image-2020-11-30-00-04-07-969.png!

  was:
Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query Execution} 
are missing.

!image-2020-11-30-00-03-04-814.png!


> Missing submenus for Performance Tuning in Spark SQL Guide 
> ---
>
> Key: SPARK-33590
> URL: https://issues.apache.org/jira/browse/SPARK-33590
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Kazuaki Ishizaki
>Priority: Minor
> Attachments: image-2020-11-30-00-04-07-969.png
>
>
> Sub-menus for {Coalesce Hints for SQL Queries} and {Adaptive Query 
> Execution} are missing.
> !image-2020-11-30-00-04-07-969.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-09-02 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189784#comment-17189784
 ] 

Kazuaki Ishizaki commented on SPARK-32312:
--

I think that [this|https://github.com/apache/arrow/pull/7746] is the work 
needed to make the build succeed. What further work do we need?

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow will soon release v1.0.0 which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up minimum supported 
> version and CI testing.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases

2020-04-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096707#comment-17096707
 ] 

Kazuaki Ishizaki commented on SPARK-31538:
--

Do we want to port this to 2.4.6 and 3.0?

> Backport SPARK-25338   Ensure to call super.beforeAll() and 
> super.afterAll() in test cases
> --
>
> Key: SPARK-31538
> URL: https://issues.apache.org/jira/browse/SPARK-31538
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25338       Ensure to call super.beforeAll() and 
> super.afterAll() in test cases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases

2020-04-24 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091699#comment-17091699
 ] 

Kazuaki Ishizaki commented on SPARK-31538:
--

We could backport this. On the other hand, it is not a bug fix, and as far as I 
know this change does not immediately uncover new issues.
If problems related to it had already been found, they would have been 
backported to the 2.4 branch.

I think this is a nice-to-have in the maintenance branch.

> Backport SPARK-25338   Ensure to call super.beforeAll() and 
> super.afterAll() in test cases
> --
>
> Key: SPARK-31538
> URL: https://issues.apache.org/jira/browse/SPARK-31538
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25338       Ensure to call super.beforeAll() and 
> super.afterAll() in test cases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2020-03-18 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061914#comment-17061914
 ] 

Kazuaki Ishizaki commented on SPARK-25728:
--

For now, there is no update on my side.

I would be happy if the community or PMC members want to revitalize it. 
Otherwise, should I close this?

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using the DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns

2020-03-12 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058174#comment-17058174
 ] 

Kazuaki Ishizaki commented on SPARK-25987:
--

Sorry for coming here late; I have been busy these days (I hope to look at this 
by tomorrow).

The analysis looks correct: the stack overflow comes from the recursive call at 
https://github.com/janino-compiler/janino/blob/ccb4931fd605ed8081839f962b57ac3734db78ee/janino/src/main/java/org/codehaus/janino/CodeContext.java#L385.
As [~kabhwan] suggested, adding {{-Xss2m}} is a conservative and safe workaround.
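
As a concrete illustration of that workaround (a sketch only; where the larger 
stack is needed depends on whether the generated code is compiled on the driver 
or on the executors), the executor-side stack can be enlarged through the Spark 
configuration, while the driver-side stack has to be set before the driver JVM 
starts, e.g. with {{--driver-java-options -Xss2m}} on spark-submit:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the -Xss2m workaround. spark.executor.extraJavaOptions only affects
// executor JVMs; if the StackOverflowError happens while the driver compiles the
// generated code, also pass -Xss2m via --driver-java-options to spark-submit.
val spark = SparkSession.builder()
  .appName("xss-workaround-sketch")
  .config("spark.executor.extraJavaOptions", "-Xss2m")
  .getOrCreate()
{code}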



> StackOverflowError when executing many operations on a table with many columns
> --
>
> Key: SPARK-25987
> URL: https://issues.apache.org/jira/browse/SPARK-25987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5
> Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181"
>Reporter: Ivan Tsukanov
>Priority: Major
>
> When I execute
> {code:java}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val columnsCount = 100
> val columns = (1 to columnsCount).map(i => s"col$i")
> val initialData = (1 to columnsCount).map(i => s"val$i")
> val df = spark.createDataFrame(
>   rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))),
>   schema = StructType(columns.map(StructField(_, StringType, true)))
> )
> val addSuffixUDF = udf(
>   (str: String) => str + "_added"
> )
> implicit class DFOps(df: DataFrame) {
>   def addSuffix() = {
> df.select(columns.map(col =>
>   addSuffixUDF(df(col)).as(col)
> ): _*)
>   }
> }
> df.addSuffix().addSuffix().addSuffix().show()
> {code}
> I get
> {code:java}
> An exception or error caused a run to abort.
> java.lang.StackOverflowError
>  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385)
>  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553)
> ...
> {code}
> If I reduce the number of columns (to 10, for example) or call `addSuffix` only 
> once, it works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-12 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035464#comment-17035464
 ] 

Kazuaki Ishizaki commented on SPARK-30711:
--

The reason this exception is reported as an error is that whether the regular 
path or the error path is taken is only decided after the exception occurs. 
Let us consider whether the exception could be reported later, once the 
regular or error path has been decided.
Of course, this change requires agreement in the community.

The way to avoid the bytecode growth is to reduce the size of a single SQL 
statement. In this case, can you reduce the number of when clauses in one SQL 
statement?
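
To make that suggestion concrete, here is a minimal sketch (toy data and 
made-up rules, not taken from this report) of splitting one long {{when}} chain 
into two passes so that each code-generated method stays small:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SplitWhenChain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split-when").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3, 4).toDF("c")

    // First half of the rules; persist the intermediate result so the optimizer
    // does not merge the two projections back into one oversized generated method.
    val firstPass = df.withColumn("label",
      when(col("c") === 1, "one").when(col("c") === 2, "two")).cache()

    // Remaining rules in a second pass, filling only the rows left null above.
    val result = firstPass.withColumn("label",
      coalesce(col("label"), when(col("c") === 3, "three").when(col("c") === 4, "four")))

    result.show()
    spark.stop()
  }
}
{code}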

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> 

[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-08 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033112#comment-17033112
 ] 

Kazuaki Ishizaki commented on SPARK-30711:
--

[~schreiber] Sorry, I made a mistake. This test case passes with master and 
branch-2.4 on my end.

I have one question: which value do you set for {{spark.sql.codegen.fallback}}? 
The idea of whole-stage codegen is to stop using it when the generated code 
grows larger than 64KB. For that, [this 
code|https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L600-L607]
 catches the {{org.codehaus.janino.InternalCompilerException}} and falls back to 
recompiling the code in smaller pieces.
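
For reference, a quick way to check that setting in a running session (a 
sketch; the fallback should normally be enabled by default):

{code:scala}
// In spark-shell: inspect and, if needed, re-enable the fallback so that code
// exceeding the 64KB method limit runs without whole-stage codegen instead of
// failing the query.
spark.conf.get("spark.sql.codegen.fallback")
spark.conf.set("spark.sql.codegen.fallback", "true")
{code}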

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at 

[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-05 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031276#comment-17031276
 ] 

Kazuaki Ishizaki commented on SPARK-30711:
--

I am now looking at this, starting with the master branch.

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 

[jira] [Comment Edited] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-04 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029724#comment-17029724
 ] 

Kazuaki Ishizaki edited comment on SPARK-30711 at 2/4/20 10:20 AM:
---

In my environment, both v3.0.0-preview `007c873a` and master `6097b343` 
branches cause the exception.


was (Author: kiszk):
In my environment, both v3.0.0-preview and master branches cause the exception.

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> 

[jira] [Comment Edited] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-04 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029724#comment-17029724
 ] 

Kazuaki Ishizaki edited comment on SPARK-30711 at 2/4/20 10:19 AM:
---

In my environment, both v3.0.0-preview and master branches cause the exception.


was (Author: kiszk):
In my environment, both v3.0.0-preview and master branches causes the exception.

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> 

[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-04 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029724#comment-17029724
 ] 

Kazuaki Ishizaki commented on SPARK-30711:
--

In my environment, both v3.0.0-preview and master branches causes the exception.

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 

[jira] [Commented] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2020-02-03 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028784#comment-17028784
 ] 

Kazuaki Ishizaki commented on SPARK-22510:
--

[~schreiber] Thank you for reporting the problem. Could you please share a 
program that causes it? We would first like to know which operators trigger 
this issue.

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: bulk-closed, releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-09-02 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920594#comment-16920594
 ] 

Kazuaki Ishizaki edited comment on SPARK-28906 at 9/2/19 8:48 AM:
--

Regarding the information from the {{git}} command: the {{.git}} directory is 
deleted after {{git clone}} is executed, so we cannot obtain the information 
from the {{git}} command. When I tentatively stop deleting the {{.git}} 
directory, {{spark-version-info.properties}} includes the correct information, 
like:
{code}
version=2.3.4
user=ishizaki
revision=8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210
branch=HEAD
date=2019-09-02T02:31:25Z
url=https://gitbox.apache.org/repos/asf/spark.git
{code}



was (Author: kiszk):
For the information on {{git}} comand, {{.git}} directory is deleted after 
{{git clone}} is executed. When I tentatively stop deleting {{.git}} directory, 
{{spark-version-info.properties}} can include the correct information like:
{code}
version=2.3.4
user=ishizaki
revision=8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210
branch=HEAD
date=2019-09-02T02:31:25Z
url=https://gitbox.apache.org/repos/asf/spark.git
{code}


> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 
> 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` shows incorrect information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>   /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-09-01 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920594#comment-16920594
 ] 

Kazuaki Ishizaki commented on SPARK-28906:
--

Regarding the information from the {{git}} command: the {{.git}} directory is 
deleted after {{git clone}} is executed. When I tentatively stop deleting the 
{{.git}} directory, {{spark-version-info.properties}} includes the correct 
information, like:
{code}
version=2.3.4
user=ishizaki
revision=8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210
branch=HEAD
date=2019-09-02T02:31:25Z
url=https://gitbox.apache.org/repos/asf/spark.git
{code}


> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 
> 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` shows incorrect information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>   /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919893#comment-16919893
 ] 

Kazuaki Ishizaki commented on SPARK-28906:
--

For the user name, we have to pass the {{USER}} environment variable to the 
docker container at the end of {{do-release-docker.sh}}. I created a patch to 
fix this.

For the other information obtained via the {{git}} command, the 
{{spark-build-info}} script is not executed in the right directory (i.e. it 
runs outside the cloned directory). My guess is that the command is executed 
under the work directory. I have not created a patch yet.

> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 
> 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` shows incorrect information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>   /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919420#comment-16919420
 ] 

Kazuaki Ishizaki edited comment on SPARK-28906 at 8/30/19 10:51 AM:


In {{jars/spark-core_2.11-2.3.*.jar}}, {{spark-version-info.properties}} 
exists. This file is different between 2.3.0 and 2.3.4.
This file is generated by `build/spark-build-info`.

{code}
$ cat spark-version-info.properties.230
version=2.3.0
user=sameera
revision=a0d7949896e70f427e7f3942ff340c9484ff0aab
branch=master
date=2018-02-22T19:24:38Z
url=g...@github.com:sameeragarwal/spark.git
$ cat spark-version-info.properties.234
version=2.3.4
user=
revision=
branch=
date=2019-08-26T08:29:39Z
url=
{code}


was (Author: kiszk):
In {{jars/spark-core_2.11-2.3.*.jar}}, {{spark-version-info.properties}} 
exists. This file is different between 2.3.0 and 2.3.4.

{code}
$ cat spark-version-info.properties.230
version=2.3.0
user=sameera
revision=a0d7949896e70f427e7f3942ff340c9484ff0aab
branch=master
date=2018-02-22T19:24:38Z
url=g...@github.com:sameeragarwal/spark.git
$ cat spark-version-info.properties.234
version=2.3.4
user=
revision=
branch=
date=2019-08-26T08:29:39Z
url=
{code}

> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 
> 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` shows incorrect information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>   /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919420#comment-16919420
 ] 

Kazuaki Ishizaki commented on SPARK-28906:
--

In {{jars/spark-core_2.11-2.3.*.jar}}, {{spark-version-info.properties}} 
exists. This file is different between 2.3.0 and 2.3.4.

{code}
$ cat spark-version-info.properties.230
version=2.3.0
user=sameera
revision=a0d7949896e70f427e7f3942ff340c9484ff0aab
branch=master
date=2018-02-22T19:24:38Z
url=g...@github.com:sameeragarwal/spark.git
$ cat spark-version-info.properties.234
version=2.3.4
user=
revision=
branch=
date=2019-08-26T08:29:39Z
url=
{code}
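
For reference, a small sketch of how to inspect the bundled file from a Spark 
shell or application (assuming the resource is available on the classpath at 
the jar root under the name {{spark-version-info.properties}}):

{code:scala}
import java.util.Properties
import scala.collection.JavaConverters._

// Load the build-info file bundled in spark-core; empty user/revision/branch/url
// values reproduce the symptom shown above.
val props = new Properties()
val in = Thread.currentThread().getContextClassLoader
  .getResourceAsStream("spark-version-info.properties")
props.load(in)
props.asScala.toSeq.sorted.foreach { case (k, v) => println(s"$k=$v") }
{code}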

> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 
> 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` shows incorrect information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>   /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-08-30 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919330#comment-16919330
 ] 

Kazuaki Ishizaki commented on SPARK-28906:
--

I attached the output of 2.3.0 and 2.3.4 in one comment, as below. Let me look 
at the script, too.

```
$ spark-2.3.0-bin-hadoop2.6/bin/spark-submit --version
Welcome to
 __
/ __/__ ___ _/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
 
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212
Branch master
Compiled by user sameera on 2018-02-22T19:24:38Z
Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
Url g...@github.com:sameeragarwal/spark.git
Type --help for more information.
$ spark-2.3.4-bin-hadoop2.6/bin/spark-submit --version
Welcome to
 __
/ __/__ ___ _/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.4
/_/
 
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212
Branch 
Compiled by user on 2019-08-26T08:29:39Z
Revision 
Url 
Type --help for more information.
```

> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 
> 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` shows incorrect information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>   /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28891) do-release-docker.sh in master does not work for branch-2.3

2019-08-28 Thread Kazuaki Ishizaki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-28891:
-
Description: 
According to [~maropu], 
[do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh]
 in master branch worked for 2.3.3 release for branch-2.3.

After updates in [this PR|https://github.com/apache/spark/pull/23098], 
{{do-release-docker.sh}} does not work for branch-2.3 now as shown:

{code}
...
Checked out revision 35358.
Copying release tarballs
cp: cannot stat 'pyspark-*': No such file or directory
{code}

  was:
According to [~maropu], 
[do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh]
 in master branch worked for 2.3.3 release for branch-2.3.

After updates in [this PR|https://github.com/apache/spark/pull/23098], 
{{do-release-docker.sh}} does not work for branch-2.3 now.


> do-release-docker.sh in master does not work for branch-2.3
> ---
>
> Key: SPARK-28891
> URL: https://issues.apache.org/jira/browse/SPARK-28891
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.4
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> According to [~maropu], 
> [do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh]
>  in master branch worked for 2.3.3 release for branch-2.3.
> After updates in [this PR|https://github.com/apache/spark/pull/23098], 
> {{do-release-docker.sh}} does not work for branch-2.3 now as shown:
> {code}
> ...
> Checked out revision 35358.
> Copying release tarballs
> cp: cannot stat 'pyspark-*': No such file or directory
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28891) do-release-docker.sh in master does not work for branch-2.3

2019-08-28 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created SPARK-28891:


 Summary: do-release-docker.sh in master does not work for 
branch-2.3
 Key: SPARK-28891
 URL: https://issues.apache.org/jira/browse/SPARK-28891
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.3.4
Reporter: Kazuaki Ishizaki


According to [~maropu], 
[do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh]
 in master branch worked for 2.3.3 release for branch-2.3.

After updates in [this PR|https://github.com/apache/spark/pull/23098], 
{{do-release-docker.sh}} does not work for branch-2.3 now.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910983#comment-16910983
 ] 

Kazuaki Ishizaki commented on SPARK-28699:
--

[~dongjoon] Thank you for pointing out my typo. You are right. I should have 
said {{2.3.4-rc1}}.

Actually, although I ran the following, it is not reflected in the repository 
yet! After fixing this, let me restart the release process for {{2.3.4-rc1}}.

{code}
Release details:
BRANCH: branch-2.3
VERSION: 2.3.4
TAG: v2.3.4-rc1
{code}

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related with SPARK-23207 SPARK-23243
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910907#comment-16910907
 ] 

Kazuaki Ishizaki commented on SPARK-28699:
--

[~smilegator] Thank you for the cc. I will wait for this to be fixed.

I was in the middle of releasing RC1. Thus, there is already a {{2.4.4-rc1}} tag 
in the [branch-2.3|https://github.com/apache/spark/tree/branch-2.3]. Should I 
remove this tag and release rc1? Or should I leave this tag and release rc2 
first?


> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related with SPARK-23207 SPARK-23243
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder, we miss tracking the `isOrderSensitive` 
> characteristic to the newly created MapPartitionsRDD.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27761) Make UDF nondeterministic by default(?)

2019-05-20 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844396#comment-16844396
 ] 

Kazuaki Ishizaki commented on SPARK-27761:
--

IMHO, it looks good. It is fine for UDF writers to keep the current default 
behavior.
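
For reference, a minimal sketch (the UDF bodies are illustrative, assuming a 
Spark 2.4 session named {{spark}}) of how a UDF is marked non-deterministic 
with today's API:

{code:scala}
import org.apache.spark.sql.functions.udf

// Deterministic by default: the optimizer is free to collapse or reorder calls.
val plusOne = udf((x: Int) => x + 1)

// Opting out of the default explicitly, as described in the issue.
val randomBelow = udf((x: Int) => scala.util.Random.nextInt(x)).asNondeterministic()

spark.udf.register("random_below", randomBelow)
{code}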

> Make UDF nondeterministic by default(?)
> ---
>
> Key: SPARK-27761
> URL: https://issues.apache.org/jira/browse/SPARK-27761
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Priority: Minor
>
> Opening this issue as a followup from a discussion/question on this PR for an 
> optimization involving deterministic udf: 
> https://github.com/apache/spark/pull/24593#pullrequestreview-237361795  
> "We even should discuss whether all UDFs must be deterministic or 
> non-deterministic by default."
> Basically today in Spark 2.4,  Scala UDFs are marked deterministic by default 
> and it is implicit. To mark a udf as non deterministic, they need to call 
> this method asNondeterministic().
> The concern's expressed are that users are not aware of this property and its 
> implications.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27752) Update lz4-java from 1.5.2 to 1.6.0

2019-05-16 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-27752:


 Summary: Update lz4-java from 1.5.2 to 1.6.0
 Key: SPARK-27752
 URL: https://issues.apache.org/jira/browse/SPARK-27752
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


Update lz4-java that is available from https://github.com/lz4/lz4-java.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27752) Update lz4-java from 1.5.1 to 1.6.0

2019-05-16 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-27752:
-
Summary: Update lz4-java from 1.5.1 to 1.6.0  (was: Update lz4-java from 
1.5.2 to 1.6.0)

> Update lz4-java from 1.5.1 to 1.6.0
> ---
>
> Key: SPARK-27752
> URL: https://issues.apache.org/jira/browse/SPARK-27752
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Update lz4-java that is available from https://github.com/lz4/lz4-java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives

2019-05-12 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838229#comment-16838229
 ] 

Kazuaki Ishizaki commented on SPARK-27684:
--

Interesting idea
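
As a rough sketch (names and sizes are illustrative, not from the report), the 
microbenchmark in the description could be extended to a multi-argument UDF 
mixing primitive and non-primitive inputs, where only the {{Long}} argument 
would benefit from skipping the no-op converter:

{code:scala}
// Assumes a spark-shell session.
spark.udf.register("tagId", (prefix: String, x: Long) => prefix + x)

// Baseline without UDFs, then the same projection through the mixed-type UDF.
sql("select id, id * 2 from range(1000 * 1000)").rdd.count()
sql("select tagId('row-', id), tagId('row-', id * 2) from range(1000 * 1000)").rdd.count()
{code}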

> Reduce ScalaUDF conversion overheads for primitives
> ---
>
> Key: SPARK-27684
> URL: https://issues.apache.org/jira/browse/SPARK-27684
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> I believe that we can reduce ScalaUDF overheads when operating over primitive 
> types.
> In [ScalaUDF's 
> doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991]
>  we have logic to convert UDF function input types from Catalyst internal 
> types to Scala types (for example, this is used to convert UTF8Strings to 
> Java Strings). Similarly, we convert UDF return types.
> However, UDF input argument conversion is effectively a no-op for primitive 
> types because {{CatalystTypeConverters.createToScalaConverter()}} returns 
> {{identity}} in those cases. UDF result conversion is a little tricker 
> because {{createToCatalystConverter()}} returns [a 
> function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413]
>  that handles {{Option[Primitive]}}, but it might be the case that the 
> Option-boxing is unusable via ScalaUDF (in which case the conversion truly is 
> an {{identity}} no-op).
> These unnecessary no-op conversions could be quite expensive because each 
> call involves an index into the {{references}} array to get the converters, a 
> second index into the converters array to get the correct converter for the 
> nth input argument, and, finally, the converter invocation itself:
> {code:java}
> Object project_arg_0 = false ? null : ((scala.Function1[]) references[1] /* 
> converters */)[0].apply(project_value_3);{code}
> In these cases, I believe that we can reduce lookup / invocation overheads by 
> modifying the ScalaUDF code generation to eliminate the conversion calls for 
> primitives and directly assign the unconverted result, e.g.
> {code:java}
> Object project_arg_0 = false ? null : project_value_3;{code}
> To cleanly handle the case where we have a multi-argument UDF accepting a 
> mixture of primitive and non-primitive types, we might be able to keep the 
> {{converters}} array the same size (so indexes stay the same) but omit the 
> invocation of the converters for the primitive arguments (e.g. {{converters}} 
> is sparse / contains unused entries in case of primitives).
> I spotted this optimization while trying to construct some quick benchmarks 
> to measure UDF invocation overheads. For example:
> {code:java}
> spark.udf.register("identity", (x: Int) => x)
> sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() 
> // ~ 52 seconds
> sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 
> * 1000 * 1000)").rdd.count() // ~84 seconds{code}
> I'm curious to see whether the optimization suggested here can close this 
> performance gap. It'd also be a good idea to construct more principled 
> microbenchmarks covering multi-argument UDFs, projections involving multiple 
> UDFs over different input and output types, etc.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-16 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819181#comment-16819181
 ] 

Kazuaki Ishizaki commented on SPARK-27396:
--

I have one question regarding the low-level API. In my understanding, this SPIP 
proposes a code generation API for each operation at a low level to exploit 
columnar storage. How does this SPIP support storing the result of the 
generated code into columnar storage? In particular, for {{genColumnarCode()}}.
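
For readers less familiar with the existing columnar classes, here is a tiny 
sketch (using today's semi-internal {{OnHeapColumnVector}}, not the new APIs 
proposed in this SPIP) of building a {{ColumnarBatch}} and reading it back row 
by row:

{code:scala}
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// One Int column with three values, laid out contiguously in memory.
val col = new OnHeapColumnVector(3, IntegerType)
(0 until 3).foreach(i => col.putInt(i, i * 10))

val batch = new ColumnarBatch(Array[ColumnVector](col))
batch.setNumRows(3)

// Row-based access over the columnar data, as the generated code does today.
val it = batch.rowIterator()
while (it.hasNext) println(it.next().getInt(0)) // 0, 10, 20
{code}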

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
>  
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to 
>  
>  # Expose to end users a new option of processing the data in a columnar 
> format, multiple rows at a time, with the data organized into contiguous 
> arrays in memory. 
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the end user.
>  # Allow for simple data exchange with other systems, DL/ML libraries, 
> pandas, etc. by having clean APIs to transform the columnar data into an 
> Apache Arrow compatible layout.
>  # Provide a plugin mechanism for columnar processing support so an advanced 
> user could avoid data transition between columnar and row based processing 
> even through shuffles. This means we should at least support pluggable APIs 
> so an advanced end user can implement the columnar partitioning themselves, 
> and provide the glue necessary to shuffle the data still in a columnar format.
>  # Expose new APIs that allow advanced users or frameworks to implement 
> columnar processing either as UDFs, or by adjusting the physical plan to do 
> columnar processing.  If the latter is too controversial we can move it to 
> another SPIP, but we plan to implement some accelerated computing in parallel 
> with this feature to be sure the APIs work, and without this feature it makes 
> that impossible.
>  
> Not Requirements, but things that would be nice to have.
>  # Provide default implementations for partitioning columnar data, so users 
> don’t have to.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  # Provide a clean transition from the existing code to the new one.  The 
> existing APIs which are public but evolving are not that far off from what is 
> being proposed.  We should be able to create a new parallel API that can wrap 
> the existing one. This means any file format that is trying to support 
> columnar can still do so until we make a conscious decision to deprecate and 
> then turn off the old APIs.
>  
> *Q2.* What problem is this proposal NOT designed to solve?
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation, and possibly default 
> implementations for partitioning of columnar shuffle. 
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
>  # Input formats, optionally can return a ColumnarBatch instead of rows.  The 
> code generation phase knows how to take that columnar data and iterate 
> through it as rows for stages that wants rows, which currently is almost 
> everything.  The limitations here are mostly implementation specific. The 
> current standard is to abuse Scala’s type erasure to return ColumnarBatches 
> as the elements of an RDD[InternalRow]. The code generation can handle this 
> because it is generating java code, so it bypasses scala’s type checking and 
> just casts the InternalRow to the desired ColumnarBatch.  This makes it 
> difficult for others to implement the same functionality for different 
> processing because they can only do it through code generation. There really 
> is no clean separate path in the code generation for columnar vs row based. 
> Additionally because it is only supported through code generation if for any 
> reason code generation would fail there is no backup.  This is typically fine 
> for input formats but can 

[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-13 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817152#comment-16817152
 ] 

Kazuaki Ishizaki commented on SPARK-27396:
--

Thank you for sharing the low-level APIs for Spark implementors. Could you please 
share your current thinking on the high-level APIs that Spark application 
developers will use from their Spark applications?

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
>  
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to 
>  
>  # Expose to end users a new option of processing the data in a columnar 
> format, multiple rows at a time, with the data organized into contiguous 
> arrays in memory. 
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the end user.
>  # Allow for simple data exchange with other systems, DL/ML libraries, 
> pandas, etc. by having clean APIs to transform the columnar data into an 
> Apache Arrow compatible layout.
>  # Provide a plugin mechanism for columnar processing support so an advanced 
> user could avoid data transition between columnar and row based processing 
> even through shuffles. This means we should at least support pluggable APIs 
> so an advanced end user can implement the columnar partitioning themselves, 
> and provide the glue necessary to shuffle the data still in a columnar format.
>  # Expose new APIs that allow advanced users or frameworks to implement 
> columnar processing either as UDFs, or by adjusting the physical plan to do 
> columnar processing.  If the latter is too controversial we can move it to 
> another SPIP, but we plan to implement some accelerated computing in parallel 
> with this feature to be sure the APIs work, and without this feature it makes 
> that impossible.
>  
> Not Requirements, but things that would be nice to have.
>  # Provide default implementations for partitioning columnar data, so users 
> don’t have to.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  # Provide a clean transition from the existing code to the new one.  The 
> existing APIs which are public but evolving are not that far off from what is 
> being proposed.  We should be able to create a new parallel API that can wrap 
> the existing one. This means any file format that is trying to support 
> columnar can still do so until we make a conscious decision to deprecate and 
> then turn off the old APIs.
>  
> *Q2.* What problem is this proposal NOT designed to solve?
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation, and possibly default 
> implementations for partitioning of columnar shuffle. 
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
>  # Input formats, optionally can return a ColumnarBatch instead of rows.  The 
> code generation phase knows how to take that columnar data and iterate 
> through it as rows for stages that wants rows, which currently is almost 
> everything.  The limitations here are mostly implementation specific. The 
> current standard is to abuse Scala’s type erasure to return ColumnarBatches 
> as the elements of an RDD[InternalRow]. The code generation can handle this 
> because it is generating java code, so it bypasses scala’s type checking and 
> just casts the InternalRow to the desired ColumnarBatch.  This makes it 
> difficult for others to implement the same functionality for different 
> processing because they can only do it through code generation. There really 
> is no clean separate path in the code generation for columnar vs row based. 
> Additionally because it is only supported through code generation if for any 
> reason code generation would fail there is no backup.  This is typically fine 
> for input formats but can be problematic when we get into more extensive 
> processing. 
>  # When caching data it can optionally 

[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815812#comment-16815812
 ] 

Kazuaki Ishizaki commented on SPARK-27396:
--

Sorry for being late to comment since I am traveling. 

Could you please let us know the proposed API changes regarding Q1-4 and Q1-5?

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
>  
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to 
>  
>  # Expose to end users a new option of processing the data in a columnar 
> format, multiple rows at a time, with the data organized into contiguous 
> arrays in memory. 
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the end user.
>  # Allow for simple data exchange with other systems, DL/ML libraries, 
> pandas, etc. by having clean APIs to transform the columnar data into an 
> Apache Arrow compatible layout.
>  # Provide a plugin mechanism for columnar processing support so an advanced 
> user could avoid data transition between columnar and row based processing 
> even through shuffles. This means we should at least support pluggable APIs 
> so an advanced end user can implement the columnar partitioning themselves, 
> and provide the glue necessary to shuffle the data still in a columnar format.
>  # Expose new APIs that allow advanced users or frameworks to implement 
> columnar processing either as UDFs, or by adjusting the physical plan to do 
> columnar processing.  If the latter is too controversial we can move it to 
> another SPIP, but we plan to implement some accelerated computing in parallel 
> with this feature to be sure the APIs work, and without this feature it makes 
> that impossible.
>  
> Not Requirements, but things that would be nice to have.
>  # Provide default implementations for partitioning columnar data, so users 
> don’t have to.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  # Provide a clean transition from the existing code to the new one.  The 
> existing APIs which are public but evolving are not that far off from what is 
> being proposed.  We should be able to create a new parallel API that can wrap 
> the existing one. This means any file format that is trying to support 
> columnar can still do so until we make a conscious decision to deprecate and 
> then turn off the old APIs.
>  
> *Q2.* What problem is this proposal NOT designed to solve?
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation, and possibly default 
> implementations for partitioning of columnar shuffle. 
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
>  # Input formats, optionally can return a ColumnarBatch instead of rows.  The 
> code generation phase knows how to take that columnar data and iterate 
> through it as rows for stages that wants rows, which currently is almost 
> everything.  The limitations here are mostly implementation specific. The 
> current standard is to abuse Scala’s type erasure to return ColumnarBatches 
> as the elements of an RDD[InternalRow]. The code generation can handle this 
> because it is generating java code, so it bypasses scala’s type checking and 
> just casts the InternalRow to the desired ColumnarBatch.  This makes it 
> difficult for others to implement the same functionality for different 
> processing because they can only do it through code generation. There really 
> is no clean separate path in the code generation for columnar vs row based. 
> Additionally because it is only supported through code generation if for any 
> reason code generation would fail there is no backup.  This is typically fine 
> for input formats but can be problematic when we get into more extensive 
> processing. 
>  # When caching data it can optionally be cached in a columnar format if the 
> input is also columnar.  

[jira] [Created] (SPARK-27397) Take care of OpenJ9 in JVM dependent parts

2019-04-06 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-27397:


 Summary: Take care of OpenJ9 in JVM dependent parts
 Key: SPARK-27397
 URL: https://issues.apache.org/jira/browse/SPARK-27397
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


Spark includes multiple pieces of JVM-dependent code, such as {{SizeEstimator}}. 
The current code takes care of the IBM JDK and OpenJDK. Recently, OpenJ9 has 
been released, but it is not considered yet. 
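
A rough sketch (not taken from Spark's code) of how a JVM-dependent code path 
might detect OpenJ9, which reports itself through the {{java.vm.name}} system 
property:

{code:scala}
// "Eclipse OpenJ9 VM" on OpenJ9; HotSpot-based JDKs report e.g. "... 64-Bit Server VM".
val vmName = System.getProperty("java.vm.name", "")
val isOpenJ9 = vmName.toLowerCase.contains("openj9")
println(s"vm=$vmName, openj9=$isOpenJ9")
{code}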



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26508) Address warning messages in Java by lgtm.com

2018-12-31 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-26508:


 Summary: Address warning messages in Java by lgtm.com
 Key: SPARK-26508
 URL: https://issues.apache.org/jira/browse/SPARK-26508
 Project: Spark
  Issue Type: Improvement
  Components: Examples, Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


[lgtm.com|http://lgtm.com] provides automated code review of 
Java/Python/JavaScript files for OSS projects.

[Here|https://lgtm.com/projects/g/apache/spark/alerts/?mode=list=warning]
 are warning messages regarding Apache Spark project.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26463) Use ConfigEntry for hardcoded configs for scheduler categories.

2018-12-29 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730882#comment-16730882
 ] 

Kazuaki Ishizaki commented on SPARK-26463:
--

I will work on this.
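
For context, a sketch of the {{ConfigEntry}}-based form (the key and default 
are illustrative; the real entries are defined inside Spark's own 
{{org.apache.spark.internal.config}} package, as {{ConfigBuilder}} is internal 
API):

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

// Define the key once with a type and default, instead of hardcoding the
// string and default value at every call site.
val DYN_ALLOCATION_ENABLED =
  ConfigBuilder("spark.dynamicAllocation.enabled")
    .booleanConf
    .createWithDefault(false)

// Callers then read it through the typed entry, e.g. conf.get(DYN_ALLOCATION_ENABLED).
{code}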

> Use ConfigEntry for hardcoded configs for scheduler categories.
> ---
>
> Key: SPARK-26463
> URL: https://issues.apache.org/jira/browse/SPARK-26463
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Make the following hardcoded configs to use {{ConfigEntry}}.
> {code}
> spark.dynamicAllocation
> spark.scheduler
> spark.rpc
> spark.task
> spark.speculation
> spark.cleaner
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26442) Use ConfigEntry for hardcoded configs.

2018-12-29 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730883#comment-16730883
 ] 

Kazuaki Ishizaki commented on SPARK-26442:
--

Thank you for updaing them.

> Use ConfigEntry for hardcoded configs.
> --
>
> Key: SPARK-26442
> URL: https://issues.apache.org/jira/browse/SPARK-26442
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> This umbrella JIRA is to make hardcoded configs to use {{ConfigEntry}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26442) Use ConfigEntry for hardcoded configs.

2018-12-29 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730883#comment-16730883
 ] 

Kazuaki Ishizaki edited comment on SPARK-26442 at 12/30/18 4:18 AM:


Thank you for updating them.


was (Author: kiszk):
Thank you for updaing them.

> Use ConfigEntry for hardcoded configs.
> --
>
> Key: SPARK-26442
> URL: https://issues.apache.org/jira/browse/SPARK-26442
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> This umbrella JIRA is to make hardcoded configs to use {{ConfigEntry}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26477) Use ConfigEntry for hardcoded configs for unsafe category.

2018-12-28 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730340#comment-16730340
 ] 

Kazuaki Ishizaki commented on SPARK-26477:
--

I will work on this.

> Use ConfigEntry for hardcoded configs for unsafe category.
> --
>
> Key: SPARK-26477
> URL: https://issues.apache.org/jira/browse/SPARK-26477
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24498) Add JDK compiler for runtime codegen

2018-11-29 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-24498.
--
Resolution: Won't Do

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> time in compilation than Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-11-29 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703556#comment-16703556
 ] 

Kazuaki Ishizaki commented on SPARK-24498:
--

I see. Let us close this for now.
We may need some strategy to choose the better Java bytecode compiler at runtime.
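
For reference, a minimal sketch (the file path is illustrative) of invoking the 
JDK compiler through {{javax.tools}}, the alternative backend discussed here; 
Janino would remain the default:

{code:scala}
import javax.tools.ToolProvider

// Requires running on a JDK; getSystemJavaCompiler returns null on a bare JRE.
val compiler = ToolProvider.getSystemJavaCompiler
val exitCode = compiler.run(null, null, null, "/tmp/GeneratedIterator.java")
println(s"compiled ok = ${exitCode == 0}")
{code}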

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> time in compilation than Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26212) Upgrade maven from 3.5.4 to 3.6.0

2018-11-29 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-26212:


 Summary: Upgrade maven from 3.5.4 to 3.6.0
 Key: SPARK-26212
 URL: https://issues.apache.org/jira/browse/SPARK-26212
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


Since maven 3.6.0 has been released on 2018 Oct, it would be good to use the 
latest one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description

2018-11-17 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690437#comment-16690437
 ] 

Kazuaki Ishizaki commented on SPARK-24255:
--

I do not know of an existing library that parses the output of {{java -version}}.
You may also want to note the difference between OpenJDK and the Oracle JDK, as shown 
[here|https://stackoverflow.com/questions/36445502/bash-command-to-check-if-oracle-or-openjdk-java-version-is-installed-on-linux]
 and [there|https://qiita.com/mao172/items/42aa841280dc5a4e9924].

Output of OpenJDK 12-ea.
{code}
$ ../OpenJDK-12/java -version
openjdk version "12-ea" 2019-03-19
OpenJDK Runtime Environment (build 12-ea+20)
OpenJDK 64-Bit Server VM (build 12-ea+20, mixed mode, sharing)

$ ../OpenJDK-12/java Version
jave.specification.version=12
jave.version=12-ea
jave.version.split(".")[0]=12-ea
{code}
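
In case it helps, a rough sketch (not what SparkR actually does) of deriving 
the major version from {{java.specification.version}}, which is {{1.8}} on 
Java 8 and a plain number on Java 9 and later:

{code:scala}
// "1.8" -> 8, "11" -> 11, "12" -> 12 (also covers early-access builds).
val spec = System.getProperty("java.specification.version")
val major =
  if (spec.startsWith("1.")) spec.stripPrefix("1.").takeWhile(_.isDigit).toInt
  else spec.takeWhile(_.isDigit).toInt
println(s"major Java version: $major")
{code}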

> Require Java 8 in SparkR description
> 
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> CRAN checks require that the Java version be declared in the package 
> description and checked at runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-11-04 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674442#comment-16674442
 ] 

Kazuaki Ishizaki commented on SPARK-25776:
--

Issue resolved by pull request 22754
https://github.com/apache/spark/pull/22754
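
For readers wondering where the 12 comes from, a short sketch of the arithmetic 
behind the minimum buffer size (the relevant setting is 
{{spark.shuffle.spill.diskWriteBufferSize}}):

{code:scala}
// The spill writer stages the record length (an Int) and the key prefix
// (a Long) in the disk write buffer before the record bytes themselves.
val recordLengthBytes = java.lang.Integer.BYTES // 4
val keyPrefixBytes    = java.lang.Long.BYTES    // 8
assert(recordLengthBytes + keyPrefixBytes == 12) // hence the minimum of 12
{code}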

> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 3.0.0
>
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-11-04 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25776:
-
Fix Version/s: 3.0.0

> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 3.0.0
>
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-11-04 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-25776.
--
Resolution: Fixed
  Assignee: liuxian

> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-25 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663394#comment-16663394
 ] 

Kazuaki Ishizaki commented on SPARK-25829:
--

I am curious about behavior in other systems such as Presto.
Here are test cases for 
[array|https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/type/TestArrayOperators.java]
 and 
[map|https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/type/TestMapOperators.java].

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-25 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663368#comment-16663368
 ] 

Kazuaki Ishizaki commented on SPARK-25829:
--

cc [~ueshin]

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25824) Remove duplicated map entries in `showString`

2018-10-25 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663361#comment-16663361
 ] 

Kazuaki Ishizaki commented on SPARK-25824:
--

cc [~ueshin]
During the implementation of the array/map functions in Spark 2.4, the 
community discussed how duplicated keys should be treated. IIUC, the current 
Spark SQL does not define the behavior.

> Remove duplicated map entries in `showString`
> -
>
> Key: SPARK-25824
> URL: https://issues.apache.org/jira/browse/SPARK-25824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `showString` doesn't eliminate the duplication, so it looks different from 
> the result of `collect` and from selecting saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("SELECT map(1,2,1,3)").show
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> ++
> |   a|
> ++
> |[1 -> 3]|
> ++
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649428#comment-16649428
 ] 

Kazuaki Ishizaki commented on SPARK-25728:
--

cc: [~viirya], [~cloud_fan], [~mgaido], [~hyukjin.kwon], [~smilegator], 
[~rednaxelafx]

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using the DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.
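
For context, a toy sketch (not the design proposed in the SPIP or the Google 
Doc) contrasting today's string-based code generation with a small structured 
IR for the same expression:

{code:scala}
sealed trait Expr
case class Ref(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// A structured tree can be analyzed and rewritten before it is printed...
def emit(e: Expr): String = e match {
  case Ref(n)    => n
  case Add(l, r) => s"(${emit(l)} + ${emit(r)})"
}
val fromIr = emit(Add(Ref("a"), Ref("b")))

// ...whereas the string-based form is opaque once concatenated.
val fromStrings = "(" + "a" + " + " + "b" + ")"
assert(fromIr == fromStrings) // both are "(a + b)"
{code}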



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25728:
-
Comment: was deleted

(was: This JIRA entry is to start a discussion about adding structure 
intermediate representation for generating Java code from a program using 
DataFrame or Dataset API, in addition to the current String-based 
representation.

This addition is based on the discussions in [a 
thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].

Please feel free to comment on this JIRA entry or [Google 
Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
 too.

)

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25728:
-
Description: 
This JIRA entry is to start a discussion about adding a structured intermediate 
representation for generating Java code from a program using the DataFrame or 
Dataset API, in addition to the current String-based representation.

This addition is based on the discussions in [a 
thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].

Please feel free to comment on this JIRA entry or [Google 
Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
 too.


> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using the DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649422#comment-16649422
 ] 

Kazuaki Ishizaki commented on SPARK-25728:
--

This JIRA entry is to start a discussion about adding a structured intermediate 
representation for generating Java code from a program using the DataFrame or 
Dataset API, in addition to the current String-based representation.

This addition is based on the discussions in [a 
thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].

Please feel free to comment on this JIRA entry or [Google 
Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
 too.



> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25728:
-
External issue URL:   (was: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing)

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25728:
-
External issue URL: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
 External issue ID:   (was: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing)

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25728:
-
External issue ID: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25728:


 Summary: SPIP: Structured Intermediate Representation (Tungsten 
IR) for generating Java code
 Key: SPARK-25728
 URL: https://issues.apache.org/jira/browse/SPARK-25728
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs

2018-10-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-25497.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

> limit operation within whole stage codegen should not consume all the inputs
> 
>
> Key: SPARK-25497
> URL: https://issues.apache.org/jira/browse/SPARK-25497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue was discovered during https://github.com/apache/spark/pull/21738 . 
> It turns out that limit is not whole-stage-codegened correctly and always 
> consume all the inputs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs

2018-10-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reassigned SPARK-25497:


Assignee: Wenchen Fan

> limit operation within whole stage codegen should not consume all the inputs
> 
>
> Key: SPARK-25497
> URL: https://issues.apache.org/jira/browse/SPARK-25497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> This issue was discovered during https://github.com/apache/spark/pull/21738 . 
> It turns out that limit is not whole-stage-codegened correctly and always 
> consume all the inputs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344
 ] 

Kazuaki Ishizaki edited comment on SPARK-25538 at 10/1/18 5:21 PM:
---

This test case does not print {{63}} when using the master branch.

{code}
  test("test2") {
val df = spark.read.parquet("file:///SPARK-25538-repro")
val c1 = df.distinct.count
val c2 = df.sort("col_0").distinct.count
val c3 = df.withColumnRenamed("col_0", "new").distinct.count
val c0 = df.count
print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}


was (Author: kiszk):
This test case does not print {{63}}.

{code}
  test("test2") {
val df = spark.read.parquet("file:///SPARK-25538-repro")
val c1 = df.distinct.count
val c2 = df.sort("col_0").distinct.count
val c3 = df.withColumnRenamed("col_0", "new").distinct.count
val c0 = df.count
print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

This test case does not print {{63}}.

{code}
  test("test2") {
val df = spark.read.parquet("file:///SPARK-25538-repro")
val c1 = df.distinct.count
val c2 = df.sort("col_0").distinct.count
val c3 = df.withColumnRenamed("col_0", "new").distinct.count
val c0 = df.count
print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-30 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633568#comment-16633568
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

Thank you. I will check it tonight in Japan.

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-28 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631605#comment-16631605
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

Thank you for uploading the schema. Having looked at it, I am still not sure 
about the cause of this problem.
I would appreciate it if you could provide input data that reproduces the 
problem.
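
For illustration only, a minimal sketch (not from this ticket) of how synthetic input data could be written to Parquet as a reproducer; the path, schema, and values are hypothetical placeholders:

{code:scala}
// Hypothetical reproducer-data generator; adjust the schema and values to match the real case.
val df = spark.range(0, 123)
  .selectExpr("cast(id % 7 as string) as col_0", "id as col_1")
df.write.mode("overwrite").parquet("/tmp/SPARK-25538-repro")

// Sanity check of the counts on the generated data.
val repro = spark.read.parquet("/tmp/SPARK-25538-repro")
println(s"count=${repro.count}, distinct=${repro.distinct.count}")
{code}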

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629281#comment-16629281
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

Hi [~Steven Rand], would it be possible to share the schema of this DataFrame?


> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25487) Refactor PrimitiveArrayBenchmark

2018-09-21 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-25487.
--
   Resolution: Fixed
 Assignee: Chenxiao Mao
Fix Version/s: 2.5.0

Issue resolved by pull request 22497
https://github.com/apache/spark/pull/22497

> Refactor PrimitiveArrayBenchmark
> 
>
> Key: SPARK-25487
> URL: https://issues.apache.org/jira/browse/SPARK-25487
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.5.0
>
>
> Refactor PrimitiveArrayBenchmark to use main method and print the output as a 
> separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code

2018-09-19 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621416#comment-16621416
 ] 

Kazuaki Ishizaki commented on SPARK-25432:
--

Nit: the description seems to be in the {{environment}} field now. 

> Consider if using standard getOrCreate from PySpark into JVM SparkSession 
> would simplify code
> -
>
> Key: SPARK-25432
> URL: https://issues.apache.org/jira/browse/SPARK-25432
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: As we saw in 
> [https://github.com/apache/spark/pull/22295/files] the logic can get a bit 
> out of sync. It _might_ make sense to try and simplify this so there's less 
> duplicated logic in Python & Scala around session set up.
>Reporter: holdenk
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance

2018-09-19 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25437:
-
Comment: was deleted

(was: Is such a feature for major release, not for maintenance release?)

> Using OpenHashMap replace HashMap improve Encoder Performance
> -
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance

2018-09-16 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617102#comment-16617102
 ] 

Kazuaki Ishizaki commented on SPARK-25437:
--

Is such a feature intended for a major release rather than for a maintenance release?

> Using OpenHashMap replace HashMap improve Encoder Performance
> -
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method

2018-09-16 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25444:


 Summary: Refactor GenArrayData.genCodeToCreateArrayData() method
 Key: SPARK-25444
 URL: https://issues.apache.org/jira/browse/SPARK-25444
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.5.0
Reporter: Kazuaki Ishizaki


{{GenArrayData.genCodeToCreateArrayData()}} generates Java code that creates a 
temporary Java array in order to build an {{ArrayData}}. The temporary array can 
be eliminated by using {{ArrayData.createArrayData}}.
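
For illustration only, a rough Scala-level sketch (not the actual generated Java code) of the temporary-array shape that the generated code currently follows; {{GenericArrayData}} stands in for the wrapping step:

{code:scala}
import org.apache.spark.sql.catalyst.util.GenericArrayData

// Conceptually, the generated code first fills a temporary array ...
val tmp = new Array[Any](3)
tmp(0) = 1
tmp(1) = 2
tmp(2) = 3
// ... and then wraps it into an ArrayData instance.
val arrayData = new GenericArrayData(tmp)

// The proposal is to allocate the ArrayData directly and write the
// elements into it, so the temporary Java array is no longer needed.
{code}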



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2018-09-12 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611717#comment-16611717
 ] 

Kazuaki Ishizaki commented on SPARK-20184:
--

In {{branch-2.4}}, we still see the performance degradation compared to running 
without codegen:
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
SPARK-20184:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative

codegen = T                            2915 / 3204          0.0  2915001883.0       1.0X
codegen = F                            1178 / 1368          0.0  1178020462.0       2.5X
{code}
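
For reference, a minimal sketch (not part of the measurement above) of how whole-stage codegen can be toggled when comparing such runs; the query here is only a small placeholder for the long aggregation in this ticket:

{code:scala}
// Toggle whole-stage codegen before running the query under test.
spark.conf.set("spark.sql.codegen.wholeStage", "true")   // codegen = T
// spark.conf.set("spark.sql.codegen.wholeStage", "false") // codegen = F

// Placeholder query standing in for the aggregation over aggtable.
spark.range(0, 3500)
  .selectExpr("id % 10 as DIM_1", "id as c")
  .groupBy("DIM_1")
  .sum("c")
  .collect()
{code}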
 

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>Priority: Major
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis i think this is related to the huge java method(a java 
> method of thousand lines) which generated by codegen.
> And If i config -XX:-DontCompileHugeMethods the performance get much 
> better(about 7s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611633#comment-16611633
 ] 

Kazuaki Ishizaki commented on SPARK-16196:
--

[~cloud_fan] The PR in this JIRA entry proposes two fixes:
 # Read data in a table cache directly from the columnar storage
 # Generate code to build a table cache

We have already implemented 1., but we have not implemented 2. yet. Let us 
address 2. in the next release.

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611502#comment-16611502
 ] 

Kazuaki Ishizaki commented on SPARK-20184:
--

Although I created another JIRA, 
https://issues.apache.org/jira/browse/SPARK-20479, there is no PR yet. Let me 
check the performance on the 2.4 branch.

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>Priority: Major
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis i think this is related to the huge java method(a java 
> method of thousand lines) which generated by codegen.
> And If i config -XX:-DontCompileHugeMethods the performance get much 
> better(about 7s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611494#comment-16611494
 ] 

Kazuaki Ishizaki commented on SPARK-16196:
--

I see. I will check this.

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result

2018-09-09 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25388:


 Summary: checkEvaluation may miss incorrect nullable of DataType 
in the result
 Key: SPARK-25388
 URL: https://issues.apache.org/jira/browse/SPARK-25388
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


The current {{checkEvaluation}} may miss an incorrect nullable of {{DataType}} in 
{{checkEvaluationWithUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-05 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604120#comment-16604120
 ] 

Kazuaki Ishizaki commented on SPARK-25317:
--

While investigating this issue, I realized that the Java bytecode size of a 
method can change performance. I guess that this issue is related to method 
inlining. However, I have not found the root cause yet.

[~mgaido] Would it be possible for you to submit a PR to fix this issue?

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> eThere is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method

2018-09-04 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25338:


 Summary: Several tests miss calling super.afterAll() in their 
afterAll() method
 Key: SPARK-25338
 URL: https://issues.apache.org/jira/browse/SPARK-25338
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


The following tests under {{external}} may not call {{super.afterAll()}} in 
their {{afterAll()}} method.

{code}
external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala
external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala
{code}
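
For illustration, a minimal sketch (not taken from any of the suites listed above) of the expected cleanup pattern, calling {{super.afterAll()}} in a {{finally}} block:

{code:scala}
import org.apache.spark.SparkFunSuite

// Hypothetical suite name; only the afterAll() pattern matters here.
class ExampleStreamSuite extends SparkFunSuite {
  override def afterAll(): Unit = {
    try {
      // suite-specific cleanup (stop servers, delete temp dirs, ...)
    } finally {
      super.afterAll()
    }
  }
}
{code}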



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602567#comment-16602567
 ] 

Kazuaki Ishizaki commented on SPARK-25317:
--

I confirmed this performance difference even after adding a warm-up. Let me 
investigate further.

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> eThere is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602506#comment-16602506
 ] 

Kazuaki Ishizaki commented on SPARK-25317:
--

Let me run this on 2.3 and master.
One question: this benchmark does not have a warm-up loop. In other words, it 
may also include execution time spent in the interpreter. Is this behavior 
intentional?
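
For illustration, a minimal sketch (not from this ticket) of how a warm-up phase could be added before timing, so that JIT-compiled rather than interpreted execution is measured; the iteration counts are arbitrary:

{code:scala}
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

val str = UTF8String.fromString("b" * 10001)

// Warm-up: let the JIT compile the hot method before measuring.
for (_ <- 0 until 100000) Murmur3_x86_32.hashUTF8String(str, 0)

// Measured run.
val numIter = 10
val start = System.nanoTime
for (_ <- 0 until numIter) Murmur3_x86_32.hashUTF8String(str, 0)
println(s"duration ${(System.nanoTime - start) / 1000 / numIter} us")
{code}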

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> eThere is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25317:
-
Description: 
There is a performance regression when calculating hash code for UTF8String:
{code:java}
  test("hashing") {
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String
val hasher = new Murmur3_x86_32(0)
val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val start = System.nanoTime
for (i <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
  }
{code}
To run this test in 2.3, we need to add
{code:java}
public static int hashUTF8String(UTF8String str, int seed) {
return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
str.numBytes(), seed);
  }
{code}
to `Murmur3_x86_32`

In my laptop, the result for master vs 2.3 is: 120 us vs 40 us

  was:
There is a performance regression when calculating hash code for UTF8String:

{code}
  test("hashing") {
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String
val hasher = new Murmur3_x86_32(0)
val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val start = System.nanoTime
for (i <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
  }
{code}

To run this test in 2.3, we need to add
{code}
public static int hashUTF8String(UTF8String str, int seed) {
return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
str.numBytes(), seed);
  }
{code}
to `Murmur3_x86_32`

In my laptop, the result for master vs 2.3 is: 120 us vs 40 us


> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> eThere is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import 

[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException

2018-09-02 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25310:
-
Description: 
Invoking {{ArraysOverlap}} function with non-nullable array type throws the 
following error in the code generation phase.

{code:java}
Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, 
Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an 
rvalue
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, 
Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an 
rvalue
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260)
{code}
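
For reference, a minimal reproduction sketch (not from the ticket) using the built-in {{arrays_overlap}} SQL function; literal arrays are non-nullable, which is the case that triggers the error above:

{code:scala}
// This may reproduce the CompileException during code generation on an affected build.
spark.sql("SELECT arrays_overlap(array(1, 2, 3), array(4, 5, 3))").show()
{code}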

> ArraysOverlap may throw a CompileException
> --
>
> Key: SPARK-25310
> URL: https://issues.apache.org/jira/browse/SPARK-25310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Invoking {{ArraysOverlap}} function with non-nullable array type throws the 
> following error in the code generation phase.
> {code:java}
> Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: Expression "isNull_0" is not an rvalue
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: Expression "isNull_0" is not an rvalue
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
>   at 
> 

[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException

2018-09-02 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25310:
-
Summary: ArraysOverlap may throw a CompileException  (was: ArraysOverlap 
throws an Exception)

> ArraysOverlap may throw a CompileException
> --
>
> Key: SPARK-25310
> URL: https://issues.apache.org/jira/browse/SPARK-25310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25310) ArraysOverlap throws an Exception

2018-09-02 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25310:


 Summary: ArraysOverlap throws an Exception
 Key: SPARK-25310
 URL: https://issues.apache.org/jira/browse/SPARK-25310
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25178) Directly ship the StructType objects of the keySchema / valueSchema for xxxHashMapGenerator

2018-08-22 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25178:
-
Summary: Directly ship the StructType objects of the keySchema / 
valueSchema for xxxHashMapGenerator  (was: Use dummy name for 
xxxHashMapGenerator key/value schema field)

> Directly ship the StructType objects of the keySchema / valueSchema for 
> xxxHashMapGenerator
> ---
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-21 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587897#comment-16587897
 ] 

Kazuaki Ishizaki commented on SPARK-25178:
--

[~rednaxelafx] Thank you for opening a JIRA entry :)
[~smilegator] I can take this.

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25036:
-
Description: 
When compiling with sbt, the following errors occur:

There are -two- three types:
1. {{ExprValue.isNull}} is compared with unexpected type.
2. {{match may not be exhaustive}} is detected at {{match}}
3. discarding unmoored doc comment

The first one is more serious since it may also generate incorrect code in 
Spark 2.3.
{code:java}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
Boolean = (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn] (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
 match may not be exhaustive.
[error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
ArrayData()), (_, _)
[error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
 match may not be exhaustive.
[error] It would fail on the following inputs: NewFunctionSpec(_, None, 
Some(_)), NewFunctionSpec(_, Some(_), None)
[error] [warn] newFunction match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely always compare unequal
[error] [warn] if (eval.isNull != "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]  if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn] if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
 match may not be exhaustive.
[error] It would fail on the following input: Schema((x: 
org.apache.spark.sql.types.DataType forSome x not in 
org.apache.spark.sql.types.StructType), _)
[error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
[error] [warn] 
{code}


{code:java}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:441:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
...
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:440:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
{code}

  was:
When compiling with sbt, the following errors occur:

There are two types:
1. {{ExprValue.isNull}} is compared with unexpected type.
1. {{match may not be exhaustive}} is detected at {{match}}

The first one is more serious since it may also generate incorrect code in 
Spark 2.3.

{code}
[error] [warn] 

[jira] [Comment Edited] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575137#comment-16575137
 ] 

Kazuaki Ishizaki edited comment on SPARK-25036 at 8/9/18 5:05 PM:
--

Another type of compilation error has been found. I added the log to the description.


was (Author: kiszk):
Another type of compilation error is found

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with unexpected type.
> 1. {{match may not be exhaustive}} is detected at {{match}}
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reopened SPARK-25036:
--

Another type of compilation error has been found.

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with unexpected type.
> 1. {{match may not be exhaustive}} is detected at {{match}}
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25059) Exception while executing an action on DataFrame that read Json

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575129#comment-16575129
 ] 

Kazuaki Ishizaki commented on SPARK-25059:
--

Thank you for reporting the issue. Could you please try this with Spark 2.3? The community extensively investigated and fixed these code-generation issues in Spark 2.3.
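
For reference, below is a small, self-contained sketch (not the reporter's actual workload; all names are illustrative) of the general pattern that made generated methods grow beyond 64 KB on Spark 2.2: a projection over a very wide schema, similar to what JSON schema inference can produce. Spark 2.3 added splitting and fallback paths in code generation, so the same pattern should no longer fail there.

{code:scala}
// Hypothetical reproducer sketch, not the original workload.
// A projection over thousands of columns makes the generated
// SpecificUnsafeProjection method very large; on Spark 2.2 this kind of
// plan could exceed the JVM's 64 KB per-method bytecode limit.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

object WideSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wide-schema-sketch")
      .getOrCreate()
    import spark.implicits._

    // Simulate a very wide, JSON-inferred schema with literal columns.
    val wideCols = (1 to 2000).map(i => lit(i).as(s"c$i"))
    val df = Seq(1, 2, 3).toDF("c0").select((col("c0") +: wideCols): _*)

    // Collecting a row forces code generation for the full projection.
    println(df.head())
    spark.stop()
  }
}
{code}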

> Exception while executing an action on DataFrame that read Json
> ---
>
> Key: SPARK-25059
> URL: https://issues.apache.org/jira/browse/SPARK-25059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
> Environment: AWS EMR 5.8.0 
> Spark 2.2.0 
>  
>Reporter: Kunal Goswami
>Priority: Major
>  Labels: Spark-SQL
>
> When I try to read ~9600 Json files using
> {noformat}
> val test = spark.read.option("header", true).option("inferSchema", 
> true).json(paths: _*) {noformat}
>  
> Any action on the above created data frame results in: 
> {noformat}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply2_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class "org.apache.spark.sql.catalyst.expressions.Generat[73/1850]
> pecificUnsafeProjection" grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:839)
>   at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:11081)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4546)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> 

[jira] [Updated] (SPARK-25041) genjavadoc-plugin_0.10 is not found with sbt in scala-2.12

2018-08-07 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25041:
-
Summary: genjavadoc-plugin_0.10 is not found with sbt in scala-2.12  (was: 
genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12)

> genjavadoc-plugin_0.10 is not found with sbt in scala-2.12
> --
>
> Key: SPARK-25041
> URL: https://issues.apache.org/jira/browse/SPARK-25041
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> When the master is build with sbt in scala-2.12, the following error occurs:
> {code}
> [warn]module not found: 
> com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10
> [warn]  public: tried
> [warn]   
> https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom
> [warn]  Maven2 Local: tried
> [warn]   
> file:/gsa/jpngsa/home/i/s/ishizaki/.m2/repository/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom
> [warn]  local: tried
> [warn]   
> /gsa/jpngsa/home/i/s/ishizaki/.ivy2/local/com.typesafe.genjavadoc/genjavadoc-plugin_2.12.6/0.10/ivys/ivy.xml
> [info] Resolving jline#jline;2.14.3 ...
> [warn]::
> [warn]::  UNRESOLVED DEPENDENCIES ::
> [warn]::
> [warn]:: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not 
> found
> [warn]::
> [warn] 
> [warn]Note: Unresolved dependencies path:
> [warn]com.typesafe.genjavadoc:genjavadoc-plugin_2.12.6:0.10 
> (/home/ishizaki/Spark/PR/scala212/spark/project/SparkBuild.scala#L118)
> [warn]  +- org.apache.spark:spark-tags_2.12:2.4.0-SNAPSHOT
> sbt.ResolveException: unresolved dependency: 
> com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found
>   at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320)
>   at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
>   at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
>   at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
>   at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
>   at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
>   at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
>   at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
>   at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
>   at 
> xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
>   at 
> xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
>   at xsbt.boot.Using$.withResource(Using.scala:10)
>   at xsbt.boot.Using$.apply(Using.scala:9)
>   at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
>   at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
>   at xsbt.boot.Locks$.apply0(Locks.scala:31)
>   at xsbt.boot.Locks$.apply(Locks.scala:28)
>   at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
>   at sbt.IvySbt.withIvy(Ivy.scala:128)
>   at sbt.IvySbt.withIvy(Ivy.scala:125)
>   at sbt.IvySbt$Module.withModule(Ivy.scala:156)
>   at sbt.IvyActions$.updateEither(IvyActions.scala:168)
>   at 
> sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555)
>   at 
> sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551)
>   at 
> sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586)
>   at 
> sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584)
>   at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
>   at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589)
>   at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583)
>   at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
>   at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606)
>   at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533)
>   at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485)
>   at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
>   at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
>   at sbt.std.Transform$$anon$4.work(System.scala:63)
>   at 
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
>   at 
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
>   at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
>   at sbt.Execute.work(Execute.scala:237)
>   

[jira] [Created] (SPARK-25041) genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12

2018-08-07 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25041:


 Summary: genjavadoc-plugin_2.12.6 is not found with sbt in 
scala-2.12
 Key: SPARK-25041
 URL: https://issues.apache.org/jira/browse/SPARK-25041
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


When the master branch is built with sbt in scala-2.12, the following error occurs:

{code}
[warn]  module not found: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10
[warn]  public: tried
[warn]   
https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom
[warn]  Maven2 Local: tried
[warn]   
file:/gsa/jpngsa/home/i/s/ishizaki/.m2/repository/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom
[warn]  local: tried
[warn]   
/gsa/jpngsa/home/i/s/ishizaki/.ivy2/local/com.typesafe.genjavadoc/genjavadoc-plugin_2.12.6/0.10/ivys/ivy.xml
[info] Resolving jline#jline;2.14.3 ...
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found
[warn]  ::
[warn] 
[warn]  Note: Unresolved dependencies path:
[warn]  com.typesafe.genjavadoc:genjavadoc-plugin_2.12.6:0.10 
(/home/ishizaki/Spark/PR/scala212/spark/project/SparkBuild.scala#L118)
[warn]+- org.apache.spark:spark-tags_2.12:2.4.0-SNAPSHOT
sbt.ResolveException: unresolved dependency: 
com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at 
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
at sbt.IvySbt.withIvy(Ivy.scala:128)
at sbt.IvySbt.withIvy(Ivy.scala:125)
at sbt.IvySbt$Module.withModule(Ivy.scala:156)
at sbt.IvyActions$.updateEither(IvyActions.scala:168)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at 

[jira] [Created] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-06 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25036:


 Summary: Scala 2.12 issues: Compilation error with sbt
 Key: SPARK-25036
 URL: https://issues.apache.org/jira/browse/SPARK-25036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.4.0
Reporter: Kazuaki Ishizaki


When compiling with sbt, the following errors occur:

There are two types:
1. {{ExprValue.isNull}} is compared with an unexpected type.
2. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.

The first one is more serious since it may also generate incorrect code in 
Spark 2.3.

{code}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
Boolean = (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn] (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
 match may not be exhaustive.
[error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
ArrayData()), (_, _)
[error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
 match may not be exhaustive.
[error] It would fail on the following inputs: NewFunctionSpec(_, None, 
Some(_)), NewFunctionSpec(_, Some(_), None)
[error] [warn] newFunction match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely always compare unequal
[error] [warn] if (eval.isNull != "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]  if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn] if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
 match may not be exhaustive.
[error] It would fail on the following input: Schema((x: 
org.apache.spark.sql.types.DataType forSome x not in 
org.apache.spark.sql.types.StructType), _)
[error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
[error] [warn] 
{code}
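
For illustration, here is a minimal, self-contained Scala sketch of the first (more serious) category; the class below is a simplified stand-in, not Spark's actual {{ExprValue}}. Comparing such an object with a {{String}} via {{==}} can never succeed, so guards like {{if (eval.isNull == "true")}} silently stop firing, which is exactly what the "unrelated types" warnings above point at. The second category is the ordinary non-exhaustive {{match}} warning, addressed by covering the missing cases or adding a wildcard.

{code:scala}
// Minimal sketch with illustrative names only (not Spark's real codegen classes).
// A wrapper around a generated Java code fragment, similar in spirit to ExprValue:
class FakeExprValue(val code: String) {
  override def toString: String = code
}

object IsNullComparisonSketch {
  def main(args: Array[String]): Unit = {
    val isNull = new FakeExprValue("true")

    // Wrong: compares a FakeExprValue with a String, which is always false,
    // so the "is null" branch is silently skipped.
    if (isNull == "true") println("never reached")

    // Correct: compare the underlying code string instead.
    if (isNull.code == "true") println("null branch taken")

    // Second category: a wildcard case makes the match exhaustive.
    val sum = (Option(1), Option(2)) match {
      case (Some(a), Some(b)) => a + b
      case _                  => 0
    }
    println(sum)
  }
}
{code}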



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors

2018-08-06 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570440#comment-16570440
 ] 

Kazuaki Ishizaki commented on SPARK-25029:
--

[~srowen][~skonto] Thank you for your investigations while I set up a scala-2.12 environment (I still get compilation errors with scala-2.12 using sbt).

I understand the situation now. It is related to {{default}} methods. We may have to update the method lookup algorithm in Janino to take {{default}} methods into account.
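
For background, here is a minimal Scala sketch (illustrative names; the real case involves {{scala.collection.TraversableOnce.size()}}) of the language-level change behind this: in Scala 2.12 a concrete method defined in a trait is compiled to a Java {{default}} method on the generated interface, so callers see a non-abstract method on both the interface and the implementing class, which is what Janino's "Two non-abstract methods ... have the same parameter types" error complains about.

{code:scala}
// Sketch of the Scala 2.12 change that can confuse Janino's method lookup.
trait Sized {
  // In Scala 2.11 this body is emitted as a static helper in Sized$class;
  // in Scala 2.12 it becomes a *default* method on the Sized interface.
  def size(): Int = 0
}

class Bag extends Sized {
  override def size(): Int = 42
}

object DefaultMethodSketch {
  def main(args: Array[String]): Unit = {
    // scalac/javac resolve this call without trouble, but an older Janino
    // method-lookup algorithm treated the interface default method and the
    // class method as two equally specific non-abstract candidates.
    val b: Sized = new Bag
    println(b.size())
  }
}
{code}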



> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods 
> ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> We actually still have some test failures in the Scala 2.12 build. There seem 
> to be two types. First are that some tests fail with "TaskNotSerializable" 
> because some code construct now captures a reference to scalatest's 
> AssertionHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
> *** FAILED *** java.io.NotSerializableException: 
> org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not 
> serializable (class: org.scalatest.Assertions$AssertionsHelper, value: 
> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear if 
> something about closure cleaning in 2.12 could be improved to detect this 
> situation automatically; given that yet only a handful of tests fail for this 
> reason, it's unlikely to be a systemic problem.
>  
> The other error is curioser. Janino fails to compile generate code in many 
> cases with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
> java.lang.RuntimeException: Error while encoding: 
> org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type{code}
>  
> I include the full generated code that failed in one case below. There is no 
> {{size()}} in the generated code. It's got to be down to some difference in 
> Scala 2.12, potentially even a Janino problem.
>  
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Two non-abstract methods "public int 
> scala.collection.TraversableOnce.size()" have the same parameter types, 
> declaring type and return type
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
> ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract 
> methods "public int scala.collection.TraversableOnce.size()" have the same 
> parameter types, declaring type and return type
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
> at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
> at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
> at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094)
> at 

[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors

2018-08-05 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569621#comment-16569621
 ] 

Kazuaki Ishizaki commented on SPARK-25029:
--

[~srowen] I see. The following parts of the generated code call the method. I will look into it. My first impression is that the problem may be in the Scala collection library or the Catalyst Java code generator.

{code}
...
/* 146 */ final int length_1 = MapObjects_loopValue140.size();
...
/* 315 */ final int length_0 = MapObjects_loopValue140.size();
...
{code}


> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods 
> ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> We actually still have some test failures in the Scala 2.12 build. There seem 
> to be two types. First are that some tests fail with "TaskNotSerializable" 
> because some code construct now captures a reference to scalatest's 
> AssertionHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
> *** FAILED *** java.io.NotSerializableException: 
> org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not 
> serializable (class: org.scalatest.Assertions$AssertionsHelper, value: 
> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear if 
> something about closure cleaning in 2.12 could be improved to detect this 
> situation automatically; given that yet only a handful of tests fail for this 
> reason, it's unlikely to be a systemic problem.
>  
> The other error is curioser. Janino fails to compile generate code in many 
> cases with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
> java.lang.RuntimeException: Error while encoding: 
> org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type{code}
>  
> I include the full generated code that failed in one case below. There is no 
> {{size()}} in the generated code. It's got to be down to some difference in 
> Scala 2.12, potentially even a Janino problem.
>  
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Two non-abstract methods "public int 
> scala.collection.TraversableOnce.size()" have the same parameter types, 
> declaring type and return type
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
> ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract 
> methods "public int scala.collection.TraversableOnce.size()" have the same 
> parameter types, declaring type and return type
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
> at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
> at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
> at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
> at 
> 

[jira] [Created] (SPARK-24962) refactor CodeGenerator.createUnsafeArray

2018-07-29 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24962:


 Summary: refactor CodeGenerator.createUnsafeArray
 Key: SPARK-24962
 URL: https://issues.apache.org/jira/browse/SPARK-24962
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


{{CodeGenerator.createUnsafeArray()}} generates code for allocating {{UnsafeArrayData}}. This method could be extended to generate code for allocating either {{UnsafeArrayData}} or {{GenericArrayData}}.
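
A rough sketch of the intended shape of such a refactoring is below (hypothetical names and emitted fragments; the real helper lives in Catalyst's {{CodeGenerator}} and splices Java source strings into the generated class):

{code:scala}
// Hypothetical sketch: parameterize array-allocation codegen by the target
// representation instead of hard-coding UnsafeArrayData.
sealed trait ArrayAllocation
case object UnsafeArrayAlloc  extends ArrayAllocation // flat, primitive-friendly
case object GenericArrayAlloc extends ArrayAllocation // boxed Object[]-backed

object CreateArraySketch {
  // Returns an illustrative Java fragment for the generated code; the exact
  // allocation calls in Spark differ and are not reproduced here.
  def createArrayCode(arrayName: String, numElements: String,
                      alloc: ArrayAllocation): String = alloc match {
    case UnsafeArrayAlloc =>
      s"/* allocate UnsafeArrayData '$arrayName' with $numElements elements */"
    case GenericArrayAlloc =>
      s"/* allocate GenericArrayData '$arrayName' with $numElements elements */"
  }

  def main(args: Array[String]): Unit = {
    println(createArrayCode("arr", "numElements", UnsafeArrayAlloc))
    println(createArrayCode("arr", "numElements", GenericArrayAlloc))
  }
}
{code}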



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560600#comment-16560600
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

[~ericfchang] Thank you very much for your suggestion. As a first step, I created [a PR|https://github.com/apache/spark/pull/21905] to upgrade Maven.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
>   org.apache.spark
>   spark-mllib-local_2.11
>   2.4.0-SNAPSHOT
>   
> 
>   20180723.232411
>   177
> 
> 20180723232411
> 
>   
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
>   
>   
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
>   
>   
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
>   
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
>   
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
> 
>   
> 
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-27 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24956:


 Summary: Upgrade maven from 3.3.9 to 3.5.4
 Key: SPARK-24956
 URL: https://issues.apache.org/jira/browse/SPARK-24956
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


Maven 3.3.9 is pretty old. It would be good to upgrade it to the latest release.

As suggested in SPARK-24895, the current Maven version runs into problems with some plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559987#comment-16559987
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

I see. Thank you very much. As a first step, I will make a PR to upgrade Maven.

BTW, I have no idea how to make sure the Maven central repo works well for now.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
>   org.apache.spark
>   spark-mllib-local_2.11
>   2.4.0-SNAPSHOT
>   
> 
>   20180723.232411
>   177
> 
> 20180723232411
> 
>   
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
>   
>   
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
>   
>   
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
>   
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
>   
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
> 
>   
> 
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559974#comment-16559974
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

[~yhuai] Thank you.

BTW, how can I re-enable spotbugs without hitting this problem? Do you have any suggestions? cc: [~hyukjin.kwon]

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
>   org.apache.spark
>   spark-mllib-local_2.11
>   2.4.0-SNAPSHOT
>   
> 
>   20180723.232411
>   177
> 
> 20180723232411
> 
>   
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
>   
>   
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
>   
>   
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
>   
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
>   
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
> 
>   
> 
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559972#comment-16559972
 ] 

Kazuaki Ishizaki commented on SPARK-24925:
--

Do we need a new test case, or is there an existing test case that covers this PR?

> input bytesRead metrics fluctuate from time to time
> ---
>
> Key: SPARK-24925
> URL: https://issues.apache.org/jira/browse/SPARK-24925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
> Attachments: bytesRead.gif
>
>
> input bytesRead metrics fluctuate from time to time, it is worse when 
> pushdown enabled.
> Query
> {code:java}
> CREATE TABLE dev AS
> SELECT
> ...
> FROM lstg_item cold, lstg_item_vrtn v
> WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> ...
> {code}
> Issue
> See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 
> 54GB, 53GB ... 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24841) Memory leak in converting spark dataframe to pandas dataframe

2018-07-22 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552231#comment-16552231
 ] 

Kazuaki Ishizaki commented on SPARK-24841:
--

Thank you for reporting this issue along with the heap profiling data. Would it be possible to post a standalone program that can reproduce the problem?

> Memory leak in converting spark dataframe to pandas dataframe
> -
>
> Key: SPARK-24841
> URL: https://issues.apache.org/jira/browse/SPARK-24841
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Running PySpark in standalone mode
>Reporter: Piyush Seth
>Priority: Minor
>
> I am running a continuous running application using PySpark. In one of the 
> operations I have to convert PySpark data frame to Pandas data frame using 
> toPandas API  on pyspark driver. After running for a while I am getting 
> "java.lang.OutOfMemoryError: GC overhead limit exceeded" error.
> I tried running this in a loop and could see that the heap memory is 
> increasing continuously. When I ran jmap for the first time I had the 
> following top rows:
>  num #instances #bytes  class name
> --
>    1:  1757  411477568  [J
> {color:#FF}   *2:    124188  266323152  [C*{color}
>    3:    167219   46821320  org.apache.spark.status.TaskDataWrapper
>    4: 69683   27159536  [B
>    5:    359278    8622672  java.lang.Long
>    6:    221808    7097856  
> java.util.concurrent.ConcurrentHashMap$Node
>    7:    283771    6810504  scala.collection.immutable.$colon$colon
> After running several iterations I had the following
>  num #instances #bytes  class name
> --
> {color:#FF}   *1:    110760 3439887928  [C*{color}
>    2:   698  411429088  [J
>    3:    238096   6880  org.apache.spark.status.TaskDataWrapper
>    4: 68819   24050520  [B
>    5:    498308   11959392  java.lang.Long
>    6:    292741    9367712  
> java.util.concurrent.ConcurrentHashMap$Node
>    7:    282878    6789072  scala.collection.immutable.$colon$colon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


