[GitHub] spark pull request #19440: [SPARK-21871][SQL] Fix infinite loop when bytecod...

2017-10-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19440


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19440: [SPARK-21871][SQL] Fix infinite loop when bytecod...

2017-10-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19440#discussion_r143094785
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -392,12 +392,16 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
 // Check if compiled code has a too large function
 if (maxCodeSize > sqlContext.conf.hugeMethodLimit) {
-  logWarning(s"Found too long generated codes and JIT optimization 
might not work: " +
-s"the bytecode size was $maxCodeSize, this value went over the 
limit " +
+  logInfo(s"Found too long generated codes and JIT optimization might 
not work: " +
+s"the bytecode size ($maxCodeSize) is above the limit " +
 s"${sqlContext.conf.hugeMethodLimit}, and the whole-stage codegen 
was disabled " +
 s"for this plan. To avoid this, you can raise the limit " +
-s"${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}:\n$treeString")
-  return child.execute()
+s"`${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}`:\n$treeString")
+  child match {
+// For batch file source scan, we should continue executing it
+case f: FileSourceScanExec if f.supportsBatch => // do nothing
--- End diff --

yea, I totally agree that we need to refactor this in future. Anyway, it's 
ok for now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19440: [SPARK-21871][SQL] Fix infinite loop when bytecod...

2017-10-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19440#discussion_r143092189
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -392,12 +392,16 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
 // Check if compiled code has a too large function
 if (maxCodeSize > sqlContext.conf.hugeMethodLimit) {
-  logWarning(s"Found too long generated codes and JIT optimization 
might not work: " +
-s"the bytecode size was $maxCodeSize, this value went over the 
limit " +
+  logInfo(s"Found too long generated codes and JIT optimization might 
not work: " +
+s"the bytecode size ($maxCodeSize) is above the limit " +
 s"${sqlContext.conf.hugeMethodLimit}, and the whole-stage codegen 
was disabled " +
 s"for this plan. To avoid this, you can raise the limit " +
-s"${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}:\n$treeString")
-  return child.execute()
+s"`${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}`:\n$treeString")
+  child match {
+// For batch file source scan, we should continue executing it
--- End diff --

It's better to explain why we should continue it. Otherwise later readers 
may not understand it immediately.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19440: [SPARK-21871][SQL] Fix infinite loop when bytecod...

2017-10-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19440#discussion_r143091966
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -185,4 +185,22 @@ class WholeStageCodegenSuite extends SparkPlanTest 
with SharedSQLContext {
 val (_, maxCodeSize2) = CodeGenerator.compile(codeWithLongFunctions)
 assert(maxCodeSize2 > 
SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.defaultValue.get)
   }
+
+  test("returning batch for wide table") {
+import testImplicits._
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+  val df = spark.range(10).select(Seq.tabulate(201) {i => ('id + 
i).as(s"c$i")} : _*)
+  df.write.mode(SaveMode.Overwrite).parquet(path)
+
+  withSQLConf(SQLConf.WHOLESTAGE_MAX_NUM_FIELDS.key -> "202",
+SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key -> "8000") {
+// donot return batch, because whole stage codegen is disabled for 
wide table (>202 columns)
--- End diff --

this is copied and pasted. will fix it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19440: [SPARK-21871][SQL] Fix infinite loop when bytecod...

2017-10-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19440#discussion_r143091744
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -185,4 +185,22 @@ class WholeStageCodegenSuite extends SparkPlanTest 
with SharedSQLContext {
 val (_, maxCodeSize2) = CodeGenerator.compile(codeWithLongFunctions)
 assert(maxCodeSize2 > 
SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.defaultValue.get)
   }
+
+  test("returning batch for wide table") {
+import testImplicits._
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+  val df = spark.range(10).select(Seq.tabulate(201) {i => ('id + 
i).as(s"c$i")} : _*)
+  df.write.mode(SaveMode.Overwrite).parquet(path)
+
+  withSQLConf(SQLConf.WHOLESTAGE_MAX_NUM_FIELDS.key -> "202",
+SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key -> "8000") {
+// donot return batch, because whole stage codegen is disabled for 
wide table (>202 columns)
--- End diff --

Is this comment wrong or I misunderstand it? Looks like it returns batch as 
it asserts `supportsBatch`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19440: [SPARK-21871][SQL] Fix infinite loop when bytecod...

2017-10-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19440#discussion_r143089714
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -392,12 +392,16 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
 // Check if compiled code has a too large function
 if (maxCodeSize > sqlContext.conf.hugeMethodLimit) {
-  logWarning(s"Found too long generated codes and JIT optimization 
might not work: " +
-s"the bytecode size was $maxCodeSize, this value went over the 
limit " +
+  logInfo(s"Found too long generated codes and JIT optimization might 
not work: " +
+s"the bytecode size ($maxCodeSize) is above the limit " +
 s"${sqlContext.conf.hugeMethodLimit}, and the whole-stage codegen 
was disabled " +
 s"for this plan. To avoid this, you can raise the limit " +
-s"${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}:\n$treeString")
-  return child.execute()
+s"`${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}`:\n$treeString")
+  child match {
+// For batch file source scan, we should continue executing it
+case f: FileSourceScanExec if f.supportsBatch => // do nothing
--- End diff --

If we do it in `FileSourceScanExec `, we are unable to know which causes 
the fallback. Now, we have at least two reasons that trigger the fallback. 

Ideally, we should not call `WholeStageCodegenExec ` in `doExecute`. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19440: [SPARK-21871][SQL] Fix infinite loop when bytecod...

2017-10-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19440#discussion_r143086096
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -392,12 +392,16 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 
 // Check if compiled code has a too large function
 if (maxCodeSize > sqlContext.conf.hugeMethodLimit) {
-  logWarning(s"Found too long generated codes and JIT optimization 
might not work: " +
-s"the bytecode size was $maxCodeSize, this value went over the 
limit " +
+  logInfo(s"Found too long generated codes and JIT optimization might 
not work: " +
+s"the bytecode size ($maxCodeSize) is above the limit " +
 s"${sqlContext.conf.hugeMethodLimit}, and the whole-stage codegen 
was disabled " +
 s"for this plan. To avoid this, you can raise the limit " +
-s"${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}:\n$treeString")
-  return child.execute()
+s"`${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}`:\n$treeString")
+  child match {
+// For batch file source scan, we should continue executing it
+case f: FileSourceScanExec if f.supportsBatch => // do nothing
--- End diff --

I feel a little weird `WholeStageCodegenExec` has specific error handling 
for each spark plan. Could we handle this error inside `FileSourceScanExec`? 
For example, how about checking if `parent.isInstanceOf[WholeStageCodegenExec]` 
in `FileSourceScanExec`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19440: [SPARK-21871][SQL] Fix infinite loop when bytecod...

2017-10-05 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/19440

[SPARK-21871][SQL] Fix infinite loop when bytecode size is larger than 
spark.sql.codegen.hugeMethodLimit

## What changes were proposed in this pull request?
When exceeding `spark.sql.codegen.hugeMethodLimit`, the runtime fallbacks 
to the Volcano iterator solution. This could cause an infinite loop when 
`FileSourceScanExec` can use the columnar batch to read the data. This PR is to 
fix the issue.

## How was this patch tested?
Added a test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark testt

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19440.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19440


commit 6e2f53238856942e40a3301108f37a3a5cc17bca
Author: gatorsmile 
Date:   2017-10-05T18:20:48Z

fix.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org