[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...

2018-11-19 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/23079#discussion_r234534798
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicateSuite.scala
 ---
@@ -298,6 +299,45 @@ class ReplaceNullWithFalseSuite extends PlanTest {
 testProjection(originalExpr = column, expectedExpr = column)
   }
 
+  test("replace nulls in lambda function of ArrayFilter") {
+val cond = GreaterThan(UnresolvedAttribute("e"), Literal(0))
--- End diff --

Updated.


---




[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...

2018-11-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/23079#discussion_r234508866
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicateSuite.scala
 ---
@@ -298,6 +299,45 @@ class ReplaceNullWithFalseSuite extends PlanTest {
 testProjection(originalExpr = column, expectedExpr = column)
   }
 
+  test("replace nulls in lambda function of ArrayFilter") {
+val cond = GreaterThan(UnresolvedAttribute("e"), Literal(0))
--- End diff --

Actually I intentionally made all three lambdas the same (the `MapFilter` one only differs in the lambda parameter). I can encapsulate this lambda function into a test utility function. Let me update the PR and see what you think.


---




[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...

2018-11-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/23079#discussion_r234508561
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -767,6 +767,15 @@ object ReplaceNullWithFalse extends Rule[LogicalPlan] {
   replaceNullWithFalse(cond) -> value
 }
 cw.copy(branches = newBranches)
+  case af @ ArrayFilter(_, lf @ LambdaFunction(func, _, _)) =>
--- End diff --

I'm not sure whether that's useful. First of all, the `replaceNullWithFalse` handling doesn't apply to all higher-order functions; in fact it only applies to a very narrow set, namely those where a lambda function returns `BooleanType` and is immediately used as a predicate. So having a generic utility would certainly make this PR slightly simpler, but I don't know how useful it would be for other cases.
I'd prefer to wait for more such transformation cases before introducing a new utility for the pattern (a hypothetical sketch of what it could look like is below). WDYT?
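
For what it's worth, such a helper could be as small as the following. This is a hypothetical sketch only: the name `replaceNullInPredicateLambda` is made up, and it assumes it sits inside the rule object next to the existing `replaceNullWithFalse` method.
```scala
// Hypothetical utility, not part of this PR: factor out the
// "boolean-returning lambda used as a predicate" rewrite if more cases appear later.
private def replaceNullInPredicateLambda(lf: LambdaFunction): LambdaFunction =
  lf.copy(function = replaceNullWithFalse(lf.function))
```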


---




[GitHub] spark issue #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInPredicat...

2018-11-18 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/23079
  
cc @aokolnychyi : I'd like to propose renaming the rule you introduced to add an `-InPredicate` suffix, because obviously we can't replace arbitrary `null`s with `false`, only the ones that are going to be used directly in a boolean predicate context (e.g. `If(cond, _, _)`, `Filter`, etc. that you've already nicely identified; see the illustration below). Does that make sense to you?
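
For illustration (a minimal sketch in Catalyst terms, not code from this PR): a `null` used as a predicate behaves exactly like `false`, while a `null` produced as a value must stay `null`.
```scala
import org.apache.spark.sql.catalyst.expressions.{If, Literal}
import org.apache.spark.sql.types.BooleanType

// Safe: a null condition selects the false branch, exactly as a false condition would.
val original  = If(Literal(null, BooleanType), Literal(1), Literal(2)) // evaluates to 2
val rewritten = If(Literal(false), Literal(1), Literal(2))             // also evaluates to 2

// Not safe: a null in a value position (e.g. a projected column) must remain null;
// rewriting it to false there would change query results, hence the narrower rule name.
```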

BTW thank you very much for introducing that rule. It's really neat!


---




[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...

2018-11-18 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/23079

[SPARK-26107][SQL] Extend ReplaceNullWithFalseInPredicate to support 
higher-order functions: ArrayExists, ArrayFilter, MapFilter

## What changes were proposed in this pull request?

Extend the `ReplaceNullWithFalse` optimizer rule introduced in SPARK-25860 (https://github.com/apache/spark/pull/22857) to also optimize predicates in the lambda functions of the higher-order functions `ArrayExists`, `ArrayFilter` and `MapFilter`.

Also rename the rule to `ReplaceNullWithFalseInPredicate` to better reflect 
its intent.
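
The gist of the new cases is sketched below; this is only a sketch of the pattern (see the diff for the exact code), and it assumes the rule's existing `replaceNullWithFalse` helper and the Catalyst expression imports already used by the rule:
```scala
case af @ ArrayFilter(_, lf @ LambdaFunction(func, _, _)) =>
  af.copy(function = lf.copy(function = replaceNullWithFalse(func)))
case ae @ ArrayExists(_, lf @ LambdaFunction(func, _, _)) =>
  ae.copy(function = lf.copy(function = replaceNullWithFalse(func)))
case mf @ MapFilter(_, lf @ LambdaFunction(func, _, _)) =>
  mf.copy(function = lf.copy(function = replaceNullWithFalse(func)))
```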

## How was this patch tested?

Added new unit test cases to the `ReplaceNullWithFalseInPredicateSuite` 
(renamed from `ReplaceNullWithFalseSuite`).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark catalyst-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23079


commit 710c8862b3138f6146fe2309d6379707f8d4ac14
Author: Kris Mok 
Date:   2018-11-18T09:09:53Z

Extend ReplaceNullWithFalseInPredicate to support higher-order functions: 
ArrayExists, ArrayFilter, MapFilter




---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-11-16 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r234385204
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -17,17 +17,33 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
+import java.util.concurrent.ExecutionException
+
 import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeAndComment, 
CodeGenerator}
 import org.apache.spark.sql.catalyst.plans.PlanTestBase
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.types.{IntegerType, LongType}
+import org.apache.spark.sql.types.IntegerType
 
 class CodeGeneratorWithInterpretedFallbackSuite extends SparkFunSuite with 
PlanTestBase {
 
-  test("UnsafeProjection with codegen factory mode") {
-val input = Seq(LongType, IntegerType)
-  .zipWithIndex.map(x => BoundReference(x._2, x._1, true))
+  object FailedCodegenProjection
+  extends CodeGeneratorWithInterpretedFallback[Seq[Expression], 
UnsafeProjection] {
+
+override protected def createCodeGeneratedObject(in: Seq[Expression]): 
UnsafeProjection = {
+  val invalidCode = new CodeAndComment("invalid code", Map.empty)
+  // We assume this compilation throws an exception
--- End diff --

The suggested change is only for making this test suite cleaner, right? In that case I'd +1 the suggestion of being able to clearly check that we're catching the exception we know we're throwing.
Would you like to submit a PR for it?

> rather than returning `null`

The intent is never to actually reach the null-return, but to always cause an exception to be thrown at `CodeGenerator.compile()` and abruptly return to the caller with that exception. To make the compiler happy you have to end the function with some definite-returning statement, so a useless null-return would probably have to be there anyway (the compiler can't tell that you'll always be throwing an exception unless you throw inline).
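
To make the "throw inline" alternative concrete, here is a sketch (not the PR's code; the exception type is an arbitrary choice) of what the test helper could look like without the placeholder null-return:
```scala
override protected def createCodeGeneratedObject(in: Seq[Expression]): UnsafeProjection = {
  val invalidCode = new CodeAndComment("invalid code", Map.empty)
  CodeGenerator.compile(invalidCode) // expected to throw before we get any further
  // every path now ends in a throw, so no trailing null is needed to satisfy the compiler
  throw new IllegalStateException("CodeGenerator.compile() was expected to fail")
}
```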


---




[GitHub] spark pull request #22976: [SPARK-25974][SQL]Optimizes Generates bytecode fo...

2018-11-15 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22976#discussion_r233771269
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
 ---
@@ -68,62 +68,55 @@ object GenerateOrdering extends 
CodeGenerator[Seq[SortOrder], Ordering[InternalR
 genComparisons(ctx, ordering)
   }
 
+  /**
+   * Creates the variables for ordering based on the given order.
+   */
+  private def createOrderKeys(
+  ctx: CodegenContext,
+  row: String,
+  ordering: Seq[SortOrder]): Seq[ExprCode] = {
+ctx.INPUT_ROW = row
+// to use INPUT_ROW we must make sure currentVars is null
+ctx.currentVars = null
--- End diff --

I thought about making the same suggestion, but on second thought it's fine to just leave it here as well. The overhead isn't going to be high, and it's nice to see how `ctx.INPUT_ROW` and `ctx.currentVars` interact on adjacent lines.

So I'm okay either way (as-is or with @viirya 's suggestion).


---




[GitHub] spark issue #22976: [SPARK-25974][SQL]Optimizes Generates bytecode for order...

2018-11-15 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22976
  
+1 LGTM, thanks for making this improvement!


---




[GitHub] spark issue #23032: [WIP][SPARK-26061][SQL][MINOR] Reduce the number of unus...

2018-11-14 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/23032
  
Ha, it's failing an Avro test case due to:
```
Cause: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 65, Column 33: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 65, Column 33: Expression "serializefromobject_isNull_1" is not an rvalue
```
It can be reproduced by `build/sbt -Pavro "avro/test-only *AvroSuite -- -z partitioned"`.

So some state must be getting mutated in between: moving the `prepareRowVar()` call down by a few lines apparently stomped on a side effect that something else depended on.


---




[GitHub] spark issue #23032: [SPARK-26061][SQL][MINOR] Reduce the number of unused Un...

2018-11-14 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/23032
  
cc @viirya @maropu @kiszk 


---




[GitHub] spark pull request #23032: [SPARK-XXXXX][SQL][MINOR] Reduce the number of un...

2018-11-14 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/23032

[SPARK-X][SQL][MINOR] Reduce the number of unused UnsafeRowWriters 
created in whole-stage codegen

## What changes were proposed in this pull request?

Reduce the number of unused `UnsafeRowWriter`s created in whole-stage generated code.
They come from `CodegenSupport.consume()` calling `prepareRowVar()`, which uses `GenerateUnsafeProjection.createCode()` and registers an `UnsafeRowWriter` mutable state, regardless of whether the downstream (parent) operator will use the `rowVar`. Even when the downstream `doConsume` function doesn't use the `rowVar` (i.e. doesn't put `row.code` into this operator's codegen template), the registered `UnsafeRowWriter` stays there, which bloats the init function of the generated code.

This PR doesn't heal the root issue, but it makes it slightly less painful: when the `doConsume` function is split out, `prepareRowVar()` is called twice, so the pain of unused `UnsafeRowWriter`s is doubled. This PR simply moves the original call to `prepareRowVar()` down into the `doConsume` split/no-split branches so that we're back to just 1x the pain.

To fix the root issue, we would need something that allows `CodegenSupport` operators to indicate whether they're going to use the `rowVar`. That's a much more elaborate change, so I'd like to make this minor fix first.

e.g. for this query: `spark.range(10).as[Long].map(x => x + 1).map(x => x * 
x + 2)`
**Before** (in Spark 2.4.0):
```java
/* 023 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 024 */ partitionIndex = index;
/* 025 */ this.inputs = inputs;
/* 026 */
/* 027 */ range_taskContext_0 = TaskContext.get();
/* 028 */ range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 029 */ range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */ range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */ range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 032 */ range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 033 */ range_mutableStateArray_0[4] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 034 */ range_mutableStateArray_0[5] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 035 */ range_mutableStateArray_0[6] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 036 */ range_mutableStateArray_0[7] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 037 */ range_mutableStateArray_0[8] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 038 */
/* 039 */   }
```
9 `UnsafeRowWriter`s created, 1 actually used.

**After**:
```java
/* 023 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 024 */ partitionIndex = index;
/* 025 */ this.inputs = inputs;
/* 026 */
/* 027 */ range_taskContext_0 = TaskContext.get();
/* 028 */ range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 029 */ deserializetoobject_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */ deserializetoobject_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */ deserializetoobject_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 032 */ deserializetoobject_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 033 */ deserializetoobject_mutableStateArray_0[4] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 034 */
/* 035 */   }
```
5 `UnsafeRowWriter`s created, 1 actually used.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark wscg-reduce-rowwriter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23032.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23032


commit e48ec9da9977846213044c5f6d093202ed0efd60
Author: Kris Mok 
Date:   2018-11-14T09:22:11Z

Reduce the 

[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22847
  
Just in case people wonder, the following is the hack patch that I used for 
stress testing code splitting before this PR:
```diff
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -647,11 +647,13 @@ class CodegenContext(val useStreamlining: Boolean) {
    * Returns a term name that is unique within this instance of a `CodegenContext`.
    */
   def freshName(name: String): String = synchronized {
-    val fullName = if (freshNamePrefix == "") {
+    // hack: intentionally add a very long prefix (length=300 characters) to
+    // trigger code splitting more frequently
+    val fullName = ("averylongprefix" * 20) + (if (freshNamePrefix == "") {
       name
     } else {
       s"${freshNamePrefix}_$name"
-    }
+    })
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
```
Of course, now with this PR we can simply set the split threshold to a very low value (e.g. `1`) to force splitting, for example:
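
A minimal way to do that (assuming an active `SparkSession` named `spark`; the conf key is the one this PR adds):
```scala
// force very aggressive method splitting, useful for stress-testing the split path
spark.conf.set("spark.sql.codegen.methodSplitThreshold", "1")
```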


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r229943260
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The threshold of source code length without comment of a single 
Java function by " +
+  "codegen to be split. When the generated Java function source code 
exceeds this threshold" +
+  ", it will be split into multiple small functions. We can't know how 
many bytecode will " +
+  "be generated, so use the code length as metric. A function's 
bytecode should not go " +
+  "beyond 8KB, otherwise it will not be JITted; it also should not be 
too small, otherwise " +
+  "there will be many function calls.")
+.intConf
--- End diff --

Oh I see, you're using the column name...that's not the right place to put 
the "prefix". Column names are almost never carried over to the generated code 
in the current framework (the only exception is the lambda variable name).


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r229942325
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The threshold of source code length without comment of a single 
Java function by " +
+  "codegen to be split. When the generated Java function source code 
exceeds this threshold" +
+  ", it will be split into multiple small functions. We can't know how 
many bytecode will " +
+  "be generated, so use the code length as metric. A function's 
bytecode should not go " +
+  "beyond 8KB, otherwise it will not be JITted; it also should not be 
too small, otherwise " +
+  "there will be many function calls.")
+.intConf
--- End diff --

The "freshNamePrefix" prefix is only applied in whole-stage codegen, 

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L87

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L169

It has no effect in non-whole-stage codegen.

If you intend to stress test expression codegen but don't see the prefix 
being prepended, you're probably not adding it in the right place. Where did 
you add it?


---




[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22847
  
Can / should we apply the same threshold conf to 
`Expression.reduceCodeSize()`?
cc @yucai @cloud-fan @kiszk 


---




[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22847
  
Using source code length will have to be a very coarse-grained, "fuzzy" heuristic. It's not meant to be accurate, so just pick some number that makes sense. 2048 might be a good enough choice for now.

Tidbit: before this PR, since the `1024` split threshold was hard-coded, the way I stress-tested the code splitting path was to hack into `CodegenContext.freshName()` and artificially prepend a very long prefix to the symbol names, e.g. "project_value_1" becomes "averylongprefixrepeatedmultipletimes_project_value_1". That inflates the source code size a lot, which triggers code splitting a lot more frequently under the existing heuristics.
I'm mentioning this here to show that source-code-length-based heuristics are inherently imprecise. The rule will actually behave slightly differently for expressions in different physical operators because the prefix length differs: "serializefromobject_" is much longer than "project_", which is longer than "agg_", right?
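
A quick way to see the skew (plain string lengths, nothing Spark-specific; the names are just typical freshName-style prefixes):
```scala
Seq("serializefromobject_value_0", "project_value_0", "agg_value_0").map(s => s -> s.length)
// List((serializefromobject_value_0,27), (project_value_0,15), (agg_value_0,11))
```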

Side notes for those curious:
(details are way too complicated to be addressed by a source code length 
based heuristic, but I'm making this note just to let people know what's 
involved)

Performance-wise, it might be very interesting to note that, running on HotSpot, sometimes it's better to split into more, smaller methods, and sometimes it's better to split into fewer but larger methods.

The theory is this: if a method is really hot, we'd like it to eventually be inlined by the JIT to reduce method invocation overhead and get better optimization. In HotSpot, hot methods will eventually be compiled by C2, which uses multiple bytecode-size-based thresholds to determine whether a callee is fit for inlining.
```
MaxInlineSize = 35         // in bytes of bytecode. "The maximum bytecode size of a method to be inlined"
FreqInlineSize = 325       // in bytes of bytecode. "The maximum bytecode size of a frequent method to be inlined"
MaxTrivialSize = 6         // in bytes of bytecode. "The maximum bytecode size of a trivial method to be inlined"
InlineFrequencyRatio = 20  // "Ratio of call site execution to caller method invocation"
// there's also an InlineFrequencyCount which we'll omit here for simplicity
```
(values shown above are default values for C2 on x86.)
c.f. 
http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/tip/src/share/vm/runtime/globals.hpp

Don't be confused by the names, though.
`MaxTrivialSize` is meant for regular getters/setters, and those are considered always inlineable (unless explicitly configured as `dontinline` in a compiler directive option).
`MaxInlineSize` is for regular small methods, with more checks than the "trivial" case but without checking the invocation frequency.
`InlineFrequencyRatio` is the callee-to-caller invocation ratio for a call site to be considered "hot" (or "frequent"). The default is 20:1. This usually applies to call sites inside loops, where HotSpot can collect loop counts / invocation counts during profiling to figure out the ratio.
Similarly, there's an `InlineFrequencyCount`: if a call site (after scaling according to inline depth etc.) is invoked more than that many times, it can also be considered "hot".
`FreqInlineSize` is the max size at which a "hot" call site can be considered for inlining. It's 325 bytes of bytecode, obviously larger than `MaxInlineSize`, so it might sound confusing at first ;-)

So the deal is: on HotSpot, if you want to save bytecode size by
- Outlining a utility method that's <= 6 bytes: there's no overhead at all. But in practice this is not really feasible, because at the call site the "invokespecial" itself is 3 bytes, you'd need at least 1 byte to pass in "this", and passing arguments (e.g. INPUT_ROW) costs at least 1 or 2 more bytes. That already adds up to 4 to 5 bytes at the call site, whereas the callee's method body can only be <= 6 bytes, so there's not much you can do…
- Outlining a utility method that's <= 35 bytes: that's the easiest for both HotSpot C1 and C2 to inline. You can do something but not a lot in such a method. If you want to shoot for this case, then don't make the split threshold too big.
- Outlining a utility method that's <= 325 bytes: in HotSpot, only C2 gets to inline big methods like this, and only "hot" call sites are considered.

So, if a method is compute-intensive and small enough, it'd be better to strive to keep the split-off methods within the thresholds mentioned above.
Otherwise, if it's alrea

[GitHub] spark pull request #22506: [SPARK-25494][SQL] Upgrade Spark's use of Janino ...

2018-09-20 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/22506

[SPARK-25494][SQL] Upgrade Spark's use of Janino to 3.0.10

## What changes were proposed in this pull request?

This PR upgrades Spark's use of Janino from 3.0.9 to 3.0.10.
Note that 3.0.10 is an out-of-band release specifically for fixing an integer overflow issue in Janino's `ClassFile` reader. It is otherwise exactly the same as 3.0.9, so it's a low-risk and compatible upgrade.

The integer overflow issue affects Spark SQL's codegen stats collection: when a generated class file is huge, especially when its constant pool size is above `Short.MAX_VALUE`, Janino's `ClassFile` reader will throw an exception when Spark parses the generated class file to collect stats. So we'd miss the stats of some huge class files.

The related Janino issue is: 
https://github.com/janino-compiler/janino/issues/58

## How was this patch tested?

Existing codegen tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark upgrade-janino

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22506.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22506


commit c3f8a6b41c9409339bf62fb17cd4cd905853d97f
Author: Kris Mok 
Date:   2018-09-20T23:08:27Z

bump Janino to 3.0.10




---




[GitHub] spark issue #22355: [SPARK-25358][SQL] MutableProjection supports fallback t...

2018-09-18 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22355
  
LGTM as well.


---




[GitHub] spark pull request #22429: [SPARK-25440][SQL] Dumping query execution info t...

2018-09-17 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22429#discussion_r218235962
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala ---
@@ -189,23 +192,32 @@ class QueryExecution(val sparkSession: SparkSession, 
val logical: LogicalPlan) {
   """.stripMargin.trim
   }
 
+  private def writeOrError(writer: Writer)(f: Writer => Unit): Unit =
+try f(writer) catch { case e: AnalysisException => 
writer.write(e.toString) }
+
+  private def writePlans(writer: Writer): Unit = {
+writer.write("== Parsed Logical Plan ==\n")
+writeOrError(writer)(logical.treeString(_, verbose = true, addSuffix = 
false))
+writer.write("== Analyzed Logical Plan ==\n")
+writeOrError(writer) { writer =>
+  analyzed.output.foreach(o => writer.write(s"${o.name}: 
${o.dataType.simpleString}"))
--- End diff --

If you want to get the best out of this approach, it might be better to avoid string interpolation here and make explicit `writer.write` calls for the pieces you're interpolating.
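
Something along these lines (a sketch reusing the names from the diff above, not a definitive change):
```scala
analyzed.output.foreach { o =>
  writer.write(o.name)
  writer.write(": ")
  writer.write(o.dataType.simpleString)
}
```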


---




[GitHub] spark pull request #22355: [SPARK-25358][SQL] MutableProjection supports fal...

2018-09-12 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22355#discussion_r217016675
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.InternalRow
+
+
+/**
+ * A [[MutableProjection]] that is calculated by calling `eval` on each of 
the specified
+ * expressions.
+ *
+ * @param expressions a sequence of expressions that determine the value 
of each column of the
+ *output row.
+ */
+class InterpretedMutableProjection(expressions: Seq[Expression]) extends 
MutableProjection {
+  def this(expressions: Seq[Expression], inputSchema: Seq[Attribute]) =
+this(expressions.map(toBoundExprs(_, inputSchema)))
--- End diff --

```
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala:32: type mismatch;
[error]  found   : org.apache.spark.sql.catalyst.expressions.Expression
[error]  required: Seq[org.apache.spark.sql.catalyst.expressions.Expression]
[error] this(expressions.map(toBoundExprs(_, inputSchema)))
[error]   ^
```

It's probably `this(toBoundExprs(expressions, inputSchema))` right?


---




[GitHub] spark pull request #22355: [SPARK-25358][SQL] MutableProjection supports fal...

2018-09-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22355#discussion_r216525084
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.InternalRow
+
+
+/**
+ * A [[MutableProjection]] that is calculated by calling `eval` on each of 
the specified
+ * expressions.
+ *
+ * @param expressions a sequence of expressions that determine the value 
of each column of the
+ *output row.
+ */
+class InterpretedMutableProjection(expressions: Seq[Expression]) extends 
MutableProjection {
+  def this(expressions: Seq[Expression], inputSchema: Seq[Attribute]) =
+this(expressions.map(BindReferences.bindReference(_, inputSchema)))
--- End diff --

use `toBoundExpr`?


---




[GitHub] spark pull request #22355: [SPARK-25358][SQL] MutableProjection supports fal...

2018-09-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22355#discussion_r216524666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -86,24 +86,12 @@ package object expressions  {
   }
 
   /**
-   * Converts a [[InternalRow]] to another Row given a sequence of 
expression that define each
-   * column of the new row. If the schema of the input row is specified, 
then the given expression
-   * will be bound to that schema.
-   *
-   * In contrast to a normal projection, a MutableProjection reuses the 
same underlying row object
-   * each time an input row is added.  This significantly reduces the cost 
of calculating the
-   * projection, but means that it is not safe to hold on to a reference 
to a [[InternalRow]] after
-   * `next()` has been called on the [[Iterator]] that produced it. 
Instead, the user must call
-   * `InternalRow.copy()` and hold on to the returned [[InternalRow]] 
before calling `next()`.
+   * A helper function to bound given expressions to an input schema.
--- End diff --

Spelling nitpick: s/bound/bind/


---




[GitHub] spark pull request #22355: [SPARK-25358][SQL] MutableProjection supports fal...

2018-09-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22355#discussion_r216526434
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.InternalRow
+
+
+/**
+ * A [[MutableProjection]] that is calculated by calling `eval` on each of 
the specified
+ * expressions.
+ *
+ * @param expressions a sequence of expressions that determine the value 
of each column of the
+ *output row.
+ */
+class InterpretedMutableProjection(expressions: Seq[Expression]) extends 
MutableProjection {
+  def this(expressions: Seq[Expression], inputSchema: Seq[Attribute]) =
+this(expressions.map(BindReferences.bindReference(_, inputSchema)))
+
+  private[this] val buffer = new Array[Any](expressions.size)
+
+  override def initialize(partitionIndex: Int): Unit = {
+expressions.foreach(_.foreach {
+  case n: Nondeterministic => n.initialize(partitionIndex)
+  case _ =>
+})
+  }
+
+  private[this] val exprArray = expressions.toArray
+  private[this] var mutableRow: InternalRow = new 
GenericInternalRow(exprArray.length)
+  def currentValue: InternalRow = mutableRow
+
+  override def target(row: InternalRow): MutableProjection = {
+mutableRow = row
+this
+  }
+
+  override def apply(input: InternalRow): InternalRow = {
+var i = 0
+while (i < exprArray.length) {
+  // Store the result into buffer first, to make the projection atomic 
(needed by aggregation)
+  buffer(i) = exprArray(i).eval(input)
+  i += 1
+}
+i = 0
+while (i < exprArray.length) {
+  mutableRow(i) = buffer(i)
+  i += 1
+}
+mutableRow
+  }
+}
--- End diff --

+1 on the check for `NoOp`s.


---




[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...

2018-08-23 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22203
  
Thanks @srowen ! In that case, could we mention some of those failing test cases in the PR description itself, perhaps a few lines under the "How was this patch tested?" section, to give examples of which tests used to fail and are now fixed?


---




[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...

2018-08-23 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22203
  
I'm wondering, though: either in the same PR or in a separate one, could we add a test case in Spark that would trigger the "two non-abstract methods" error, so that we'd have an example scenario in Spark that hits the issue and can also show that Janino 3.0.9 has fixed it?


---




[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...

2018-08-23 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22203
  
LGTM


---




[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...

2018-08-22 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22187
  
So the new solution is to directly ship the `StructType` object as a reference object. Why not? ;-)
I'm +1 on shipping the object directly instead of generating code to recreate it on the executor (roughly along the lines sketched below). If other reviewers are also on board with this approach, let's update the PR title and the JIRA ticket title to reflect that (since we're no longer generating any "names" at all now).
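
For reference, shipping the object directly would look roughly like this. This is a sketch only: `keySchema` is a made-up variable name, and it relies on the usual `CodegenContext.addReferenceObj` helper rather than necessarily matching the PR's exact code.
```scala
// the generated code then refers to schemaRef instead of rebuilding the StructType from names
val schemaRef = ctx.addReferenceObj("keySchema", keySchema, classOf[StructType].getName)
```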


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211731921
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala
 ---
@@ -17,24 +17,10 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
-import org.codehaus.commons.compiler.CompileException
-import org.codehaus.janino.InternalCompilerException
-
-import org.apache.spark.TaskContext
+import org.apache.spark.internal.Logging
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.util.Utils
 
-/**
- * Catches compile error during code generation.
- */
-object CodegenError {
-  def unapply(throwable: Throwable): Option[Exception] = throwable match {
-case e: InternalCompilerException => Some(e)
-case e: CompileException => Some(e)
--- End diff --

We didn't in the original PR. @maropu -san added one in this PR, which is great; it plugs the hole.


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211731624
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala
 ---
@@ -63,7 +49,10 @@ abstract class CodeGeneratorWithInterpretedFallback[IN, 
OUT] {
 try {
   createCodeGeneratedObject(in)
 } catch {
-  case CodegenError(_) => createInterpretedObject(in)
+  case _: Exception =>
--- End diff --

Oh yeah, that's a good point. +1 on using the `NonFatal` extractor, and we probably want to make the same change for whole-stage codegen's fallback logic as well.
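
Roughly (a sketch of the discussed shape inside `createObject`, not the exact PR code):
```scala
import scala.util.control.NonFatal

try {
  createCodeGeneratedObject(in)
} catch {
  case NonFatal(_) =>
    // the compile error details were already logged by CodeGenerator
    logWarning("Expr codegen error and falling back to interpreter mode")
    createInterpretedObject(in)
}
```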


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211731392
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala
 ---
@@ -63,7 +49,10 @@ abstract class CodeGeneratorWithInterpretedFallback[IN, 
OUT] {
 try {
   createCodeGeneratedObject(in)
 } catch {
-  case CodegenError(_) => createInterpretedObject(in)
+  case _: Exception =>
+// We should have already seen the error message in 
`CodeGenerator`
+logWarning("Expr codegen error and falling back to interpreter 
mode")
--- End diff --

We could; although the detailed exception type/message/stack trace info would already be printed in the preceding logging from the `CodeGenerator`, so I wouldn't push strongly for adding the exception type here.
(It certainly is easier to do data science on single log lines instead of having to search across a few lines for context, so I do see the benefit of adding the type information, albeit redundant. I'll leave it up to you guys @maropu @viirya @cloud-fan @gatorsmile )


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211487231
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -40,4 +55,13 @@ class CodeGeneratorWithInterpretedFallbackSuite extends 
SparkFunSuite with PlanT
   assert(obj.isInstanceOf[InterpretedUnsafeProjection])
 }
   }
+
+  test("fallback to the interpreter mode") {
+val input = Seq(IntegerType).zipWithIndex.map(x => 
BoundReference(x._2, x._1, true))
--- End diff --

I'd also like to suggest changing the other test cases in this suite to do 
something like `Seq(IntegerType).zipWithIndex.map { case (tpe, ordinal) => 
BoundReference(ordinal, tpe, true) }` to get rid of the `_2`, `_1`s. This is 
also optional. WDYT @maropu @viirya ?


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211484515
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala
 ---
@@ -180,7 +180,10 @@ object UnsafeProjection
 try {
   GenerateUnsafeProjection.generate(unsafeExprs, 
subexpressionEliminationEnabled)
 } catch {
-  case CodegenError(_) => 
InterpretedUnsafeProjection.createProjection(unsafeExprs)
+  case _: Exception =>
+// We should have already see error message in `CodeGenerator`
--- End diff --

Nit: Perhaps `// the error message is already logged in CodeGenerator`? Or 
`// We should have already seen the error message in CodeGenerator`?


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211485732
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -40,4 +55,13 @@ class CodeGeneratorWithInterpretedFallbackSuite extends 
SparkFunSuite with PlanT
   assert(obj.isInstanceOf[InterpretedUnsafeProjection])
 }
   }
+
+  test("fallback to the interpreter mode") {
+val input = Seq(IntegerType).zipWithIndex.map(x => 
BoundReference(x._2, x._1, true))
+val fallback = CodegenObjectFactoryMode.FALLBACK.toString
+withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key -> fallback) {
+  val obj = FailedCodegenProjection.createObject(input)
+  assert(obj.isInstanceOf[InterpretedUnsafeProjection])
--- End diff --

Should/could we also assert on the log message being printed out? Perhaps 
something similar to 
https://github.com/apache/spark/pull/22103/files#diff-cf187b40d98ff322d4bde4185701baa2R508
 ?

This is optional; the fact that we've gotten back an 
`InterpretedUnsafeProjection` is a good enough indicator to me.


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211485314
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -40,4 +55,13 @@ class CodeGeneratorWithInterpretedFallbackSuite extends 
SparkFunSuite with PlanT
   assert(obj.isInstanceOf[InterpretedUnsafeProjection])
 }
   }
+
+  test("fallback to the interpreter mode") {
+val input = Seq(IntegerType).zipWithIndex.map(x => 
BoundReference(x._2, x._1, true))
--- End diff --

Nit: if we only wanted a one-element sequence, would it be cleaner to just write `Seq(BoundReference(0, IntegerType, true))`?
Or, for those who prefer the cons-list syntax, `BoundReference(0, IntegerType, true) :: Nil`. I prefer the former but I don't have a strong opinion here.


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211485848
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala
 ---
@@ -63,7 +49,10 @@ abstract class CodeGeneratorWithInterpretedFallback[IN, 
OUT] {
 try {
   createCodeGeneratedObject(in)
 } catch {
-  case CodegenError(_) => createInterpretedObject(in)
+  case _: Exception =>
+// We should have already see error message in `CodeGenerator`
--- End diff --

Same as the comment below.


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211484616
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala
 ---
@@ -180,7 +180,10 @@ object UnsafeProjection
 try {
   GenerateUnsafeProjection.generate(unsafeExprs, 
subexpressionEliminationEnabled)
 } catch {
-  case CodegenError(_) => 
InterpretedUnsafeProjection.createProjection(unsafeExprs)
+  case _: Exception =>
+// We should have already see error message in `CodeGenerator`
+logWarning("Expr codegen error and falls back to interpreter mode")
--- End diff --

Nit: I prefer "falling back" instead of "falls back" here. e.g. "Expr 
codegen error, falling back to interpreter mode"


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211484374
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -18,12 +18,27 @@
 package org.apache.spark.sql.catalyst.expressions
 
 import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeAndComment, 
CodeGenerator}
 import org.apache.spark.sql.catalyst.plans.PlanTestBase
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.{IntegerType, LongType}
 
 class CodeGeneratorWithInterpretedFallbackSuite extends SparkFunSuite with 
PlanTestBase {
 
+  object FailedCodegenProjection
+  extends CodeGeneratorWithInterpretedFallback[Seq[Expression], 
UnsafeProjection] {
+
+override protected def createCodeGeneratedObject(in: Seq[Expression]): 
UnsafeProjection = {
+  val invalidCode = new CodeAndComment("invalid code", Map.empty)
+  CodeGenerator.compile(invalidCode)
+  null.asInstanceOf[UnsafeProjection]
--- End diff --

Nit: do we need the cast here?


---




[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...

2018-08-14 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22103#discussion_r210169785
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging {
   try {
 val cf = new ClassFile(new ByteArrayInputStream(classBytes))
 val stats = cf.methodInfos.asScala.flatMap { method =>
-  method.getAttributes().filter(_.getClass.getName == 
codeAttr.getName).map { a =>
+  method.getAttributes().filter(_.getClass eq codeAttr).map { a =>
--- End diff --

> I worried whether the strictness may change behavior.

Right, I can tell. And as I've mentioned above, although my new check is stricter, it doesn't make the behavior any "worse" than before, because we're reflectively accessing the `code` field immediately afterwards via `codeAttrField.get(a)`, and that won't work unless the classes match exactly.

The old code before my change was actually too permissive: in the case of a class loader mismatch, the old check would let execution proceed to the reflective access site, which would then fail because reflection doesn't allow access from the wrong class.

This can be exemplified by the following pseudocode:
```
val c1 = new URLClassLoader(somePath).loadClass("Foo") // load a class
val c2 = new URLClassLoader(somePath).loadClass("Foo") // load another class with the same name from the same path, but a different class loader
val nameEq = c1.getName == c2.getName // true
val refEq = c1 eq c2                  // false
val f1 = c1.getDeclaredField("a")     // a Field of the Class loaded by the first loader
val o1 = c1.newInstance
val o2 = c2.newInstance
f1.get(o1) // okay
f1.get(o2) // fails with an exception: o2's class is a different Class object
```


---




[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...

2018-08-14 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22103#discussion_r210130579
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging {
   try {
 val cf = new ClassFile(new ByteArrayInputStream(classBytes))
 val stats = cf.methodInfos.asScala.flatMap { method =>
-  method.getAttributes().filter(_.getClass.getName == 
codeAttr.getName).map { a =>
+  method.getAttributes().filter(_.getClass eq codeAttr).map { a =>
--- End diff --

The `getName` call was fetching the class name and then doing a string comparison unnecessarily. I'm just changing it while touching the code around it.
The JVM guarantees that a `(defining class loader, full class name)` pair is unique at runtime, so in this case the `java.lang.Class` instance is guaranteed to be unique and a reference equality check is fast and sufficient.
There's no worry about cross-class-loader issues here: if there were any, the code that follows wouldn't work anyway.


---




[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...

2018-08-14 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22103#discussion_r210129972
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging {
   try {
 val cf = new ClassFile(new ByteArrayInputStream(classBytes))
 val stats = cf.methodInfos.asScala.flatMap { method =>
-  method.getAttributes().filter(_.getClass.getName == 
codeAttr.getName).map { a =>
+  method.getAttributes().filter(_.getClass eq codeAttr).map { a =>
 val byteCodeSize = 
codeAttrField.get(a).asInstanceOf[Array[Byte]].length
 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(byteCodeSize)
+
+if (byteCodeSize > DEFAULT_JVM_HUGE_METHOD_LIMIT) {
+  logInfo("Generated method too long to be JIT compiled: " +
--- End diff --

I'd say either way is fine. They're different tenses and the nuances are 
slightly different.

"This story is too good to be true"
vs
"A story too good to be true"


---




[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...

2018-08-14 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/22103

[SPARK-25113][SQL] Add logging to CodeGenerator when any generated method's 
bytecode size goes above HugeMethodLimit

## What changes were proposed in this pull request?

Add logging for all methods generated by the `CodeGenerator` whose bytecode size goes above 8000 bytes.
This is to help with gathering stats on how often Spark generates methods too big to be JIT'd. It covers all codegen scenarios, including whole-stage codegen and individual expression codegen, e.g. unsafe projection, mutable projection, etc.

## How was this patch tested?

Manually tested that the logging did happen when a generated method was above 8000 bytes.
Also added a new unit test case to `CodeGenerationSuite` to verify that the 
logging did happen.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark codegen-8k-logging

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22103


commit 640c9cd3b99d51f38c9b1a0c3f94bae676d11e51
Author: Kris Mok 
Date:   2018-08-14T10:50:01Z

SPARK-25113: Add logging to CodeGenerator when any generated method's 
bytecode size goes above HugeMethodLimit




---




[GitHub] spark issue #21951: [SPARK-24957][SQL][FOLLOW-UP] Clean the code for AVERAGE

2018-08-01 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21951
  
LGTM as well. Thanks a lot!


---




[GitHub] spark issue #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-03 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21699
  
This mostly looks good, but I'd like to ask a few things first:

1. The new overloaded `pivot()` that takes a `Column` only exists as 
`pivot(Column, Seq[Any])`. Were you planning to add a new overload for each 
existing `String` version, e.g. `pivot(Column)` and `pivot(Column, 
java.util.List[Any])` (see the sketch after this list)?
2. Since you're adding the `Column` version(s) to address accessing nested 
columns, would it be nice to highlight that capability in the doc example? (Yes 
you've already included that in test case examples so that's already good)
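For clarity, the full set of `Column` overloads being asked about would look something like this (assumed shape only, mirroring the existing `String`-based API; not the actual source):
```scala
import org.apache.spark.sql.{Column, RelationalGroupedDataset}

trait PivotColumnOverloads {
  def pivot(pivotColumn: Column): RelationalGroupedDataset
  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
  def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset
}
```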


---




[GitHub] spark pull request #21643: [SPARK-24659][SQL] GenericArrayData.equals should...

2018-06-26 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21643#discussion_r198370477
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ComplexDataSuite.scala
 ---
@@ -104,4 +104,40 @@ class ComplexDataSuite extends SparkFunSuite {
 // The copied data should not be changed externally.
 assert(copied.getStruct(0, 1).getUTF8String(0).toString == "a")
   }
+
+  test("SPARK-24659: GenericArrayData.equals should respect element type 
differences") {
+import scala.reflect.ClassTag
--- End diff --

Thanks for your suggestion! I'm used to making one-off imports inside a 
function when an import is only used within that function, so that its scope is 
as narrow as possible without getting in the way.
Are there any Spark coding style guidelines that suggest otherwise? If so, 
I'll follow the guideline and always import at the beginning of the file.
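For example (a toy illustration of the style in question):
```scala
def newIntArray(length: Int): Array[Int] = {
  import scala.reflect.ClassTag   // visible only inside this method
  implicitly[ClassTag[Int]].newArray(length)
}
```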


---




[GitHub] spark pull request #21643: [SPARK-24659][SQL] GenericArrayData.equals should...

2018-06-26 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21643#discussion_r198268691
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ComplexDataSuite.scala
 ---
@@ -104,4 +104,38 @@ class ComplexDataSuite extends SparkFunSuite {
 // The copied data should not be changed externally.
 assert(copied.getStruct(0, 1).getUTF8String(0).toString == "a")
   }
+
+  test("SPARK-24659: GenericArrayData.equals should respect element type 
differences") {
+import scala.reflect.ClassTag
+
+// Expected positive cases
+def arraysShouldEqual[T: ClassTag](element: T*): Unit = {
+  val array1 = new GenericArrayData(Array[T](element: _*))
+  val array2 = new GenericArrayData(Array[T](element: _*))
+  assert(array1.equals(array2))
+}
+arraysShouldEqual(true, false)// Boolean
+arraysShouldEqual(0.toByte, 123.toByte, (-123).toByte)// Byte
+arraysShouldEqual(0.toShort, 123.toShort, (-256).toShort) // Short
+arraysShouldEqual(0, 123, -65536) // Int
+arraysShouldEqual(0L, 123L, -65536L)  // Long
--- End diff --

That's a good one. I can do that (and NaNs/Infinity for floating point 
types too)
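For what it's worth, here is why the NaN cases are worth covering (a quick illustration, not the committed test):
```scala
// Boxed equals() treats NaN as equal to itself, unlike the primitive comparison,
// so element-wise equals() keeps arrays containing NaN comparable.
java.lang.Double.valueOf(Double.NaN).equals(java.lang.Double.valueOf(Double.NaN)) // true
Double.NaN == Double.NaN // false
```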


---




[GitHub] spark issue #21643: [SPARK-24659][SQL] GenericArrayData.equals should respec...

2018-06-26 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21643
  
Thanks for your comments, @MaxGekk @maropu @cloud-fan , I've tweaked the 
unit test case to address your comments. Please check it out again.


---




[GitHub] spark issue #21643: [SPARK-24659][SQL] GenericArrayData.equals should respec...

2018-06-26 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21643
  
https://github.com/apache/spark/pull/21643#pullrequestreview-131977512
For all practical purposes, no, this change shouldn't break any use cases 
that we support; there could be use cases that we didn't intend to support that 
might break.

In Spark SQL, SQL/DataFrame/Dataset operations are all type-checked, and 
the analyzer will strictly reject operations that involve incompatible types.
Array types in Spark SQL are neither covariant nor contravariant, so e.g. 
`array<int>` and `array<long>` would be considered incompatible, and the 
analyzer would have rejected an equality comparison between them in the first 
place.

In all other cases, `GenericArrayData` is a Spark SQL internal type that 
shouldn't be used on the outside, and I don't think we ever intended to support 
direct access of this type from the outside.


---




[GitHub] spark pull request #21643: [SPARK-24659][SQL] GenericArrayData.equals should...

2018-06-26 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21643

[SPARK-24659][SQL] GenericArrayData.equals should respect element type 
differences

## What changes were proposed in this pull request?

Fix `GenericArrayData.equals` so that it respects the actual types of the 
elements.
e.g. an instance that represents an `array<int>` and another instance that 
represents an `array<long>` should be considered incompatible, and thus 
`equals` should return false.

`GenericArrayData` doesn't keep any schema information by itself, and 
rather relies on the Java objects referenced by its `array` field's elements to 
keep track of their own object types. So, the most straightforward way to 
respect their types is to call `equals` on the elements, instead of using 
Scala's `==` operator, which can have semantics that are not always desirable:
```
new java.lang.Integer(123) == new java.lang.Long(123L) // true in Scala
new java.lang.Integer(123).equals(new java.lang.Long(123L)) // false in Scala
```
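A rough sketch of the element-wise comparison this amounts to (illustration only, not the actual patch):
```scala
def elementsEqual(a: Array[Any], b: Array[Any]): Boolean =
  a.length == b.length && a.indices.forall { i =>
    (a(i), b(i)) match {
      case (null, null) => true
      case (null, _) | (_, null) => false
      case (x, y) => x.equals(y)  // respects runtime types, unlike Scala's == on boxed numbers
    }
  }
```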

## How was this patch tested?

Added unit test in `ComplexDataSuite`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark 
fix-genericarraydata-equals

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21643.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21643


commit d91b44accbe40b9879cda259912e3ca38759d716
Author: Kris Mok 
Date:   2018-06-26T11:02:25Z

SPARK-24659: GenericArrayData.equals should respect element type differences




---




[GitHub] spark issue #21367: [SPARK-24321][SQL] Extract common code from Divide/Remai...

2018-05-22 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21367
  
Updated PR, addressed @cloud-fan and @viirya 's comments.


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-22 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189785968
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeBlockSuite.scala
 ---
@@ -23,12 +23,20 @@ import org.apache.spark.sql.types.{BooleanType, 
IntegerType}
 
 class CodeBlockSuite extends SparkFunSuite {
 
-  test("Block can interpolate string and ExprValue inputs") {
+  test("Block interpolates string and ExprValue inputs") {
 val isNull = JavaCode.isNullVariable("expr1_isNull")
-val code = code"boolean ${isNull} = 
${JavaCode.defaultLiteral(BooleanType)};"
+val stringLiteral = "false"
+val code = code"boolean $isNull = $stringLiteral;"
 assert(code.toString == "boolean expr1_isNull = false;")
   }
 
+  test("Literals are folded into string code parts instead of block 
inputs") {
--- End diff --

Great, I like it!


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-22 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189786155
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -167,9 +170,40 @@ object Block {
   case other => throw new IllegalArgumentException(
 s"Can not interpolate ${other.getClass.getName} into code 
block.")
 }
-CodeBlock(sc.parts, args)
+
+val (codeParts, blockInputs) = foldLiteralArgs(sc.parts, args)
+CodeBlock(codeParts, blockInputs)
+  }
+}
+  }
+
+  // Folds eagerly the literal args into the code parts.
+  private def foldLiteralArgs(parts: Seq[String], args: Seq[Any]): 
(Seq[String], Seq[Any]) = {
+val codeParts = ArrayBuffer.empty[String]
+val blockInputs = ArrayBuffer.empty[Any]
+
+val strings = parts.iterator
+val inputs = args.iterator
+val buf = new StringBuilder(Block.CODE_BLOCK_BUFFER_LENGTH)
+
+buf append strings.next
+while (strings.hasNext) {
--- End diff --

Looks good


---




[GitHub] spark pull request #21367: [SPARK-24321][SQL] Extract common code from Divid...

2018-05-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21367#discussion_r189723394
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
 ---
@@ -297,13 +279,39 @@ case class Divide(left: Expression, right: 
Expression) extends BinaryArithmetic
   if (${eval1.isNull}) {
 ${ev.isNull} = true;
   } else {
-${ev.value} = $divide;
+${ev.value} = $operation;
   }
 }""")
 }
   }
 }
 
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "expr1 _FUNC_ expr2 - Returns `expr1`/`expr2`. It always 
performs floating point division.",
+  examples = """
+Examples:
+  > SELECT 3 _FUNC_ 2;
+   1.5
+  > SELECT 2L _FUNC_ 2L;
+   1.0
+  """)
+// scalastyle:on line.size.limit
+case class Divide(left: Expression, right: Expression) extends DivModLike {
+
+  override def inputType: AbstractDataType = TypeCollection(DoubleType, 
DecimalType)
+
+  override def symbol: String = "/"
+  override def decimalMethod: String = "$div"
+  override def nullable: Boolean = true
--- End diff --

Oops! Thanks for catching it!
Addressed in an update.


---




[GitHub] spark pull request #21367: [SPARK-24321][SQL] Extract common code from Divid...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21367#discussion_r189404152
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
 ---
@@ -220,30 +220,12 @@ case class Multiply(left: Expression, right: 
Expression) extends BinaryArithmeti
   protected override def nullSafeEval(input1: Any, input2: Any): Any = 
numeric.times(input1, input2)
 }
 
-// scalastyle:off line.size.limit
-@ExpressionDescription(
-  usage = "expr1 _FUNC_ expr2 - Returns `expr1`/`expr2`. It always 
performs floating point division.",
-  examples = """
-Examples:
-  > SELECT 3 _FUNC_ 2;
-   1.5
-  > SELECT 2L _FUNC_ 2L;
-   1.0
-  """)
-// scalastyle:on line.size.limit
-case class Divide(left: Expression, right: Expression) extends 
BinaryArithmetic {
-
-  override def inputType: AbstractDataType = TypeCollection(DoubleType, 
DecimalType)
+// Common base trait for Divide and Remainder, since these two classes are 
almost identical
+trait DivModLike extends BinaryArithmetic {
 
-  override def symbol: String = "/"
-  override def decimalMethod: String = "$div"
   override def nullable: Boolean = true
 
-  private lazy val div: (Any, Any) => Any = dataType match {
-case ft: FractionalType => 
ft.fractional.asInstanceOf[Fractional[Any]].div
-  }
-
-  override def eval(input: InternalRow): Any = {
+  protected def evalHelper(input: InternalRow, op: => (Any, Any) => Any): Any = {
--- End diff --

Note: `op` is declared as a call-by-name argument to retain the `lazy val` 
semantics of the original code in `Divide.div`.
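A tiny standalone illustration of the by-name behaviour being relied on (not the actual `DivModLike` code):
```scala
def evalWithGuard(divisorIsZero: Boolean, op: => (Any, Any) => Any): Any = {
  if (divisorIsZero) null  // op is never evaluated on this path
  else op(6, 3)            // only forced when the operands are actually usable
}
```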


---




[GitHub] spark pull request #21367: [SPARK-24321][SQL] Extract common code from Divid...

2018-05-18 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21367

[SPARK-24321][SQL] Extract common code from Divide/Remainder to a base trait

## What changes were proposed in this pull request?

Extract common code from `Divide`/`Remainder` to a new base trait, 
`DivModLike`.

Further refactoring to make `Pmod` work with `DivModLike` is to be done as 
a separate task.

## How was this patch tested?

Existing tests in `ArithmeticExpressionSuite` covers the functionality.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark catalyst-divmod

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21367


commit 026531c70bfb7e06f8586010fc48a8666ff4df04
Author: Kris Mok <kris.mok@...>
Date:   2018-05-18T21:50:50Z

SPARK-24321: Extract common code from Divide/Remainder to a base trait




---




[GitHub] spark issue #21106: [SPARK-23711][SQL] Add fallback generator for UnsafeProj...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21106
  
Should this PR wait for https://github.com/apache/spark/pull/21299 so that 
accessing confs on executors can be made simpler?


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189229566
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeBlockSuite.scala
 ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions.codegen.Block._
+import org.apache.spark.sql.types.{BooleanType, IntegerType}
+
+class CodeBlockSuite extends SparkFunSuite {
+
+  test("Block can interpolate string and ExprValue inputs") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val code = code"boolean ${isNull} = 
${JavaCode.defaultLiteral(BooleanType)};"
+assert(code.toString == "boolean expr1_isNull = false;")
+  }
+
+  test("Block.stripMargin") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val value = JavaCode.variable("expr1", IntegerType)
+val code1 =
+  code"""
+   |boolean $isNull = false;
+   |int $value = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin
+val expected =
+  s"""
+|boolean expr1_isNull = false;
+|int expr1 = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin.trim
+assert(code1.toString == expected)
+
+val code2 =
+  code"""
+   >boolean $isNull = false;
+   >int $value = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin('>')
+assert(code2.toString == expected)
+  }
+
+  test("Block can capture input expr values") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val value = JavaCode.variable("expr1", IntegerType)
+val code =
+  code"""
+   |boolean $isNull = false;
+   |int $value = -1;
+  """.stripMargin
+val exprValues = code.exprValues
+assert(exprValues.size == 2)
+assert(exprValues === Set(value, isNull))
+  }
+
+  test("concatenate blocks") {
+val isNull1 = JavaCode.isNullVariable("expr1_isNull")
+val value1 = JavaCode.variable("expr1", IntegerType)
+val isNull2 = JavaCode.isNullVariable("expr2_isNull")
+val value2 = JavaCode.variable("expr2", IntegerType)
+val literal = JavaCode.literal("100", IntegerType)
+
+val code =
+  code"""
+   |boolean $isNull1 = false;
+   |int $value1 = -1;""".stripMargin +
+  code"""
+   |boolean $isNull2 = true;
+   |int $value2 = $literal;""".stripMargin
+
+val expected =
+  """
+   |boolean expr1_isNull = false;
+   |int expr1 = -1;
+   |boolean expr2_isNull = true;
+   |int expr2 = 100;""".stripMargin.trim
+
+assert(code.toString == expected)
+
+val exprValues = code.exprValues
+assert(exprValues.size == 5)
+assert(exprValues === Set(isNull1, value1, isNull2, value2, literal))
+  }
+
+  test("Throws exception when interpolating unexcepted object in code 
block") {
+val obj = TestClass(100)
+val e = intercept[IllegalArgumentException] {
+  code"$obj"
+}
+assert(e.getMessage().contains(s"Can not interpolate 
${obj.getClass.getName}"))
+  }
+
+  test("replace expr values in code block") {
+val statement = JavaCode.expression("1 + 1", IntegerType)
+val isNull = JavaCode.isNullV

[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189225184
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,6 +114,115 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Set[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  def length: Int = toString.length
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
--- End diff --

I'm wondering: does it make sense to eagerly fold the literal args into the 
`parts`, instead of carrying them over into `CodeBlock`? That's similar to what 
you're already doing with `CodeBlock.exprValues`, but just done a bit more 
eagerly.

Basically, the only things that are safe to manipulate are the structured 
ones like `ExprValue` and `Block`; none of the others are safe to change 
because we don't know what's in them. So there's no need to even carry them as 
separate arguments into `CodeBlock`.
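To sketch what I mean (standalone toy code, assuming only "structured" args need to survive; not the actual `Block` implementation):
```scala
def foldLiteralArgs(parts: Seq[String], args: Seq[Any])
    (isStructured: Any => Boolean): (Seq[String], Seq[Any]) = {
  val outParts = scala.collection.mutable.ArrayBuffer(parts.head)
  val outArgs  = scala.collection.mutable.ArrayBuffer.empty[Any]
  parts.tail.zip(args).foreach { case (nextPart, arg) =>
    if (isStructured(arg)) {
      outArgs += arg
      outParts += nextPart
    } else {
      // literal: fold its string form and the following part into the previous part
      outParts(outParts.length - 1) += arg.toString + nextPart
    }
  }
  (outParts.toSeq, outArgs.toSeq)
}
```
e.g. with `parts = Seq("boolean ", " = ", ";")` and `args = Seq(isNull, "false")`, the folded result would be `(Seq("boolean ", " = false;"), Seq(isNull))`.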


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189229229
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeBlockSuite.scala
 ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions.codegen.Block._
+import org.apache.spark.sql.types.{BooleanType, IntegerType}
+
+class CodeBlockSuite extends SparkFunSuite {
+
+  test("Block can interpolate string and ExprValue inputs") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val code = code"boolean ${isNull} = 
${JavaCode.defaultLiteral(BooleanType)};"
+assert(code.toString == "boolean expr1_isNull = false;")
+  }
+
+  test("Block.stripMargin") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val value = JavaCode.variable("expr1", IntegerType)
+val code1 =
+  code"""
+   |boolean $isNull = false;
+   |int $value = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin
+val expected =
+  s"""
+|boolean expr1_isNull = false;
+|int expr1 = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin.trim
+assert(code1.toString == expected)
+
+val code2 =
+  code"""
+   >boolean $isNull = false;
+   >int $value = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin('>')
+assert(code2.toString == expected)
+  }
+
+  test("Block can capture input expr values") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val value = JavaCode.variable("expr1", IntegerType)
+val code =
+  code"""
+   |boolean $isNull = false;
+   |int $value = -1;
+  """.stripMargin
+val exprValues = code.exprValues
+assert(exprValues.size == 2)
+assert(exprValues === Set(value, isNull))
+  }
+
+  test("concatenate blocks") {
+val isNull1 = JavaCode.isNullVariable("expr1_isNull")
+val value1 = JavaCode.variable("expr1", IntegerType)
+val isNull2 = JavaCode.isNullVariable("expr2_isNull")
+val value2 = JavaCode.variable("expr2", IntegerType)
+val literal = JavaCode.literal("100", IntegerType)
+
+val code =
+  code"""
+   |boolean $isNull1 = false;
+   |int $value1 = -1;""".stripMargin +
+  code"""
+   |boolean $isNull2 = true;
+   |int $value2 = $literal;""".stripMargin
+
+val expected =
+  """
+   |boolean expr1_isNull = false;
+   |int expr1 = -1;
+   |boolean expr2_isNull = true;
+   |int expr2 = 100;""".stripMargin.trim
+
+assert(code.toString == expected)
+
+val exprValues = code.exprValues
+assert(exprValues.size == 5)
+assert(exprValues === Set(isNull1, value1, isNull2, value2, literal))
+  }
+
+  test("Throws exception when interpolating unexcepted object in code 
block") {
+val obj = TestClass(100)
+val e = intercept[IllegalArgumentException] {
+  code"$obj"
+}
+assert(e.getMessage().contains(s"Can not interpolate 
${obj.getClass.getName}"))
+  }
+
+  test("replace expr values in code block") {
+val statement = JavaCode.expression("1 + 1", IntegerType)
--- End diff --

Nit: can we rename this `statement` variable to something like `expr` or 
`someExpr`? It's an expression, not a statement.


---




[GitHub] spark issue #21094: [SPARK-24007][SQL] EqualNullSafe for FloatType and Doubl...

2018-04-18 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21094
  
LGTM
Wow, that's a hideous bug...


---




[GitHub] spark issue #20772: [SPARK-23628][SQL] calculateParamLength should not retur...

2018-04-12 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20772
  
Just wanna leave a note here that the Janino side of the bug has been fixed:
https://github.com/janino-compiler/janino/pull/46#issuecomment-380799160

https://github.com/janino-compiler/janino/commit/a373ac9bb6d28d51dccb7421ea046d73809d7ce2
With this fix, Janino will correctly throw an exception upon invalid 
parameter size instead of silently allowing it to pass compilation (which leads 
to ClassFormatError later when the generated class is loaded by the JVM).


---




[GitHub] spark issue #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars...

2018-04-12 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21039
  
I just checked the same test in Build 4695, which still has this change, 
and the test passed:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4695/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/
Something weird must have happened when running this test in Build 4694...

re:
> shall we just don't do the nulling out? It wouldn't help the GC a lot.
@cloud-fan I can certainly do that -- just mark the field `@transient` and remove the 
nulling code.
But what I'm concerned about is that the test failure was exposing some 
other problem that we might not have anticipated before.

The whole-stage codegen logic in most physical operators assumes that 
codegen happens on a single thread. As such, it might use instance fields on 
the operator to pass state between the `doProduce` and `doConsume` methods, as 
is the case with `bufVars` in `HashAggregateExec`.
Now, imagine a scenario where I take a DataFrame `df`:
```scala
val plan = df.queryExecution.executedPlan

// then on Thread1
plan.execute // triggers whole-stage codegen

// at around the same time, on Thread2
plan.execute // also triggers whole-stage codegen
```
Now we're going to be performing whole-stage codegen on 2 different 
threads, at around the same time, on the exact same plan (if there are 
`HashAggregateExec`s involved, they're gonna be the same instances).
In this case, the two threads might accidentally "cross-talk" because of 
the way whole-stage codegen uses instance fields to pass state between 
`doProduce` and `doConsume`. It's basically "pure luck" that both threads 
finish the codegen correctly (despite one thread might be accidentally using 
the state from (overwritten by) another thread).

If this theory holds, nulling out the state introduces a "leak" of the 
"cross-talk", so it's now possible to see an NPE if the timing is just right. 
But it's fairly scary already even without the nulling...

Anyway, I'll resend this PR with only the `@transient` part of the change, 
since that won't cause any regressions for sure.


---




[GitHub] spark issue #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars...

2018-04-11 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21039
  
Thanks for reverting it for me. The test failure was definitely related to 
the explicit nulling from this PR, but I can't see how that's possible yet.

First of all, in the build that first introduced my change, build 4693, 
this particular test was passing:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4693/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/

The build that failed was the one immediately after that.

Second, the stack trace from the failure indicates that 
`doProduceWithoutKeys` is indeed on the stack, and its frame is at a point before 
the line where I null out `bufVars`, so I can't see how `bufVars` can be null in 
`doConsumeWithoutKeys`.

Stack trace:
```
java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithoutKeys(HashAggregateExec.scala:274)
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsume(HashAggregateExec.scala:171)
  at 
org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:209)
  at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:180)
  at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
  at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
  at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:182)
...
  at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
  at 
org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:35)
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithoutKeys(HashAggregateExec.scala:237)
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:163)
```

The relevant line in `doConsumeWithoutKeys` is:

https://github.com/apache/spark/blob/75a183071c4ed2e407c930edfdf721779662b3ee/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L274
It's reading `bufVars`.

The relevant line in `doProduceWithoutKeys` is:

https://github.com/apache/spark/blob/75a183071c4ed2e407c930edfdf721779662b3ee/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L237
It's calling `child.produce()`, and that's before the nulling at line 241.


---




[GitHub] spark pull request #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec....

2018-04-11 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21039#discussion_r180721843
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -236,6 +236,8 @@ case class HashAggregateExec(
  | }
""".stripMargin)
 
+bufVars = null  // explicitly null this field out to allow the 
referent to be GC'd sooner
--- End diff --

The workflow of whole-stage codegen ensures that `doConsumeWithoutKeys` can 
only be on the call stack when `doProduceWithoutKeys` is also on the call 
stack; the liveness of the former is strictly a subset of the latter.
That's because for a plan tree that looks like:
```
+- A
  +- B
+- C
```
The whole-stage codegen system (mostly) works like:
```
A.produce
 |--> B.produce
 | |--> C.produce
 | | |--> B.consume
 | | | |--> A.consume
 | | | ||
 | | | |<---o
 | | |<o
 | |<o
 |<o
 o
```


---




[GitHub] spark pull request #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec....

2018-04-11 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21039

[SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars as transient

## What changes were proposed in this pull request?

Mark `HashAggregateExec.bufVars` as transient to keep it from being serialized.
Also manually null out this field at the end of `doProduceWithoutKeys()` to 
shorten its lifecycle, because it'll no longer be used after that.
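Roughly, the shape of the change (a simplified sketch with a `Seq[String]` standing in for the real `Seq[ExprCode]`):
```scala
class HashAggLikeExec extends Serializable {
  // codegen-only state: not useful on executors, so keep it out of serialization
  @transient private var bufVars: Seq[String] = _

  def doProduceWithoutKeys(): String = {
    bufVars = Seq("agg_bufIsNull_0", "agg_bufValue_0")
    val produced = s"// initialize ${bufVars.mkString(", ")}"
    bufVars = null  // no longer needed once the produce-side code has been emitted
    produced
  }
}
```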

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark codegen-improve

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21039.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21039


commit 0b5a075b9de40d36379551720d7e0e07ad6deb40
Author: Kris Mok <kris.mok@...>
Date:   2018-04-11T09:35:31Z

SPARK-23960: Mark HashAggregateExec.bufVars as transient




---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180598651
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala
 ---
@@ -352,12 +346,10 @@ case class IsNotNull(child: Expression) extends 
UnaryExpression with Predicate {
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 val eval = child.genCode(ctx)
-val value = if (eval.isNull == TrueLiteral) {
-  FalseLiteral
-} else if (eval.isNull == FalseLiteral) {
-  TrueLiteral
-} else {
-  StatementValue(s"(!(${eval.isNull}))", 
CodeGenerator.javaType(dataType))
+val value = eval.isNull match {
+  case TrueLiteral => FalseLiteral
+  case FalseLiteral => TrueLiteral
+  case v => JavaCode.isNullExpression(s"!($v)")
--- End diff --

+1


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180575894
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
 ---
@@ -107,7 +110,7 @@ object GenerateSafeProjection extends 
CodeGenerator[Seq[Expression], Projection]
   final ArrayData $output = new $arrayClass($values);
 """
 
-ExprCode(code, FalseLiteral, VariableValue(output, "ArrayData"))
+ExprCode(code, FalseLiteral, VariableValue(output, classOf[ArrayData]))
--- End diff --

Use `JavaCode.variable(name, javaClass)` here?


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180583517
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import java.lang.{Boolean => JBool}
+
+import scala.language.{existentials, implicitConversions}
+
+import org.apache.spark.sql.types.{BooleanType, DataType}
+
+/**
++ * Trait representing an opaque fragments of java code.
++ */
+trait JavaCode {
+  def code: String
+  override def toString: String = code
+}
+
+/**
+ * Utility functions for creating [[JavaCode]] fragments.
+ */
+object JavaCode {
+  /**
+   * Create a java literal.
+   */
+  def literal(v: String, dataType: DataType): LiteralValue = dataType 
match {
+case BooleanType if v == "true" => TrueLiteral
+case BooleanType if v == "false" => FalseLiteral
+case _ => new LiteralValue(v, CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a default literal. This is null for reference types, false for 
boolean types and
+   * -1 for other primitive types.
+   */
+  def defaultLiteral(dataType: DataType): LiteralValue = {
+new LiteralValue(
+  CodeGenerator.defaultValue(dataType, typedNull = true),
+  CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a local java variable.
+   */
+  def variable(name: String, dataType: DataType): VariableValue = {
+VariableValue(name, CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a local isNull variable.
+   */
+  def isNullVariable(name: String): VariableValue = variable(name, 
BooleanType)
+
+  /**
+   * Create a global java variable.
+   */
+  def global(name: String, dataType: DataType): GlobalValue = {
+GlobalValue(name, CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a global isNull variable.
+   */
+  def isNullGlobal(name: String): GlobalValue = global(name, BooleanType)
+
+  /**
+   * Create an expression fragment.
+   */
+  def expression(code: String, dataType: DataType): SimpleExprValue = {
+SimpleExprValue(code, CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a isNull expression fragment.
+   */
+  def isNullExpression(code: String): SimpleExprValue = {
+expression(code, BooleanType)
+  }
+}
+
+/**
+ * A typed java fragment that must be a valid java expression.
+ */
+trait ExprValue extends JavaCode {
+  def javaType: Class[_]
+  def isPrimitive: Boolean = javaType.isPrimitive
--- End diff --

This PR got rid of the `canDirectAccess` property from @viirya 's base 
change. Should we add it back?
(and implement that property in concrete classes with an `override def` 
instead of an `override val`)
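i.e. something along these lines (a sketch of the suggestion with stand-in names):
```scala
trait ExprValueLike {
  def canDirectAccess: Boolean = false  // declared as a def on the trait
}

case class SimpleExprValueLike(code: String) extends ExprValueLike {
  override def canDirectAccess: Boolean = true  // overridden with a def, not a val
}
```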


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180575717
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
 ---
@@ -292,8 +292,7 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
  |$writeExpressions
""".stripMargin
 // `rowWriter` is declared as a class field, so we can access it 
directly in methods.
-ExprCode(code, FalseLiteral, StatementValue(s"$rowWriter.getRow()", 
"UnsafeRow",
-  canDirectAccess = true))
+ExprCode(code, FalseLiteral, SimpleExprValue(s"$rowWriter.getRow()", 
classOf[UnsafeRow]))
--- End diff --

Add a `JavaCode.expression(expr: String, javaClass: Class[_])` and use it 
here?


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180575964
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
 ---
@@ -76,7 +78,7 @@ object GenerateSafeProjection extends 
CodeGenerator[Seq[Expression], Projection]
  |final InternalRow $output = new $rowClass($values);
""".stripMargin
 
-ExprCode(code, FalseLiteral, VariableValue(output, "InternalRow"))
+ExprCode(code, FalseLiteral, VariableValue(output, 
classOf[InternalRow]))
--- End diff --

Use `JavaCode.variable(name, javaClass)` here?


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180575803
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
 ---
@@ -128,7 +131,7 @@ object GenerateSafeProjection extends 
CodeGenerator[Seq[Expression], Projection]
   final MapData $output = new $mapClass(${keyConverter.value}, 
${valueConverter.value});
 """
 
-ExprCode(code, FalseLiteral, VariableValue(output, "MapData"))
+ExprCode(code, FalseLiteral, VariableValue(output, classOf[MapData]))
--- End diff --

Use `JavaCode.variable(name, javaClass)` here?


---




[GitHub] spark pull request #21016: [SPARK-23947][SQL] Add hashUTF8String convenience...

2018-04-09 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21016

[SPARK-23947][SQL] Add hashUTF8String convenience method to hasher classes

## What changes were proposed in this pull request?

Add `hashUTF8String()` to the hasher classes to allow Spark SQL codegen to 
generate cleaner code for hashing `UTF8String`s. No change in behavior 
otherwise.

Although with the introduction of SPARK-10399, the code size for hashing 
`UTF8String` is already smaller, it's still good to extract a separate function 
in the hasher classes so that the generated code can stay clean.
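The convenience method would look roughly like this (a sketch; the hasher and `UTF8String` accessor names are from memory and may not match the sources exactly):
```scala
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

def hashUTF8String(s: UTF8String, seed: Int): Int =
  Murmur3_x86_32.hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
```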

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark hashutf8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21016.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21016


commit ead91a453fb6a3f92255d14d37c1d4a6beac22fd
Author: Kris Mok <kris.mok@...>
Date:   2018-04-09T22:30:03Z

SPARK-23947 - Add hashUTF8String convenience method to hasher classes




---




[GitHub] spark pull request #19082: [SPARK-21870][SQL] Split aggregation code into sm...

2018-04-05 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/19082#discussion_r179367236
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -257,6 +259,78 @@ case class HashAggregateExec(
  """.stripMargin
   }
 
+  // Extracts all the input variable references for a given `aggExpr`. 
This result will be used
+  // to split aggregation into small functions.
+  private def getInputVariableReferences(
+  context: CodegenContext,
+  aggregateExpression: Expression,
+  subExprs: Map[Expression, SubExprEliminationState]): Set[(String, 
String)] = {
+// `argSet` collects all the pairs of variable names and their types, 
the first in the pair is
+// a type name and the second is a variable name.
+val argSet = mutable.Set[(String, String)]()
+val stack = mutable.Stack[Expression](aggregateExpression)
+while (stack.nonEmpty) {
+  stack.pop() match {
+case e if subExprs.contains(e) =>
+  val exprCode = subExprs(e)
+  if (CodegenContext.isJavaIdentifier(exprCode.value)) {
--- End diff --

Once we have @viirya 's https://github.com/apache/spark/pull/20043 merged 
we won't need the ugly `CodegenContext.isJavaIdentifier` hack any more >_<|||


---




[GitHub] spark issue #20779: [SPARK-23598][SQL] Make methods in BufferedRowIterator p...

2018-03-24 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20779
  
Sorry for the late comment. This PR itself is LGTM. I'd just like to make 
some side comments on why this bug is happening.

Janino uses a peculiar encoding to implement the bridge methods that extend 
accessibility from an inner class to members of its enclosing class. Here we're 
actually hitting a bug in Janino where it fails to create bridge methods on the 
enclosing class (`GeneratedClass...`) for `protected` members inherited 
from a base class (`BufferedRowIterator`).
I've seen this bug in Janino before, and I plan to fix it in Janino soon. 
Once it's fixed in Janino, we can safely use `protected` members such as 
`append` and `stopEarly` in nested classes within `GeneratedClass...` again. 
Would anybody be interested in switching these methods back to `protected` once 
it's fixed in Janino and Spark bumps the Janino dependency to that new version?

Thanks!


---




[GitHub] spark issue #20870: [SPARK-23760][SQL] CodegenContext.withSubExprElimination...

2018-03-21 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20870
  
jenkins retest this please


---




[GitHub] spark pull request #20870: [SPARK-23760][SQL] CodegenContext.withSubExprElim...

2018-03-20 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20870#discussion_r175986986
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -942,7 +940,7 @@ class CodegenContext {
   def subexpressionEliminationForWholeStageCodegen(expressions: 
Seq[Expression]): SubExprCodes = {
 // Create a clear EquivalentExpressions and SubExprEliminationState 
mapping
 val equivalentExpressions: EquivalentExpressions = new 
EquivalentExpressions
-val subExprEliminationExprs = mutable.HashMap.empty[Expression, 
SubExprEliminationState]
+val localSubExprEliminationExprs = mutable.HashMap.empty[Expression, 
SubExprEliminationState]
--- End diff --

This renaming isn't necessary for the fix per-se, but I'd like to piggyback 
it on this change so that it's clearer that we're not interfering with the 
current CSE state of this CodegenContext here.


---




[GitHub] spark pull request #20870: [SPARK-23760][SQL]: CodegenContext.withSubExprEli...

2018-03-20 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/20870

[SPARK-23760][SQL]: CodegenContext.withSubExprEliminationExprs should 
save/restore CSE state correctly

## What changes were proposed in this pull request?

Fixed `CodegenContext.withSubExprEliminationExprs()` so that it 
saves/restores CSE state correctly.

## How was this patch tested?

Added new unit test to verify that the old CSE state is indeed saved and 
restored around the `withSubExprEliminationExprs()` call. Manually verified 
that this test fails without this patch.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark codegen-subexpr-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20870.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20870






---




[GitHub] spark issue #20753: [SPARK-23582][SQL] StaticInvoke should support interpret...

2018-03-07 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20753
  
@hvanhovell @kiszk Yes, old-school JDK reflection has used bytecode 
generation to speed up invocation since the JDK 1.4 era. But the advantage of a fully 
optimized `MethodHandle` over reflective `Method.invoke` is that the former 
performs access checks at `MethodHandle` creation time, whereas the latter 
has to perform access checks on every invocation (and those aren't 
covered by the generated bytecode). So theoretically the `MethodHandle` version 
can be faster.

But note that even in OpenJDK8, the HotSpot VM is still exploring ways to 
optimize `MethodHandle` performance, and it's somewhat picky about the shape of 
the `MethodHandle`s, so it's not like we will always get better performance 
just by sticking it in place of reflection. It'll need some tuning / 
trial-and-error. We don't have to get it right in the first cut.
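A small standalone illustration of the difference (sketch only; as noted, whether it's actually faster in Spark's generated code still needs measurement):
```scala
import java.lang.invoke.{MethodHandles, MethodType}

// Access is checked once, when the handle is looked up.
val parseInt = MethodHandles.lookup().findStatic(
  classOf[java.lang.Integer], "parseInt",
  MethodType.methodType(classOf[Int], classOf[String]))
val viaHandle = parseInt.invokeWithArguments("123") // 123

// Core reflection re-checks accessibility on each invoke() call (unless setAccessible is used).
val method = classOf[java.lang.Integer].getMethod("parseInt", classOf[String])
val viaReflection = method.invoke(null, "123") // 123
```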


---




[GitHub] spark pull request #20753: [SPARK-23582][SQL] StaticInvoke should support in...

2018-03-07 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20753#discussion_r173075682
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -133,8 +134,21 @@ case class StaticInvoke(
   override def nullable: Boolean = needNullCheck || returnNullable
   override def children: Seq[Expression] = arguments
 
-  override def eval(input: InternalRow): Any =
-throw new UnsupportedOperationException("Only code-generated 
evaluation is supported.")
+  override def eval(input: InternalRow): Any = {
+if (staticObject == null) {
--- End diff --

We don't need this check, for sure. See line 132,
```
val objectName = staticObject.getName.stripSuffix("$")
```
^^ this runs as part of the constructor and will throw an NPE if 
`staticObject` is `null`, so null-checking it here is redundant.


---




[GitHub] spark pull request #20757: [SPARK-23595][SQL] ValidateExternalType should su...

2018-03-07 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20757#discussion_r173010942
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -1408,11 +1409,37 @@ case class ValidateExternalType(child: Expression, 
expected: DataType)
 
   override def dataType: DataType = 
RowEncoder.externalDataTypeForInput(expected)
 
-  override def eval(input: InternalRow): Any =
-throw new UnsupportedOperationException("Only code-generated 
evaluation is supported")
-
   private val errMsg = s" is not a valid external type for schema of 
${expected.simpleString}"
 
+  private lazy val checkType = expected match {
+case _: DecimalType =>
+  (value: Any) => {
+Seq(classOf[java.math.BigDecimal], classOf[scala.math.BigDecimal], 
classOf[Decimal])
+  .exists { x => value.getClass.isAssignableFrom(x) }
+  }
+case _: ArrayType =>
+  (value: Any) => {
+value.getClass.isAssignableFrom(classOf[Seq[_]]) || 
value.getClass.isArray
--- End diff --

For those curious:

In HotSpot, the straightforward interpreter/C1 implementation of 
`xxx.getClass().isArray()` path is actually something like:
```
// for getClass()
klazz = xxx._klass; // read the hidden klass pointer field from the object 
header
clazz = klazz._java_mirror; // read the java.lang.Class reference from the 
Klass
// for clazz.isArray(): go through JNI and call the native 
JVM_IsArrayClass() inside HotSpot
klazz1 = clazz->_klass;
result = klazz1->oop_is_array();
```
So a JNI native method call is involved and that's not really fast. But C2 
will optimize this into something similar to:
```
klazz = xxx._klass;
result = inlined klazz->oop_is_array();
```
So that's pretty fast. There's no need to load the `java.lang.Class` (aka the 
"Java mirror") reference anymore.

In the `xxx.isInstanceOf[Seq[_]]` case, the interpreted version again goes 
through a JNI native method call, whereas the C1/C2 versions inline fast-path 
logic that does a quick comparison against a per-type cache. That fast-path 
check has overhead similar to the C2-compiled `isArray()`.


---




[GitHub] spark pull request #20757: [SPARK-23595][SQL] ValidateExternalType should su...

2018-03-07 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20757#discussion_r173006281
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -1408,11 +1409,37 @@ case class ValidateExternalType(child: Expression, 
expected: DataType)
 
   override def dataType: DataType = 
RowEncoder.externalDataTypeForInput(expected)
 
-  override def eval(input: InternalRow): Any =
-throw new UnsupportedOperationException("Only code-generated 
evaluation is supported")
-
   private val errMsg = s" is not a valid external type for schema of 
${expected.simpleString}"
 
+  private lazy val checkType = expected match {
+case _: DecimalType =>
+  (value: Any) => {
+Seq(classOf[java.math.BigDecimal], classOf[scala.math.BigDecimal], 
classOf[Decimal])
+  .exists { x => value.getClass.isAssignableFrom(x) }
+  }
+case _: ArrayType =>
+  (value: Any) => {
+value.getClass.isAssignableFrom(classOf[Seq[_]]) || 
value.getClass.isArray
--- End diff --

Hi guys, sorry I'm late.

In your new code you're doing:
```diff
+case _: ArrayType =>
+  (value: Any) => {
+value.getClass.isArray || value.isInstanceOf[Seq[_]]
+  }
```
which is good. `xxx.getClass().isAssignableFrom(some_class_literal)` in the 
old version of this PR is actually backwards; it should have been 
`some_class_literal.isAssignableFrom(xxx.getClass())`, e.g.
```
scala> classOf[String].isAssignableFrom(classOf[Object])
res0: Boolean = false

scala> classOf[Object].isAssignableFrom(classOf[String])
res1: Boolean = true
```
and the latter is semantically the same as `xxx.isInstanceOf[some_class]`. 
`isInstanceOf[]` is guaranteed to be at least as fast as 
`some_class_literal.isAssignableFrom(xxx.getClass())`, and in general 
`isInstanceOf[]` is faster.
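
For example (illustration only), with the `Seq` case from this PR:
```scala
val v: Any = Vector(1, 2, 3)
v.isInstanceOf[Seq[_]]                          // true
classOf[Seq[_]].isAssignableFrom(v.getClass)    // true: the equivalent class-object form
v.getClass.isAssignableFrom(classOf[Seq[_]])    // false: the backwards form from the old code
```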

`xxx.getClass().isArray()` has a fixed overhead, whereas `isInstanceOf[]` 
can have a fast path slightly faster than `isArray` and a slow path that 
can be much slower than `isArray`. So putting the `isArray` check first in your 
new code makes more sense to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20700: [SPARK-23546][SQL] Refactor stateless methods/values in ...

2018-03-02 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20700
  
Aha! Thanks @kiszk san for working on this! I really wanted the stateless 
methods to be extracted so that I can use more utils without having to pass 
around a `CodegenContext` for no good reason.

I was actually working on the exact same thing last week but got pulled 
away by other tasks, so my patch is still WIP.
For those curious, here's my WIP patch: 
https://github.com/rednaxelafx/apache-spark/commit/249fb93c5cacc35d53e6b0f972b139d27ef5d720

I was hoping we can cover the ones I extracted here: 
https://github.com/rednaxelafx/apache-spark/commit/249fb93c5cacc35d53e6b0f972b139d27ef5d720#diff-8bcc5aea39c73d4bf38aef6f6951d42cR1201
Your PR and my version extracted mostly the same fields/methods, which is 
great; I'll just help get this PR in instead of having to send mine out.

The approach is different, though. I'd like to have a discussion on the 
tradeoffs you had in mind when you picked your approach.
I initially did something similar, which is to extract the stateless methods 
into companion object methods and change the original ones to delegate to the 
companion versions. But the old instance methods were a bad smell to me (passing 
in an instance when it shouldn't need the instance), so since we're at it, I 
thought why not completely remove that bad smell once and for all.
So I deleted the original instance methods, and yes, that made the diff huge 
because all use sites of those fields/methods had to be touched to switch to 
the new version.
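
To make the trade-off concrete, here's a rough sketch of the two styles (the 
method body is simplified for illustration and is not the actual `CodeGenerator` 
code):
```scala
object CodeGenerator {
  // stateless util extracted to the companion object
  def boxedType(javaType: String): String = javaType match {
    case "int"  => "Integer"
    case "long" => "Long"
    case other  => other
  }
}

class CodegenContext {
  // style A (this PR): keep a thin delegating instance method, so existing call sites compile as-is
  def boxedType(javaType: String): String = CodeGenerator.boxedType(javaType)
  // style B (my WIP): no instance method at all; every call site switches to CodeGenerator.boxedType
}
```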

@kiszk WDYT? Were you intending to minimize the diff while still being 
able to access the stateless util methods as standalone functions?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...

2018-02-22 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20645
  
A quick question: after this change, `extraJavaOptions` is still able to 
cleanly override whatever's set in `defaultJavaOptions`, is that right?

^^ Just making sure I understood the intent correctly and not the other way 
around. There may well be an administrative need for the opposite, i.e. to 
"force options", e.g. forcing `-XX:-DisableExplicitGC` so that NIO direct memory 
doesn't get into trouble, but that'd be out of scope for this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20043
  
Hi @viirya, this PR is definitely in the right direction. I'd like to 
review this PR as well and hopefully help polish it in for post-2.3. Thanks for 
your work!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
@hvanhovell Thanks a lot! You're absolutely right. Updated the PR 
accordingly. It's passing tests locally and hopefully it'll pass 
the Jenkins tests as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
Ah, I see: there are more places that statically reference some 
variable even though, dynamically, those variables would always be null. I'll 
update the PR later to fix those places as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
cc @cloud-fan @hvanhovell 
Note: this is for master and branch-2.3 post 2.3.0 release.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
So I was able to find quite a few cases where the `DUMMY` placeholder 
caught uses of the `value` field outside of appropriate null-checked regions. 
I'll check the individual cases and then update this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
jenkins retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20626: [SPARK-23447][SQL] Cleanup codegen template for L...

2018-02-15 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/20626

[SPARK-23447][SQL] Cleanup codegen template for Literal

## What changes were proposed in this pull request?

Cleaned up the codegen templates for `Literal`s, to make sure that the 
`ExprCode` returned from `Literal.doGenCode()` has:
1. an empty `code` field;
2. an `isNull` field of either literal `true` or `false`;
3. a `value` field that is just a simple literal/constant.

Before this PR, there were a couple of paths that would return a non-trivial 
`code`, and all of them are actually unnecessary. The `NaN` and `Infinity` 
constants for `double` and `float` can be accessed through directly available 
constants, so there's no need to add a reference object for them.

Also took the opportunity to add a new util method to make it easier to create 
`ExprCode` for inline-able non-null values.
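
In other words, the post-cleanup invariant looks roughly like this (a sketch 
only; the exact helper name in the PR may differ):
```scala
// what Literal.doGenCode() should hand back for e.g. Literal(1.5, DoubleType):
ExprCode(code = "", isNull = "false", value = "1.5D")
// nothing to evaluate, statically non-null, and a value string that can be inlined verbatim
```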

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark codegen-literal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20626.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20626


commit 68edf0f3463daed3bb7042becb333788b22b23b0
Author: Kris Mok <kris.mok@...>
Date:   2018-02-16T07:44:43Z

Cleanup codegen templates for Literals: make sure the `code` field is empty 
and the `value` field is a simple literal.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20612: [SPARK-23424][SQL]Add codegenStageId in comment

2018-02-15 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20612
  
LGTM. Let's wait for one more LGTM from @gatorsmile / @cloud-fan .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20599: [SPARK-23407][SQL] add a config to try to inline ...

2018-02-14 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20599#discussion_r168101849
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -675,6 +675,16 @@ object SQLConf {
   "disable logging or -1 to apply no limit.")
 .createWithDefault(1000)
 
+  val CODEGEN_TRY_INLINE_ALL_STATES =
+buildConf("spark.sql.codegen.tryInlineAllStates")
+.internal()
+.doc("When adding mutable states during code generation, whether or 
not we should try to " +
+  "inline all the states. If this config is false, we only try to 
inline primitive stats, " +
+  "so that primitive states are more likely to be inlined. Set this 
config to true to make " +
+  "the behavior same as Spark 2.2.")
--- End diff --

Also watch out for a typo `s/stats/states/`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId i...

2018-02-12 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20419#discussion_r167775177
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1226,14 +1226,21 @@ class CodegenContext {
 
   /**
* Register a comment and return the corresponding place holder
+   *
+   * @param placeholderId a string for a place holder
--- End diff --

Nit: can we rephrase this ScalaDoc a bit, maybe like:
```scala
/**
 * ...
 * @param placeholderId an optionally specified identifier for the comment's placeholder.
 *                      The caller should make sure this identifier is unique within the
 *                      compilation unit. If this argument is not specified, a fresh
 *                      identifier will be automatically created and used as the placeholder.
 * ...
 */
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-02-01 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
@kiszk For this specific kind of usage, I don't think using a hardcoded 
stable ID will be a problem.
The comment we're talking about is the kind that can only appear once in a 
single whole-stage codegen unit -- there's only one `GeneratedIterator` class 
with the `processNext()` method as the main entry point. Providing such a 
mechanism allows codegen developers to mark important information without 
risking any effect on the codegen cache behavior.

For safety, we can add a runtime check. Let's assume this new method is 
called `ctx.registerCommentWithId()`; inside it we can implement collision 
detection very similar to `ctx.freshName()`'s, except that on a collision, 
instead of producing a new name with an integer suffix appended, we either log 
a warning message or throw an exception. That way the developer would have an 
easy way to catch duplicate IDs.
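
Roughly something like the following (just a sketch; `registerCommentWithId` is 
only the name proposed here, and I'm assuming we reuse the context's existing 
placeholder-to-comment map):
```scala
def registerCommentWithId(text: String, placeholderId: String): String = {
  // unlike freshName(), don't silently rename on a collision -- surface it instead
  if (placeHolderToComments.contains(placeholderId)) {
    throw new IllegalStateException(s"Duplicate comment placeholder id: $placeholderId")
    // (or log a warning and fall back to a freshName()-style suffix)
  }
  placeHolderToComments += placeholderId -> s"/* $text */"
  s"/*$placeholderId*/"
}
```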


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
I gave it a bit more thought. Here's an alternative proposal: instead of 
using a "force comment" mechanism in the current form (which still gets a 
`ctx.freshName("c")` that the caller has no control over), why don't we provide 
an alternative API that also forces the comment, but allows the caller to 
specify an ID for the placeholder comment that is unique within the 
`CodegenContext`?

In this case, instead of using `/*wholestagecodegen_c*/` as the 
placeholder, which can be brittle if someone adds new calls to 
`ctx.registerComment()` in `WholeStageCodegenExec`, we can call 
`ctx.registerComment(comment, placeholderId = "wsc_codegenStageId")` and be 
sure that it ends up creating a placeholder comment of 
`/*wsc_codegenStageId*/`, stable across all whole-stage generated main classes, 
which would never affect the codegen cache behavior.

@kiszk WDYT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
I did the experiment that I asked about at 
https://github.com/apache/spark/pull/20419#issuecomment-361359825 , and 
verified that under the current implementation, this PR will not affect the 
codegen cache hit behavior.

But the way it's working is a bit brittle. The `body` field of the 
`CodeAndComment` returned from `WholeStageCodegenExec.doCodeGen()` looks like 
the following, when `spark.sql.codegen.comments=false && 
spark.sql.codegen.useIdInClassName=false`:
```java
/*wholestagecodegen_c*/
final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
```
Notice the `/*wholestagecodegen_c*/` placeholder comment. It's the artifact 
produced by the `ctx.registerComment()` call, where `wholestagecodegen_` is a 
prefix automatically derived from the calling class, and `c` is from a 
`ctx.freshName("c")`, subject to an integer suffix when names collide.

The way it's used right now, `WholeStageCodegenExec` makes only a few calls to 
`ctx.registerComment()` and all of them happen in a deterministic order, which 
is why the identifier embedded in the placeholder comment is the same no matter 
which codegen stage it is.

But suppose someone doesn't know about this and accidentally adds code in 
`WholeStageCodegenExec` that conditionally calls `ctx.registerComment()`; then 
this placeholder comment is NOT guaranteed to be the same, and that will affect 
the codegen cache hit behavior.

I'd say functionality-wise this PR is good to go; maybe we want to add a 
comment somewhere in `WholeStageCodegenExec` to note the subtlety, or maybe 
think about an alternative sometime in the future.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-30 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
@gatorsmile Not directly. The `CodeAndComment` case class is just a 
"container"; it doesn't control what gets into the `body` field. When we 
force-embed a comment, it leaves a placeholder comment in the generated code, 
which is part of the `body` (the actual comment contents are in the `comment` 
map). I was just curious whether any reviewers know off the top of their heads 
whether or not this placeholder may affect equality.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20434: [SPARK-23267] [SQL] Increase spark.sql.codegen.hu...

2018-01-30 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20434#discussion_r164687283
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -660,12 +660,10 @@ object SQLConf {
   val WHOLESTAGE_HUGE_METHOD_LIMIT = 
buildConf("spark.sql.codegen.hugeMethodLimit")
 .internal()
 .doc("The maximum bytecode size of a single compiled Java function 
generated by whole-stage " +
-  "codegen. When the compiled function exceeds this threshold, " +
-  "the whole-stage codegen is deactivated for this subtree of the 
current query plan. " +
-  s"The default value is 
${CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT} and " +
-  "this is a limit in the OpenJDK JVM implementation.")
--- End diff --

The 8000 byte limit is a HotSpot-specific thing, but the 64KB limit is 
imposed by the Java Class File format, as a part of the JVM spec.

We may want to wordsmith a bit here to explain that:
1. 65535 is the largest bytecode size possible for a valid Java method; 
setting the default value to 65535 effectively turns the limit off for 
whole-stage codegen;
2. For those who wish to turn this limit on when running on HotSpot, it 
may be preferable to set the value to 
`CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT` to match HotSpot's implementation.

I don't have a good concrete suggestion as to how to concisely express 
these two points in the doc string, though.
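One rough attempt (wording only, not a concrete proposal) that tries to capture 
both points:
```scala
.doc("The maximum bytecode size of a single compiled Java function generated by whole-stage " +
  "codegen. When the compiled function exceeds this threshold, whole-stage codegen is " +
  "deactivated for this subtree of the current query plan. The default value of 65535 is " +
  "the largest bytecode size a valid Java method can have, which effectively disables the " +
  "limit for whole-stage codegen. On HotSpot, it may be preferable to set the value to " +
  s"${CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT} to match HotSpot's implementation limit.")
```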


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-29 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
@kiszk SGTM and LGTM. Let's ship it!

One more question on the side: with the `forceComment = true`, are we fully 
sure that won't affect the equality of `CodeAndComment`?
The whole point of this PR is to have a way to embed the ID into the 
non-executable code of the generated code so that it won't affect the codegen 
cache hit, right?

Could you please post an example of either the generated code or the 
metrics, e.g. a `SortMergeJoin` on two identical `spark.range(3)`s, and confirm 
that even when the two `range()`s are codegen'd into different 
`codegenStageId`s, with `spark.sql.codegen.useIdInClassName` turned off, 
the two stages would still hit the codegen cache? Basically, to verify that the 
example I gave in 
https://github.com/apache/spark/pull/20224#issuecomment-357091842 still hits 
the codegen cache when `spark.sql.codegen.useIdInClassName=false`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-28 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
LGTM, and +1 on @viirya's idea. I like it better for the comment to be on 
top of the class declaration instead of inside it, but I'm okay either way if 
others have strong opinions otherwise. As long as the comment line is right next 
to the class declaration line, that's good enough for easy pattern matching.

BTW, to save a few characters in the generated code, maybe we don't need 
to generate this comment when `spark.sql.codegen.useIdInClassName` is 
turned on, since the `codegenStageId` is in the class name already.
But for the ease of having a unified way to access the ID from the comment, 
I like having this comment always there. I guess @kiszk and @viirya are on the 
same side on this one, for always having the comment, right? In that case I'd 
agree; let's get this in :-)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r164019336
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -471,19 +531,21 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 }
""", inlineToOuterClass = true)
 
+val className = generatedClassName()
+
 val source = s"""
   public Object generate(Object[] references) {
-return new GeneratedIterator(references);
+return new $className(references);
   }
 
   ${ctx.registerComment(s"""Codegend pipeline 
for\n${child.treeString.trim}""")}
-  final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
+  final class $className extends 
${classOf[BufferedRowIterator].getName} {
--- End diff --

Yes, I like that idea too. Please ping me on the follow-up PR as well. 
Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


