[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...

2018-11-19 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/23079#discussion_r234534798
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicateSuite.scala
 ---
@@ -298,6 +299,45 @@ class ReplaceNullWithFalseSuite extends PlanTest {
 testProjection(originalExpr = column, expectedExpr = column)
   }
 
+  test("replace nulls in lambda function of ArrayFilter") {
+val cond = GreaterThan(UnresolvedAttribute("e"), Literal(0))
--- End diff --

Updated.


---




[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...

2018-11-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/23079#discussion_r234508866
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicateSuite.scala
 ---
@@ -298,6 +299,45 @@ class ReplaceNullWithFalseSuite extends PlanTest {
 testProjection(originalExpr = column, expectedExpr = column)
   }
 
+  test("replace nulls in lambda function of ArrayFilter") {
+val cond = GreaterThan(UnresolvedAttribute("e"), Literal(0))
--- End diff --

Actually I intentionally made all three lambdas the same (the `MapFilter` one only differs in the lambda parameter). I can encapsulate this lambda function into a test utility function. Let me update the PR and see what you think.


---




[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...

2018-11-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/23079#discussion_r234508561
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -767,6 +767,15 @@ object ReplaceNullWithFalse extends Rule[LogicalPlan] {
   replaceNullWithFalse(cond) -> value
 }
 cw.copy(branches = newBranches)
+  case af @ ArrayFilter(_, lf @ LambdaFunction(func, _, _)) =>
--- End diff --

I'm not sure whether that's useful. First of all, the `replaceNullWithFalse` handling doesn't apply to all higher-order functions; in fact it only applies to a very narrow set, namely those where a lambda function returns `BooleanType` and is immediately used as a predicate. So having a generic utility would certainly make this PR slightly simpler, but I don't know how useful it would be for other cases.
I'd prefer to wait for more such transformation cases before introducing a new utility for the pattern (a hypothetical sketch of what it could look like is below). WDYT?
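
For what it's worth, such a helper could be as small as the following. This is a hypothetical sketch only: the name `replaceNullInPredicateLambda` is made up, and it assumes it sits inside the rule object next to the existing `replaceNullWithFalse` method.
```scala
// Hypothetical utility, not part of this PR: factor out the
// "boolean-returning lambda used as a predicate" rewrite if more cases appear later.
private def replaceNullInPredicateLambda(lf: LambdaFunction): LambdaFunction =
  lf.copy(function = replaceNullWithFalse(lf.function))
```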


---




[GitHub] spark issue #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInPredicat...

2018-11-18 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/23079
  
cc @aokolnychyi : I'd like to propose renaming the rule you introduced to add an `-InPredicate` suffix, because obviously we can't replace arbitrary `null`s with `false`, only the ones that are going to be used directly in a boolean predicate context (e.g. `If(cond, _, _)`, `Filter`, etc. that you've already nicely identified; see the illustration below). Does that make sense to you?
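
For illustration (a minimal sketch in Catalyst terms, not code from this PR): a `null` used as a predicate behaves exactly like `false`, while a `null` produced as a value must stay `null`.
```scala
import org.apache.spark.sql.catalyst.expressions.{If, Literal}
import org.apache.spark.sql.types.BooleanType

// Safe: a null condition selects the false branch, exactly as a false condition would.
val original  = If(Literal(null, BooleanType), Literal(1), Literal(2)) // evaluates to 2
val rewritten = If(Literal(false), Literal(1), Literal(2))             // also evaluates to 2

// Not safe: a null in a value position (e.g. a projected column) must remain null;
// rewriting it to false there would change query results, hence the narrower rule name.
```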

BTW thank you very much for introducing that rule. It's really neat!


---




[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...

2018-11-18 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/23079

[SPARK-26107][SQL] Extend ReplaceNullWithFalseInPredicate to support 
higher-order functions: ArrayExists, ArrayFilter, MapFilter

## What changes were proposed in this pull request?

Extend the `ReplaceNullWithFalse` optimizer rule introduced in SPARK-25860 (https://github.com/apache/spark/pull/22857) to also optimize predicates in the lambda functions of the higher-order functions `ArrayExists`, `ArrayFilter` and `MapFilter`.

Also rename the rule to `ReplaceNullWithFalseInPredicate` to better reflect 
its intent.
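
The gist of the new cases is sketched below; this is only a sketch of the pattern (see the diff for the exact code), and it assumes the rule's existing `replaceNullWithFalse` helper and the Catalyst expression imports already used by the rule:
```scala
case af @ ArrayFilter(_, lf @ LambdaFunction(func, _, _)) =>
  af.copy(function = lf.copy(function = replaceNullWithFalse(func)))
case ae @ ArrayExists(_, lf @ LambdaFunction(func, _, _)) =>
  ae.copy(function = lf.copy(function = replaceNullWithFalse(func)))
case mf @ MapFilter(_, lf @ LambdaFunction(func, _, _)) =>
  mf.copy(function = lf.copy(function = replaceNullWithFalse(func)))
```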

## How was this patch tested?

Added new unit test cases to the `ReplaceNullWithFalseInPredicateSuite` 
(renamed from `ReplaceNullWithFalseSuite`).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark catalyst-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23079


commit 710c8862b3138f6146fe2309d6379707f8d4ac14
Author: Kris Mok 
Date:   2018-11-18T09:09:53Z

Extend ReplaceNullWithFalseInPredicate to support higher-order functions: 
ArrayExists, ArrayFilter, MapFilter




---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-11-16 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r234385204
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -17,17 +17,33 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
+import java.util.concurrent.ExecutionException
+
 import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeAndComment, 
CodeGenerator}
 import org.apache.spark.sql.catalyst.plans.PlanTestBase
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.types.{IntegerType, LongType}
+import org.apache.spark.sql.types.IntegerType
 
 class CodeGeneratorWithInterpretedFallbackSuite extends SparkFunSuite with 
PlanTestBase {
 
-  test("UnsafeProjection with codegen factory mode") {
-val input = Seq(LongType, IntegerType)
-  .zipWithIndex.map(x => BoundReference(x._2, x._1, true))
+  object FailedCodegenProjection
+  extends CodeGeneratorWithInterpretedFallback[Seq[Expression], 
UnsafeProjection] {
+
+override protected def createCodeGeneratedObject(in: Seq[Expression]): 
UnsafeProjection = {
+  val invalidCode = new CodeAndComment("invalid code", Map.empty)
+  // We assume this compilation throws an exception
--- End diff --

The suggested change is only for making this test suite cleaner, right? In that case I'd +1 the suggestion of being able to clearly check that we're catching the exception we know we're throwing.
Would you like to submit a PR for it?

> rather than returning `null`

The intent is never to actually reach the null-return, but to always cause an exception to be thrown at `CodeGenerator.compile()` and abruptly return to the caller with that exception. To make the compiler happy you have to end the function with some definite-returning statement, so a useless null-return would probably have to be there anyway (the compiler can't tell that you'll always be throwing an exception unless you throw inline).
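
To make the "throw inline" alternative concrete, here is a sketch (not the PR's code; the exception type is an arbitrary choice) of what the test helper could look like without the placeholder null-return:
```scala
override protected def createCodeGeneratedObject(in: Seq[Expression]): UnsafeProjection = {
  val invalidCode = new CodeAndComment("invalid code", Map.empty)
  CodeGenerator.compile(invalidCode) // expected to throw before we get any further
  // every path now ends in a throw, so no trailing null is needed to satisfy the compiler
  throw new IllegalStateException("CodeGenerator.compile() was expected to fail")
}
```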


---




[GitHub] spark pull request #22976: [SPARK-25974][SQL]Optimizes Generates bytecode fo...

2018-11-15 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22976#discussion_r233771269
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
 ---
@@ -68,62 +68,55 @@ object GenerateOrdering extends 
CodeGenerator[Seq[SortOrder], Ordering[InternalR
 genComparisons(ctx, ordering)
   }
 
+  /**
+   * Creates the variables for ordering based on the given order.
+   */
+  private def createOrderKeys(
+  ctx: CodegenContext,
+  row: String,
+  ordering: Seq[SortOrder]): Seq[ExprCode] = {
+ctx.INPUT_ROW = row
+// to use INPUT_ROW we must make sure currentVars is null
+ctx.currentVars = null
--- End diff --

I thought about making the same suggestion, but on second thought it's fine to just leave it here as well. The overhead isn't going to be high, and it's nice to see how `ctx.INPUT_ROW` and `ctx.currentVars` interact on adjacent lines.

So I'm okay either way (as-is or with @viirya 's suggestion).


---




[GitHub] spark issue #22976: [SPARK-25974][SQL]Optimizes Generates bytecode for order...

2018-11-15 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22976
  
+1 LGTM, thanks for making this improvement!


---




[GitHub] spark issue #23032: [WIP][SPARK-26061][SQL][MINOR] Reduce the number of unus...

2018-11-14 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/23032
  
Ha, it's failing an Avro test case due to:
```
Cause: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 65, Column 33: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 65, Column 33: Expression "serializefromobject_isNull_1" is not an rvalue
```
It can be reproduced by `build/sbt -Pavro "avro/test-only *AvroSuite -- -z partitioned"`.

So some state must be getting mutated in between: moving the `prepareRowVar()` call down by a few lines apparently stomped on a side effect that something else depended on.


---




[GitHub] spark issue #23032: [SPARK-26061][SQL][MINOR] Reduce the number of unused Un...

2018-11-14 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/23032
  
cc @viirya @maropu @kiszk 


---




[GitHub] spark pull request #23032: [SPARK-XXXXX][SQL][MINOR] Reduce the number of un...

2018-11-14 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/23032

[SPARK-X][SQL][MINOR] Reduce the number of unused UnsafeRowWriters 
created in whole-stage codegen

## What changes were proposed in this pull request?

Reduce the number of unused `UnsafeRowWriter`s created in whole-stage generated code.
They come from `CodegenSupport.consume()` calling `prepareRowVar()`, which uses `GenerateUnsafeProjection.createCode()` and registers an `UnsafeRowWriter` mutable state, regardless of whether the downstream (parent) operator will use the `rowVar`. Even when the downstream `doConsume` function doesn't use the `rowVar` (i.e. doesn't put `row.code` into this operator's codegen template), the registered `UnsafeRowWriter` stays there, which bloats the init function of the generated code.

This PR doesn't heal the root issue, but it makes it slightly less painful: when the `doConsume` function is split out, `prepareRowVar()` is called twice, so the pain of unused `UnsafeRowWriter`s is doubled. This PR simply moves the original call to `prepareRowVar()` down into the `doConsume` split/no-split branches so that we're back to just 1x the pain.

To fix the root issue, we would need something that allows `CodegenSupport` operators to indicate whether they're going to use the `rowVar`. That's a much more elaborate change, so I'd like to make this minor fix first.

e.g. for this query: `spark.range(10).as[Long].map(x => x + 1).map(x => x * 
x + 2)`
**Before** (in Spark 2.4.0):
```java
/* 023 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 024 */ partitionIndex = index;
/* 025 */ this.inputs = inputs;
/* 026 */
/* 027 */ range_taskContext_0 = TaskContext.get();
/* 028 */ range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 029 */ range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */ range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */ range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 032 */ range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 033 */ range_mutableStateArray_0[4] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 034 */ range_mutableStateArray_0[5] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 035 */ range_mutableStateArray_0[6] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 036 */ range_mutableStateArray_0[7] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 037 */ range_mutableStateArray_0[8] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 038 */
/* 039 */   }
```
9 `UnsafeRowWriter`s created, 1 actually used.

**After**:
```java
/* 023 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 024 */ partitionIndex = index;
/* 025 */ this.inputs = inputs;
/* 026 */
/* 027 */ range_taskContext_0 = TaskContext.get();
/* 028 */ range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 029 */ deserializetoobject_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */ deserializetoobject_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */ deserializetoobject_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 032 */ deserializetoobject_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 033 */ deserializetoobject_mutableStateArray_0[4] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 034 */
/* 035 */   }
```
5 `UnsafeRowWriter`s created, 1 actually used.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark wscg-reduce-rowwriter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23032.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23032


commit e48ec9da9977846213044c5f6d093202ed0efd60
Author: Kris Mok 
Date:   2018-11-14T09:22:11Z

Reduce the 

[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22847
  
Just in case people wonder, the following is the hack patch that I used for 
stress testing code splitting before this PR:
```diff
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -647,11 +647,13 @@ class CodegenContext(val useStreamlining: Boolean) {
    * Returns a term name that is unique within this instance of a `CodegenContext`.
    */
   def freshName(name: String): String = synchronized {
-    val fullName = if (freshNamePrefix == "") {
+    // hack: intentionally add a very long prefix (length=300 characters) to
+    // trigger code splitting more frequently
+    val fullName = ("averylongprefix" * 20) + (if (freshNamePrefix == "") {
       name
     } else {
       s"${freshNamePrefix}_$name"
-    }
+    })
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
```
Of course, now with this PR we can simply set the split threshold to a very low value (e.g. `1`) to force splitting, for example:
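
A minimal way to do that (assuming an active `SparkSession` named `spark`; the conf key is the one this PR adds):
```scala
// force very aggressive method splitting, useful for stress-testing the split path
spark.conf.set("spark.sql.codegen.methodSplitThreshold", "1")
```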


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r229943260
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The threshold of source code length without comment of a single 
Java function by " +
+  "codegen to be split. When the generated Java function source code 
exceeds this threshold" +
+  ", it will be split into multiple small functions. We can't know how 
many bytecode will " +
+  "be generated, so use the code length as metric. A function's 
bytecode should not go " +
+  "beyond 8KB, otherwise it will not be JITted; it also should not be 
too small, otherwise " +
+  "there will be many function calls.")
+.intConf
--- End diff --

Oh I see, you're using the column name...that's not the right place to put 
the "prefix". Column names are almost never carried over to the generated code 
in the current framework (the only exception is the lambda variable name).


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r229942325
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The threshold of source code length without comment of a single 
Java function by " +
+  "codegen to be split. When the generated Java function source code 
exceeds this threshold" +
+  ", it will be split into multiple small functions. We can't know how 
many bytecode will " +
+  "be generated, so use the code length as metric. A function's 
bytecode should not go " +
+  "beyond 8KB, otherwise it will not be JITted; it also should not be 
too small, otherwise " +
+  "there will be many function calls.")
+.intConf
--- End diff --

The "freshNamePrefix" prefix is only applied in whole-stage codegen, 

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L87

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L169

It has no effect in non-whole-stage codegen.

If you intend to stress test expression codegen but don't see the prefix 
being prepended, you're probably not adding it in the right place. Where did 
you add it?


---




[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22847
  
Can / should we apply the same threshold conf to 
`Expression.reduceCodeSize()`?
cc @yucai @cloud-fan @kiszk 


---




[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22847
  
Using source code length will have to be a very coarse-grained, "fuzzy" heuristic. It's not meant to be accurate, so just pick some number that makes sense. 2048 might be a good enough choice for now.

Tidbit: before this PR, since the `1024` split threshold was hard-coded, the way I stress-tested the code splitting path was to hack into `CodegenContext.freshName()` and artificially prepend a very long prefix to the symbol names, e.g. "project_value_1" becomes "averylongprefixrepeatedmultipletimes_project_value_1". That inflates the source code size a lot, which triggers code splitting a lot more frequently under the existing heuristics.
I'm mentioning this here to show that source-code-length-based heuristics are inherently imprecise. The rule will actually behave slightly differently for expressions in different physical operators because the prefix length differs: "serializefromobject_" is much longer than "project_", which is longer than "agg_", right?
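
A quick way to see the skew (plain string lengths, nothing Spark-specific; the names are just typical freshName-style prefixes):
```scala
Seq("serializefromobject_value_0", "project_value_0", "agg_value_0").map(s => s -> s.length)
// List((serializefromobject_value_0,27), (project_value_0,15), (agg_value_0,11))
```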

Side notes for those curious:
(details are way too complicated to be addressed by a source code length 
based heuristic, but I'm making this note just to let people know what's 
involved)

Performance-wise, it might be very interesting to note that, running on HotSpot, sometimes it's better to split into more, smaller methods, and sometimes it's better to split into fewer but larger methods.

The theory is this: if a method is really hot, we'd like it to eventually be inlined by the JIT to reduce method invocation overhead and get better optimization. In HotSpot, hot methods will eventually be compiled by C2, which uses multiple bytecode-size-based thresholds to determine whether a callee is fit for inlining.
```
MaxInlineSize = 35         // in bytes of bytecode. "The maximum bytecode size of a method to be inlined"
FreqInlineSize = 325       // in bytes of bytecode. "The maximum bytecode size of a frequent method to be inlined"
MaxTrivialSize = 6         // in bytes of bytecode. "The maximum bytecode size of a trivial method to be inlined"
InlineFrequencyRatio = 20  // "Ratio of call site execution to caller method invocation"
// there's also an InlineFrequencyCount which we'll omit here for simplicity
```
(values shown above are default values for C2 on x86.)
c.f. 
http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/tip/src/share/vm/runtime/globals.hpp

Don't be confused by the names, though.
`MaxTrivialSize` is meant for regular getters/setters, and those are considered always inlineable (unless explicitly configured as `dontinline` in a compiler directive option).
`MaxInlineSize` is for regular small methods, with more checks than the "trivial" case but without checking the invocation frequency.
`InlineFrequencyRatio` is the callee-to-caller invocation ratio for a call site to be considered "hot" (or "frequent"). The default is 20:1. This usually applies to call sites inside loops, where HotSpot can collect loop counts / invocation counts during profiling to figure out the ratio.
Similarly, there's an `InlineFrequencyCount`: if a call site (after scaling according to inline depth etc.) is invoked more than that many times, it can also be considered "hot".
`FreqInlineSize` is the max size at which a "hot" call site can be considered for inlining. It's 325 bytes of bytecode, obviously larger than `MaxInlineSize`, so it might sound confusing at first ;-)

So the deal is: on HotSpot, if you want to save bytecode size by
- Outlining a utility method that's <= 6 bytes: there's no overhead at all. But in practice this is not really feasible, because at the call site the "invokespecial" itself is 3 bytes, you'd need at least 1 byte to pass in "this", and passing arguments (e.g. INPUT_ROW) costs at least 1 or 2 more bytes. That already adds up to 4 to 5 bytes at the call site, whereas the callee's method body can only be <= 6 bytes, so there's not much you can do…
- Outlining a utility method that's <= 35 bytes: that's the easiest for both HotSpot C1 and C2 to inline. You can do something but not a lot in such a method. If you want to shoot for this case, then don't make the split threshold too big.
- Outlining a utility method that's <= 325 bytes: in HotSpot, only C2 gets to inline big methods like this, and only "hot" call sites are considered.

So, if a method is compute-intensive and small enough, it'd be better to strive to keep the split-off methods within the thresholds mentioned above.
Otherwise, if it's alrea

[GitHub] spark pull request #22506: [SPARK-25494][SQL] Upgrade Spark's use of Janino ...

2018-09-20 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/22506

[SPARK-25494][SQL] Upgrade Spark's use of Janino to 3.0.10

## What changes were proposed in this pull request?

This PR upgrades Spark's use of Janino from 3.0.9 to 3.0.10.
Note that 3.0.10 is an out-of-band release specifically for fixing an integer overflow issue in Janino's `ClassFile` reader. It is otherwise exactly the same as 3.0.9, so it's a low-risk and compatible upgrade.

The integer overflow issue affects Spark SQL's codegen stats collection: when a generated class file is huge, especially when its constant pool size is above `Short.MAX_VALUE`, Janino's `ClassFile` reader will throw an exception when Spark parses the generated class file to collect stats. So we'd miss the stats of some huge class files.

The related Janino issue is: 
https://github.com/janino-compiler/janino/issues/58

## How was this patch tested?

Existing codegen tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark upgrade-janino

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22506.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22506


commit c3f8a6b41c9409339bf62fb17cd4cd905853d97f
Author: Kris Mok 
Date:   2018-09-20T23:08:27Z

bump Janino to 3.0.10




---




[GitHub] spark issue #22355: [SPARK-25358][SQL] MutableProjection supports fallback t...

2018-09-18 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22355
  
LGTM as well.


---




[GitHub] spark pull request #22429: [SPARK-25440][SQL] Dumping query execution info t...

2018-09-17 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22429#discussion_r218235962
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala ---
@@ -189,23 +192,32 @@ class QueryExecution(val sparkSession: SparkSession, 
val logical: LogicalPlan) {
   """.stripMargin.trim
   }
 
+  private def writeOrError(writer: Writer)(f: Writer => Unit): Unit =
+try f(writer) catch { case e: AnalysisException => 
writer.write(e.toString) }
+
+  private def writePlans(writer: Writer): Unit = {
+writer.write("== Parsed Logical Plan ==\n")
+writeOrError(writer)(logical.treeString(_, verbose = true, addSuffix = 
false))
+writer.write("== Analyzed Logical Plan ==\n")
+writeOrError(writer) { writer =>
+  analyzed.output.foreach(o => writer.write(s"${o.name}: 
${o.dataType.simpleString}"))
--- End diff --

If you want to get the best out of this approach, it might be better to avoid string interpolation here and make explicit `writer.write` calls for the pieces you're interpolating.
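
Something along these lines (a sketch reusing the names from the diff above, not a definitive change):
```scala
analyzed.output.foreach { o =>
  writer.write(o.name)
  writer.write(": ")
  writer.write(o.dataType.simpleString)
}
```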


---




[GitHub] spark pull request #22355: [SPARK-25358][SQL] MutableProjection supports fal...

2018-09-12 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22355#discussion_r217016675
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.InternalRow
+
+
+/**
+ * A [[MutableProjection]] that is calculated by calling `eval` on each of 
the specified
+ * expressions.
+ *
+ * @param expressions a sequence of expressions that determine the value 
of each column of the
+ *output row.
+ */
+class InterpretedMutableProjection(expressions: Seq[Expression]) extends 
MutableProjection {
+  def this(expressions: Seq[Expression], inputSchema: Seq[Attribute]) =
+this(expressions.map(toBoundExprs(_, inputSchema)))
--- End diff --

```
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala:32: type mismatch;
[error]  found   : org.apache.spark.sql.catalyst.expressions.Expression
[error]  required: Seq[org.apache.spark.sql.catalyst.expressions.Expression]
[error] this(expressions.map(toBoundExprs(_, inputSchema)))
[error]   ^
```

It's probably `this(toBoundExprs(expressions, inputSchema))` right?


---




[GitHub] spark pull request #22355: [SPARK-25358][SQL] MutableProjection supports fal...

2018-09-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22355#discussion_r216525084
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.InternalRow
+
+
+/**
+ * A [[MutableProjection]] that is calculated by calling `eval` on each of 
the specified
+ * expressions.
+ *
+ * @param expressions a sequence of expressions that determine the value 
of each column of the
+ *output row.
+ */
+class InterpretedMutableProjection(expressions: Seq[Expression]) extends 
MutableProjection {
+  def this(expressions: Seq[Expression], inputSchema: Seq[Attribute]) =
+this(expressions.map(BindReferences.bindReference(_, inputSchema)))
--- End diff --

use `toBoundExpr`?


---




[GitHub] spark pull request #22355: [SPARK-25358][SQL] MutableProjection supports fal...

2018-09-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22355#discussion_r216524666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -86,24 +86,12 @@ package object expressions  {
   }
 
   /**
-   * Converts a [[InternalRow]] to another Row given a sequence of 
expression that define each
-   * column of the new row. If the schema of the input row is specified, 
then the given expression
-   * will be bound to that schema.
-   *
-   * In contrast to a normal projection, a MutableProjection reuses the 
same underlying row object
-   * each time an input row is added.  This significantly reduces the cost 
of calculating the
-   * projection, but means that it is not safe to hold on to a reference 
to a [[InternalRow]] after
-   * `next()` has been called on the [[Iterator]] that produced it. 
Instead, the user must call
-   * `InternalRow.copy()` and hold on to the returned [[InternalRow]] 
before calling `next()`.
+   * A helper function to bound given expressions to an input schema.
--- End diff --

Spelling nitpick: s/bound/bind/


---




[GitHub] spark pull request #22355: [SPARK-25358][SQL] MutableProjection supports fal...

2018-09-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22355#discussion_r216526434
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.InternalRow
+
+
+/**
+ * A [[MutableProjection]] that is calculated by calling `eval` on each of 
the specified
+ * expressions.
+ *
+ * @param expressions a sequence of expressions that determine the value 
of each column of the
+ *output row.
+ */
+class InterpretedMutableProjection(expressions: Seq[Expression]) extends 
MutableProjection {
+  def this(expressions: Seq[Expression], inputSchema: Seq[Attribute]) =
+this(expressions.map(BindReferences.bindReference(_, inputSchema)))
+
+  private[this] val buffer = new Array[Any](expressions.size)
+
+  override def initialize(partitionIndex: Int): Unit = {
+expressions.foreach(_.foreach {
+  case n: Nondeterministic => n.initialize(partitionIndex)
+  case _ =>
+})
+  }
+
+  private[this] val exprArray = expressions.toArray
+  private[this] var mutableRow: InternalRow = new 
GenericInternalRow(exprArray.length)
+  def currentValue: InternalRow = mutableRow
+
+  override def target(row: InternalRow): MutableProjection = {
+mutableRow = row
+this
+  }
+
+  override def apply(input: InternalRow): InternalRow = {
+var i = 0
+while (i < exprArray.length) {
+  // Store the result into buffer first, to make the projection atomic 
(needed by aggregation)
+  buffer(i) = exprArray(i).eval(input)
+  i += 1
+}
+i = 0
+while (i < exprArray.length) {
+  mutableRow(i) = buffer(i)
+  i += 1
+}
+mutableRow
+  }
+}
--- End diff --

+1 on the check for `NoOp`s.


---




[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...

2018-08-23 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22203
  
Thanks @srowen ! In that case, could we mention some of those failing test cases in the PR description itself, perhaps a few lines under the "How was this patch tested?" section, to give examples of which tests used to fail and are now fixed?


---




[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...

2018-08-23 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22203
  
I'm wondering, though: either in the same PR or in a separate one, could we add a test case in Spark that would trigger the "two non-abstract methods" error, so that we'd have an example scenario in Spark that hits the issue and can also show that Janino 3.0.9 has fixed it?


---




[GitHub] spark issue #22203: [SPARK-25029][BUILD][CORE] Janino "Two non-abstract meth...

2018-08-23 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22203
  
LGTM


---




[GitHub] spark issue #22187: [SPARK-25178][SQL] change the generated code of the keyS...

2018-08-22 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/22187
  
So the new solution is to directly ship the `StructType` object as a reference object. Why not? ;-)
I'm +1 on shipping the object directly instead of generating code to recreate it on the executor (roughly along the lines sketched below). If other reviewers are also on board with this approach, let's update the PR title and the JIRA ticket title to reflect that (since we're no longer generating any "names" at all now).
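
For reference, shipping the object directly would look roughly like this. This is a sketch only: `keySchema` is a made-up variable name, and it relies on the usual `CodegenContext.addReferenceObj` helper rather than necessarily matching the PR's exact code.
```scala
// the generated code then refers to schemaRef instead of rebuilding the StructType from names
val schemaRef = ctx.addReferenceObj("keySchema", keySchema, classOf[StructType].getName)
```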


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211731921
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala
 ---
@@ -17,24 +17,10 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
-import org.codehaus.commons.compiler.CompileException
-import org.codehaus.janino.InternalCompilerException
-
-import org.apache.spark.TaskContext
+import org.apache.spark.internal.Logging
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.util.Utils
 
-/**
- * Catches compile error during code generation.
- */
-object CodegenError {
-  def unapply(throwable: Throwable): Option[Exception] = throwable match {
-case e: InternalCompilerException => Some(e)
-case e: CompileException => Some(e)
--- End diff --

We didn't in the original PR. @maropu -san added one in this PR, which is great; it plugs the hole.


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211731624
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala
 ---
@@ -63,7 +49,10 @@ abstract class CodeGeneratorWithInterpretedFallback[IN, 
OUT] {
 try {
   createCodeGeneratedObject(in)
 } catch {
-  case CodegenError(_) => createInterpretedObject(in)
+  case _: Exception =>
--- End diff --

Oh yeah, that's a good point. +1 on using the `NonFatal` extractor, and we probably want to make the same change for whole-stage codegen's fallback logic as well.
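
Roughly (a sketch of the discussed shape inside `createObject`, not the exact PR code):
```scala
import scala.util.control.NonFatal

try {
  createCodeGeneratedObject(in)
} catch {
  case NonFatal(_) =>
    // the compile error details were already logged by CodeGenerator
    logWarning("Expr codegen error and falling back to interpreter mode")
    createInterpretedObject(in)
}
```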


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211731392
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala
 ---
@@ -63,7 +49,10 @@ abstract class CodeGeneratorWithInterpretedFallback[IN, 
OUT] {
 try {
   createCodeGeneratedObject(in)
 } catch {
-  case CodegenError(_) => createInterpretedObject(in)
+  case _: Exception =>
+// We should have already seen the error message in 
`CodeGenerator`
+logWarning("Expr codegen error and falling back to interpreter 
mode")
--- End diff --

We could; although the detailed exception type/message/stack trace info would already be printed in the preceding logging from the `CodeGenerator`, so I wouldn't push strongly for adding the exception type here.
(It certainly is easier to do data science on single log lines instead of having to search across a few lines for context, so I do see the benefit of adding the type information, albeit redundant. I'll leave it up to you guys @maropu @viirya @cloud-fan @gatorsmile )


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211487231
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -40,4 +55,13 @@ class CodeGeneratorWithInterpretedFallbackSuite extends 
SparkFunSuite with PlanT
   assert(obj.isInstanceOf[InterpretedUnsafeProjection])
 }
   }
+
+  test("fallback to the interpreter mode") {
+val input = Seq(IntegerType).zipWithIndex.map(x => 
BoundReference(x._2, x._1, true))
--- End diff --

I'd also like to suggest changing the other test cases in this suite to do 
something like `Seq(IntegerType).zipWithIndex.map { case (tpe, ordinal) => 
BoundReference(ordinal, tpe, true) }` to get rid of the `_2`, `_1`s. This is 
also optional. WDYT @maropu @viirya ?


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211484515
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala
 ---
@@ -180,7 +180,10 @@ object UnsafeProjection
 try {
   GenerateUnsafeProjection.generate(unsafeExprs, 
subexpressionEliminationEnabled)
 } catch {
-  case CodegenError(_) => 
InterpretedUnsafeProjection.createProjection(unsafeExprs)
+  case _: Exception =>
+// We should have already see error message in `CodeGenerator`
--- End diff --

Nit: Perhaps `// the error message is already logged in CodeGenerator`? Or 
`// We should have already seen the error message in CodeGenerator`?


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211485732
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -40,4 +55,13 @@ class CodeGeneratorWithInterpretedFallbackSuite extends 
SparkFunSuite with PlanT
   assert(obj.isInstanceOf[InterpretedUnsafeProjection])
 }
   }
+
+  test("fallback to the interpreter mode") {
+val input = Seq(IntegerType).zipWithIndex.map(x => 
BoundReference(x._2, x._1, true))
+val fallback = CodegenObjectFactoryMode.FALLBACK.toString
+withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key -> fallback) {
+  val obj = FailedCodegenProjection.createObject(input)
+  assert(obj.isInstanceOf[InterpretedUnsafeProjection])
--- End diff --

Should/could we also assert on the log message being printed out? Perhaps 
something similar to 
https://github.com/apache/spark/pull/22103/files#diff-cf187b40d98ff322d4bde4185701baa2R508
 ?

This is optional; the fact that we've gotten back an 
`InterpretedUnsafeProjection` is a good enough indicator to me.


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211485314
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -40,4 +55,13 @@ class CodeGeneratorWithInterpretedFallbackSuite extends 
SparkFunSuite with PlanT
   assert(obj.isInstanceOf[InterpretedUnsafeProjection])
 }
   }
+
+  test("fallback to the interpreter mode") {
+val input = Seq(IntegerType).zipWithIndex.map(x => 
BoundReference(x._2, x._1, true))
--- End diff --

Nit: if we only wanted a one-element sequence, would it be cleaner to just write `Seq(BoundReference(0, IntegerType, true))`?
Or, for those who prefer the cons-list syntax, `BoundReference(0, IntegerType, true) :: Nil`. I prefer the former but I don't have a strong opinion here.


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211485848
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala
 ---
@@ -63,7 +49,10 @@ abstract class CodeGeneratorWithInterpretedFallback[IN, 
OUT] {
 try {
   createCodeGeneratedObject(in)
 } catch {
-  case CodegenError(_) => createInterpretedObject(in)
+  case _: Exception =>
+// We should have already see error message in `CodeGenerator`
--- End diff --

Same as the comment below.


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211484616
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala
 ---
@@ -180,7 +180,10 @@ object UnsafeProjection
 try {
   GenerateUnsafeProjection.generate(unsafeExprs, 
subexpressionEliminationEnabled)
 } catch {
-  case CodegenError(_) => 
InterpretedUnsafeProjection.createProjection(unsafeExprs)
+  case _: Exception =>
+// We should have already see error message in `CodeGenerator`
+logWarning("Expr codegen error and falls back to interpreter mode")
--- End diff --

Nit: I prefer "falling back" instead of "falls back" here. e.g. "Expr 
codegen error, falling back to interpreter mode"


---




[GitHub] spark pull request #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exc...

2018-08-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22154#discussion_r211484374
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala
 ---
@@ -18,12 +18,27 @@
 package org.apache.spark.sql.catalyst.expressions
 
 import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeAndComment, 
CodeGenerator}
 import org.apache.spark.sql.catalyst.plans.PlanTestBase
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.{IntegerType, LongType}
 
 class CodeGeneratorWithInterpretedFallbackSuite extends SparkFunSuite with 
PlanTestBase {
 
+  object FailedCodegenProjection
+  extends CodeGeneratorWithInterpretedFallback[Seq[Expression], 
UnsafeProjection] {
+
+override protected def createCodeGeneratedObject(in: Seq[Expression]): 
UnsafeProjection = {
+  val invalidCode = new CodeAndComment("invalid code", Map.empty)
+  CodeGenerator.compile(invalidCode)
+  null.asInstanceOf[UnsafeProjection]
--- End diff --

Nit: do we need the cast here?


---




[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...

2018-08-14 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22103#discussion_r210169785
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging {
   try {
 val cf = new ClassFile(new ByteArrayInputStream(classBytes))
 val stats = cf.methodInfos.asScala.flatMap { method =>
-  method.getAttributes().filter(_.getClass.getName == 
codeAttr.getName).map { a =>
+  method.getAttributes().filter(_.getClass eq codeAttr).map { a =>
--- End diff --

> I worried whether the strictness may change behavior.

Right, I can tell. And as I've mentioned above, although my new check is stricter, it doesn't make the behavior any "worse" than before, because we're reflectively accessing the `code` field immediately afterwards via `codeAttrField.get(a)`, and that won't work unless the classes match exactly.

The old code before my change was actually too permissive: in the case of a class loader mismatch, the old check would let execution proceed to the reflective access site, which would then fail because reflection doesn't allow access from the wrong class.

This can be exemplified by the following pseudocode:
```
val c1 = new URLClassLoader(somePath).loadClass("Foo") // load a class
val c2 = new URLClassLoader(somePath).loadClass("Foo") // load another class with the same name from the same path, but a different class loader
val nameEq = c1.getName == c2.getName // true
val refEq = c1 eq c2                  // false
val f1 = c1.getDeclaredField("a")     // a Field of the Class loaded by the first loader
val o1 = c1.newInstance
val o2 = c2.newInstance
f1.get(o1) // okay
f1.get(o2) // fails with an exception: o2's class is a different Class object
```


---




[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...

2018-08-14 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22103#discussion_r210130579
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging {
   try {
 val cf = new ClassFile(new ByteArrayInputStream(classBytes))
 val stats = cf.methodInfos.asScala.flatMap { method =>
-  method.getAttributes().filter(_.getClass.getName == 
codeAttr.getName).map { a =>
+  method.getAttributes().filter(_.getClass eq codeAttr).map { a =>
--- End diff --

The `getName` call was fetching the class name and then doing a string comparison unnecessarily. I'm just changing it while touching the code around it.
The JVM guarantees that a `(defining class loader, full class name)` pair is unique at runtime, so in this case the `java.lang.Class` instance is guaranteed to be unique and a reference equality check is fast and sufficient.
There's no worry about cross-class-loader issues here: if there were any, the code that follows wouldn't work anyway.


---




[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...

2018-08-14 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/22103#discussion_r210129972
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1385,9 +1386,15 @@ object CodeGenerator extends Logging {
   try {
 val cf = new ClassFile(new ByteArrayInputStream(classBytes))
 val stats = cf.methodInfos.asScala.flatMap { method =>
-  method.getAttributes().filter(_.getClass.getName == 
codeAttr.getName).map { a =>
+  method.getAttributes().filter(_.getClass eq codeAttr).map { a =>
 val byteCodeSize = 
codeAttrField.get(a).asInstanceOf[Array[Byte]].length
 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(byteCodeSize)
+
+if (byteCodeSize > DEFAULT_JVM_HUGE_METHOD_LIMIT) {
+  logInfo("Generated method too long to be JIT compiled: " +
--- End diff --

I'd say either way is fine. They're different tenses and the nuances are 
slightly different.

"This story is too good to be true"
vs
"A story too good to be true"


---




[GitHub] spark pull request #22103: [SPARK-25113][SQL] Add logging to CodeGenerator w...

2018-08-14 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/22103

[SPARK-25113][SQL] Add logging to CodeGenerator when any generated method's 
bytecode size goes above HugeMethodLimit

## What changes were proposed in this pull request?

Add logging for all methods generated by the `CodeGenerator` whose bytecode size goes above 8000 bytes.
This is to help with gathering stats on how often Spark generates methods too big to be JIT'd. It covers all codegen scenarios, including whole-stage codegen and individual expression codegen, e.g. unsafe projection, mutable projection, etc.

## How was this patch tested?

Manually tested that the logging did happen when a generated method was above 8000 bytes.
Also added a new unit test case to `CodeGenerationSuite` to verify that the 
logging did happen.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark codegen-8k-logging

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22103


commit 640c9cd3b99d51f38c9b1a0c3f94bae676d11e51
Author: Kris Mok 
Date:   2018-08-14T10:50:01Z

SPARK-25113: Add logging to CodeGenerator when any generated method's 
bytecode size goes above HugeMethodLimit




---




[GitHub] spark issue #21951: [SPARK-24957][SQL][FOLLOW-UP] Clean the code for AVERAGE

2018-08-01 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21951
  
LGTM as well. Thanks a lot!


---




[GitHub] spark issue #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-03 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21699
  
This mostly looks good, but I'd like to ask a few things first:

1. The new overloaded `pivot()` that takes a `Column` only exists as 
`pivot(Column, Seq[Any])`. Were you planning to add a new overload for each 
existing `String` version, e.g. `pivot(Column)` and `pivot(Column, 
java.util.List[Any])` (see the sketch after this list)?
2. Since you're adding the `Column` version(s) to address accessing nested 
columns, would it be nice to highlight that capability in the doc example? (Yes 
you've already included that in test case examples so that's already good)
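For clarity, the full set of `Column` overloads being asked about would look something like this (assumed shape only, mirroring the existing `String`-based API; not the actual source):
```scala
import org.apache.spark.sql.{Column, RelationalGroupedDataset}

trait PivotColumnOverloads {
  def pivot(pivotColumn: Column): RelationalGroupedDataset
  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
  def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset
}
```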


---




[GitHub] spark pull request #21643: [SPARK-24659][SQL] GenericArrayData.equals should...

2018-06-26 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21643#discussion_r198370477
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ComplexDataSuite.scala
 ---
@@ -104,4 +104,40 @@ class ComplexDataSuite extends SparkFunSuite {
 // The copied data should not be changed externally.
 assert(copied.getStruct(0, 1).getUTF8String(0).toString == "a")
   }
+
+  test("SPARK-24659: GenericArrayData.equals should respect element type 
differences") {
+import scala.reflect.ClassTag
--- End diff --

Thanks for your suggestion! I'm used to making one-off imports inside a 
function when an import is only used within that function, so that its scope is 
as narrow as possible without getting in the way.
Are there any Spark coding style guidelines that suggest otherwise? If so, 
I'll follow the guideline and always import at the beginning of the file.
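For example (a toy illustration of the style in question):
```scala
def newIntArray(length: Int): Array[Int] = {
  import scala.reflect.ClassTag   // visible only inside this method
  implicitly[ClassTag[Int]].newArray(length)
}
```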


---




[GitHub] spark pull request #21643: [SPARK-24659][SQL] GenericArrayData.equals should...

2018-06-26 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21643#discussion_r198268691
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ComplexDataSuite.scala
 ---
@@ -104,4 +104,38 @@ class ComplexDataSuite extends SparkFunSuite {
 // The copied data should not be changed externally.
 assert(copied.getStruct(0, 1).getUTF8String(0).toString == "a")
   }
+
+  test("SPARK-24659: GenericArrayData.equals should respect element type 
differences") {
+import scala.reflect.ClassTag
+
+// Expected positive cases
+def arraysShouldEqual[T: ClassTag](element: T*): Unit = {
+  val array1 = new GenericArrayData(Array[T](element: _*))
+  val array2 = new GenericArrayData(Array[T](element: _*))
+  assert(array1.equals(array2))
+}
+arraysShouldEqual(true, false)// Boolean
+arraysShouldEqual(0.toByte, 123.toByte, (-123).toByte)// Byte
+arraysShouldEqual(0.toShort, 123.toShort, (-256).toShort) // Short
+arraysShouldEqual(0, 123, -65536) // Int
+arraysShouldEqual(0L, 123L, -65536L)  // Long
--- End diff --

That's a good one. I can do that (and NaNs/Infinity for floating point 
types too)
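For what it's worth, here is why the NaN cases are worth covering (a quick illustration, not the committed test):
```scala
// Boxed equals() treats NaN as equal to itself, unlike the primitive comparison,
// so element-wise equals() keeps arrays containing NaN comparable.
java.lang.Double.valueOf(Double.NaN).equals(java.lang.Double.valueOf(Double.NaN)) // true
Double.NaN == Double.NaN // false
```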


---




[GitHub] spark issue #21643: [SPARK-24659][SQL] GenericArrayData.equals should respec...

2018-06-26 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21643
  
Thanks for your comments, @MaxGekk @maropu @cloud-fan , I've tweaked the 
unit test case to address your comments. Please check it out again.


---




[GitHub] spark issue #21643: [SPARK-24659][SQL] GenericArrayData.equals should respec...

2018-06-26 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21643
  
https://github.com/apache/spark/pull/21643#pullrequestreview-131977512
For all practical purposes, no, this change shouldn't break any use cases 
that we support; there could be use cases that we didn't intend to support that 
might break.

In Spark SQL, SQL/DataFrame/Dataset operations are all type-checked, and 
the analyzer will strictly reject operations that involve incompatible types.
Array types in Spark SQL are neither covariant nor contravariant, so e.g. 
`array<int>` and `array<long>` would be considered incompatible, and the 
analyzer would have rejected an equality comparison between them in the first 
place.

In all other cases, `GenericArrayData` is a Spark SQL internal type that 
shouldn't be used on the outside, and I don't think we ever intended to support 
direct access of this type from the outside.


---




[GitHub] spark pull request #21643: [SPARK-24659][SQL] GenericArrayData.equals should...

2018-06-26 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21643

[SPARK-24659][SQL] GenericArrayData.equals should respect element type 
differences

## What changes were proposed in this pull request?

Fix `GenericArrayData.equals` so that it respects the actual types of the 
elements.
e.g. an instance that represents an `array<int>` and another instance that 
represents an `array<long>` should be considered incompatible, and thus 
`equals` should return false.

`GenericArrayData` doesn't keep any schema information by itself, and 
rather relies on the Java objects referenced by its `array` field's elements to 
keep track of their own object types. So, the most straightforward way to 
respect their types is to call `equals` on the elements, instead of using 
Scala's `==` operator, which can have semantics that are not always desirable:
```
new java.lang.Integer(123) == new java.lang.Long(123L) // true in Scala
new java.lang.Integer(123).equals(new java.lang.Long(123L)) // false in Scala
```
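A rough sketch of the element-wise comparison this amounts to (illustration only, not the actual patch):
```scala
def elementsEqual(a: Array[Any], b: Array[Any]): Boolean =
  a.length == b.length && a.indices.forall { i =>
    (a(i), b(i)) match {
      case (null, null) => true
      case (null, _) | (_, null) => false
      case (x, y) => x.equals(y)  // respects runtime types, unlike Scala's == on boxed numbers
    }
  }
```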

## How was this patch tested?

Added unit test in `ComplexDataSuite`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark 
fix-genericarraydata-equals

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21643.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21643


commit d91b44accbe40b9879cda259912e3ca38759d716
Author: Kris Mok 
Date:   2018-06-26T11:02:25Z

SPARK-24659: GenericArrayData.equals should respect element type differences




---




[GitHub] spark issue #21367: [SPARK-24321][SQL] Extract common code from Divide/Remai...

2018-05-22 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21367
  
Updated PR, addressed @cloud-fan and @viirya 's comments.


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-22 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189785968
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeBlockSuite.scala
 ---
@@ -23,12 +23,20 @@ import org.apache.spark.sql.types.{BooleanType, 
IntegerType}
 
 class CodeBlockSuite extends SparkFunSuite {
 
-  test("Block can interpolate string and ExprValue inputs") {
+  test("Block interpolates string and ExprValue inputs") {
 val isNull = JavaCode.isNullVariable("expr1_isNull")
-val code = code"boolean ${isNull} = 
${JavaCode.defaultLiteral(BooleanType)};"
+val stringLiteral = "false"
+val code = code"boolean $isNull = $stringLiteral;"
 assert(code.toString == "boolean expr1_isNull = false;")
   }
 
+  test("Literals are folded into string code parts instead of block 
inputs") {
--- End diff --

Great, I like it!


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-22 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189786155
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -167,9 +170,40 @@ object Block {
   case other => throw new IllegalArgumentException(
 s"Can not interpolate ${other.getClass.getName} into code 
block.")
 }
-CodeBlock(sc.parts, args)
+
+val (codeParts, blockInputs) = foldLiteralArgs(sc.parts, args)
+CodeBlock(codeParts, blockInputs)
+  }
+}
+  }
+
+  // Folds eagerly the literal args into the code parts.
+  private def foldLiteralArgs(parts: Seq[String], args: Seq[Any]): 
(Seq[String], Seq[Any]) = {
+val codeParts = ArrayBuffer.empty[String]
+val blockInputs = ArrayBuffer.empty[Any]
+
+val strings = parts.iterator
+val inputs = args.iterator
+val buf = new StringBuilder(Block.CODE_BLOCK_BUFFER_LENGTH)
+
+buf append strings.next
+while (strings.hasNext) {
--- End diff --

Looks good


---




[GitHub] spark pull request #21367: [SPARK-24321][SQL] Extract common code from Divid...

2018-05-21 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21367#discussion_r189723394
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
 ---
@@ -297,13 +279,39 @@ case class Divide(left: Expression, right: 
Expression) extends BinaryArithmetic
   if (${eval1.isNull}) {
 ${ev.isNull} = true;
   } else {
-${ev.value} = $divide;
+${ev.value} = $operation;
   }
 }""")
 }
   }
 }
 
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "expr1 _FUNC_ expr2 - Returns `expr1`/`expr2`. It always 
performs floating point division.",
+  examples = """
+Examples:
+  > SELECT 3 _FUNC_ 2;
+   1.5
+  > SELECT 2L _FUNC_ 2L;
+   1.0
+  """)
+// scalastyle:on line.size.limit
+case class Divide(left: Expression, right: Expression) extends DivModLike {
+
+  override def inputType: AbstractDataType = TypeCollection(DoubleType, 
DecimalType)
+
+  override def symbol: String = "/"
+  override def decimalMethod: String = "$div"
+  override def nullable: Boolean = true
--- End diff --

Oops! Thanks for catching it!
Addressed in an update.


---




[GitHub] spark pull request #21367: [SPARK-24321][SQL] Extract common code from Divid...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21367#discussion_r189404152
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
 ---
@@ -220,30 +220,12 @@ case class Multiply(left: Expression, right: 
Expression) extends BinaryArithmeti
   protected override def nullSafeEval(input1: Any, input2: Any): Any = 
numeric.times(input1, input2)
 }
 
-// scalastyle:off line.size.limit
-@ExpressionDescription(
-  usage = "expr1 _FUNC_ expr2 - Returns `expr1`/`expr2`. It always 
performs floating point division.",
-  examples = """
-Examples:
-  > SELECT 3 _FUNC_ 2;
-   1.5
-  > SELECT 2L _FUNC_ 2L;
-   1.0
-  """)
-// scalastyle:on line.size.limit
-case class Divide(left: Expression, right: Expression) extends 
BinaryArithmetic {
-
-  override def inputType: AbstractDataType = TypeCollection(DoubleType, 
DecimalType)
+// Common base trait for Divide and Remainder, since these two classes are 
almost identical
+trait DivModLike extends BinaryArithmetic {
 
-  override def symbol: String = "/"
-  override def decimalMethod: String = "$div"
   override def nullable: Boolean = true
 
-  private lazy val div: (Any, Any) => Any = dataType match {
-case ft: FractionalType => 
ft.fractional.asInstanceOf[Fractional[Any]].div
-  }
-
-  override def eval(input: InternalRow): Any = {
+  protected def evalHelper(input: InternalRow, op: => (Any, Any) => Any): Any = {
--- End diff --

Note: `op` is declared as a call-by-name argument to retain the `lazy val` 
semantics of the original code in `Divide.div`.
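A tiny standalone illustration of the by-name behaviour being relied on (not the actual `DivModLike` code):
```scala
def evalWithGuard(divisorIsZero: Boolean, op: => (Any, Any) => Any): Any = {
  if (divisorIsZero) null  // op is never evaluated on this path
  else op(6, 3)            // only forced when the operands are actually usable
}
```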


---




[GitHub] spark pull request #21367: [SPARK-24321][SQL] Extract common code from Divid...

2018-05-18 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21367

[SPARK-24321][SQL] Extract common code from Divide/Remainder to a base trait

## What changes were proposed in this pull request?

Extract common code from `Divide`/`Remainder` to a new base trait, 
`DivModLike`.

Further refactoring to make `Pmod` work with `DivModLike` is to be done as 
a separate task.

## How was this patch tested?

Existing tests in `ArithmeticExpressionSuite` covers the functionality.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark catalyst-divmod

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21367


commit 026531c70bfb7e06f8586010fc48a8666ff4df04
Author: Kris Mok <kris.mok@...>
Date:   2018-05-18T21:50:50Z

SPARK-24321: Extract common code from Divide/Remainder to a base trait




---




[GitHub] spark issue #21106: [SPARK-23711][SQL] Add fallback generator for UnsafeProj...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21106
  
Should this PR wait for https://github.com/apache/spark/pull/21299 so that 
accessing confs on executors can be made simpler?


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189229566
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeBlockSuite.scala
 ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions.codegen.Block._
+import org.apache.spark.sql.types.{BooleanType, IntegerType}
+
+class CodeBlockSuite extends SparkFunSuite {
+
+  test("Block can interpolate string and ExprValue inputs") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val code = code"boolean ${isNull} = 
${JavaCode.defaultLiteral(BooleanType)};"
+assert(code.toString == "boolean expr1_isNull = false;")
+  }
+
+  test("Block.stripMargin") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val value = JavaCode.variable("expr1", IntegerType)
+val code1 =
+  code"""
+   |boolean $isNull = false;
+   |int $value = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin
+val expected =
+  s"""
+|boolean expr1_isNull = false;
+|int expr1 = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin.trim
+assert(code1.toString == expected)
+
+val code2 =
+  code"""
+   >boolean $isNull = false;
+   >int $value = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin('>')
+assert(code2.toString == expected)
+  }
+
+  test("Block can capture input expr values") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val value = JavaCode.variable("expr1", IntegerType)
+val code =
+  code"""
+   |boolean $isNull = false;
+   |int $value = -1;
+  """.stripMargin
+val exprValues = code.exprValues
+assert(exprValues.size == 2)
+assert(exprValues === Set(value, isNull))
+  }
+
+  test("concatenate blocks") {
+val isNull1 = JavaCode.isNullVariable("expr1_isNull")
+val value1 = JavaCode.variable("expr1", IntegerType)
+val isNull2 = JavaCode.isNullVariable("expr2_isNull")
+val value2 = JavaCode.variable("expr2", IntegerType)
+val literal = JavaCode.literal("100", IntegerType)
+
+val code =
+  code"""
+   |boolean $isNull1 = false;
+   |int $value1 = -1;""".stripMargin +
+  code"""
+   |boolean $isNull2 = true;
+   |int $value2 = $literal;""".stripMargin
+
+val expected =
+  """
+   |boolean expr1_isNull = false;
+   |int expr1 = -1;
+   |boolean expr2_isNull = true;
+   |int expr2 = 100;""".stripMargin.trim
+
+assert(code.toString == expected)
+
+val exprValues = code.exprValues
+assert(exprValues.size == 5)
+assert(exprValues === Set(isNull1, value1, isNull2, value2, literal))
+  }
+
+  test("Throws exception when interpolating unexcepted object in code 
block") {
+val obj = TestClass(100)
+val e = intercept[IllegalArgumentException] {
+  code"$obj"
+}
+assert(e.getMessage().contains(s"Can not interpolate 
${obj.getClass.getName}"))
+  }
+
+  test("replace expr values in code block") {
+val statement = JavaCode.expression("1 + 1", IntegerType)
+val isNull = JavaCode.isNullV

[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189225184
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -114,6 +114,115 @@ object JavaCode {
   }
 }
 
+/**
+ * A trait representing a block of java code.
+ */
+trait Block extends JavaCode {
+
+  // The expressions to be evaluated inside this block.
+  def exprValues: Set[ExprValue]
+
+  // Returns java code string for this code block.
+  override def toString: String = _marginChar match {
+case Some(c) => code.stripMargin(c).trim
+case _ => code.trim
+  }
+
+  def length: Int = toString.length
+
+  // The leading prefix that should be stripped from each line.
+  // By default we strip blanks or control characters followed by '|' from 
the line.
+  var _marginChar: Option[Char] = Some('|')
+
+  def stripMargin(c: Char): this.type = {
+_marginChar = Some(c)
+this
+  }
+
+  def stripMargin: this.type = {
+_marginChar = Some('|')
+this
+  }
+
+  // Concatenates this block with other block.
+  def + (other: Block): Block
+}
+
+object Block {
+
+  val CODE_BLOCK_BUFFER_LENGTH: Int = 512
+
+  implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks)
+
+  implicit class BlockHelper(val sc: StringContext) extends AnyVal {
+def code(args: Any*): Block = {
+  sc.checkLengths(args)
+  if (sc.parts.length == 0) {
+EmptyBlock
+  } else {
+args.foreach {
+  case _: ExprValue =>
+  case _: Int | _: Long | _: Float | _: Double | _: String =>
+  case _: Block =>
+  case other => throw new IllegalArgumentException(
+s"Can not interpolate ${other.getClass.getName} into code 
block.")
+}
+CodeBlock(sc.parts, args)
--- End diff --

I'm wondering: does it make sense to eagerly fold the literal args into the 
`parts`, instead of carrying them over into `CodeBlock`? That's similar to what 
you're already doing with `CodeBlock.exprValues`, but just done a bit more 
eagerly.

Basically, the only things that are safe to manipulate are the structured 
ones like `ExprValue` and `Block`; none of the others are safe to change 
because we don't know what's in them. So there's no need to even carry them as 
separate arguments into `CodeBlock`.
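To sketch what I mean (standalone toy code, assuming only "structured" args need to survive; not the actual `Block` implementation):
```scala
def foldLiteralArgs(parts: Seq[String], args: Seq[Any])
    (isStructured: Any => Boolean): (Seq[String], Seq[Any]) = {
  val outParts = scala.collection.mutable.ArrayBuffer(parts.head)
  val outArgs  = scala.collection.mutable.ArrayBuffer.empty[Any]
  parts.tail.zip(args).foreach { case (nextPart, arg) =>
    if (isStructured(arg)) {
      outArgs += arg
      outParts += nextPart
    } else {
      // literal: fold its string form and the following part into the previous part
      outParts(outParts.length - 1) += arg.toString + nextPart
    }
  }
  (outParts.toSeq, outArgs.toSeq)
}
```
e.g. with `parts = Seq("boolean ", " = ", ";")` and `args = Seq(isNull, "false")`, the folded result would be `(Seq("boolean ", " = false;"), Seq(isNull))`.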


---




[GitHub] spark pull request #21193: [SPARK-24121][SQL] Add API for handling expressio...

2018-05-18 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21193#discussion_r189229229
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeBlockSuite.scala
 ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions.codegen.Block._
+import org.apache.spark.sql.types.{BooleanType, IntegerType}
+
+class CodeBlockSuite extends SparkFunSuite {
+
+  test("Block can interpolate string and ExprValue inputs") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val code = code"boolean ${isNull} = 
${JavaCode.defaultLiteral(BooleanType)};"
+assert(code.toString == "boolean expr1_isNull = false;")
+  }
+
+  test("Block.stripMargin") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val value = JavaCode.variable("expr1", IntegerType)
+val code1 =
+  code"""
+   |boolean $isNull = false;
+   |int $value = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin
+val expected =
+  s"""
+|boolean expr1_isNull = false;
+|int expr1 = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin.trim
+assert(code1.toString == expected)
+
+val code2 =
+  code"""
+   >boolean $isNull = false;
+   >int $value = 
${JavaCode.defaultLiteral(IntegerType)};""".stripMargin('>')
+assert(code2.toString == expected)
+  }
+
+  test("Block can capture input expr values") {
+val isNull = JavaCode.isNullVariable("expr1_isNull")
+val value = JavaCode.variable("expr1", IntegerType)
+val code =
+  code"""
+   |boolean $isNull = false;
+   |int $value = -1;
+  """.stripMargin
+val exprValues = code.exprValues
+assert(exprValues.size == 2)
+assert(exprValues === Set(value, isNull))
+  }
+
+  test("concatenate blocks") {
+val isNull1 = JavaCode.isNullVariable("expr1_isNull")
+val value1 = JavaCode.variable("expr1", IntegerType)
+val isNull2 = JavaCode.isNullVariable("expr2_isNull")
+val value2 = JavaCode.variable("expr2", IntegerType)
+val literal = JavaCode.literal("100", IntegerType)
+
+val code =
+  code"""
+   |boolean $isNull1 = false;
+   |int $value1 = -1;""".stripMargin +
+  code"""
+   |boolean $isNull2 = true;
+   |int $value2 = $literal;""".stripMargin
+
+val expected =
+  """
+   |boolean expr1_isNull = false;
+   |int expr1 = -1;
+   |boolean expr2_isNull = true;
+   |int expr2 = 100;""".stripMargin.trim
+
+assert(code.toString == expected)
+
+val exprValues = code.exprValues
+assert(exprValues.size == 5)
+assert(exprValues === Set(isNull1, value1, isNull2, value2, literal))
+  }
+
+  test("Throws exception when interpolating unexcepted object in code 
block") {
+val obj = TestClass(100)
+val e = intercept[IllegalArgumentException] {
+  code"$obj"
+}
+assert(e.getMessage().contains(s"Can not interpolate 
${obj.getClass.getName}"))
+  }
+
+  test("replace expr values in code block") {
+val statement = JavaCode.expression("1 + 1", IntegerType)
--- End diff --

Nit: can we rename this `statement` variable to something like `expr` or 
`someExpr`? It's an expression, not a statement.


---




[GitHub] spark issue #21094: [SPARK-24007][SQL] EqualNullSafe for FloatType and Doubl...

2018-04-18 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21094
  
LGTM
Wow, that's a hideous bug...


---




[GitHub] spark issue #20772: [SPARK-23628][SQL] calculateParamLength should not retur...

2018-04-12 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20772
  
Just wanna leave a note here that the Janino side of the bug has been fixed:
https://github.com/janino-compiler/janino/pull/46#issuecomment-380799160

https://github.com/janino-compiler/janino/commit/a373ac9bb6d28d51dccb7421ea046d73809d7ce2
With this fix, Janino will correctly throw an exception upon invalid 
parameter size instead of silently allowing it to pass compilation (which leads 
to ClassFormatError later when the generated class is loaded by the JVM).


---




[GitHub] spark issue #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars...

2018-04-12 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21039
  
I just checked the same test in Build 4695, which still has this change, 
and the test passed:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4695/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/
Something weird must have happened when running this test in Build 4694...

re:
> shall we just don't do the nulling out? It wouldn't help the GC a lot.
@cloud-fan I can certainly do that -- just mark the field `@transient` and remove the 
nulling code.
But what I'm concerned about is that the test failure was exposing some 
other problem that we might not have anticipated before.

The whole-stage codegen logic in most physical operators assumes that 
codegen happens on a single thread. As such, it might use instance fields on 
the operator to pass state between the `doProduce` and `doConsume` methods, as 
is the case with `bufVars` in `HashAggregateExec`.
Now, imagine a scenario where I take a DataFrame `df`:
```scala
val plan = df.queryExecution.executedPlan

// then on Thread1
plan.execute // triggers whole-stage codegen

// at around the same time, on Thread2
plan.execute // also triggers whole-stage codegen
```
Now we're going to be performing whole-stage codegen on 2 different 
threads, at around the same time, on the exact same plan (if there are 
`HashAggregateExec`s involved, they're gonna be the same instances).
In this case, the two threads might accidentally "cross-talk" because of 
the way whole-stage codegen uses instance fields to pass state between 
`doProduce` and `doConsume`. It's basically "pure luck" that both threads 
finish the codegen correctly (despite one thread might be accidentally using 
the state from (overwritten by) another thread).

If this theory holds, nulling out the state introduces a "leak" of the 
"cross-talk", so it's now possible to see an NPE if the timing is just right. 
But it's fairly scary already even without the nulling...

Anyway, I'll resend this PR with only the `@transient` part of the change, 
since that won't cause any regressions for sure.


---




[GitHub] spark issue #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars...

2018-04-11 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/21039
  
Thanks for reverting it for me. The test failure was definitely related to 
the explicit nulling from this PR, but I can't see how that's possible yet.

First of all, in the build that first introduced my change, build 4693, 
this particular test was passing:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4693/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/

The build that failed was the one immediately after that.

Second, the stack trace from the failure indicates that 
`doProduceWithoutKeys` is indeed on the stack, and its frame is at a point before 
the line where I null out `bufVars`, so I can't see how `bufVars` can be null in 
`doConsumeWithoutKeys`.

Stack trace:
```
java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithoutKeys(HashAggregateExec.scala:274)
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsume(HashAggregateExec.scala:171)
  at 
org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:209)
  at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:180)
  at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
  at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
  at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:182)
...
  at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
  at 
org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:35)
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithoutKeys(HashAggregateExec.scala:237)
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:163)
```

The relevant line in `doConsumeWithoutKeys` is:

https://github.com/apache/spark/blob/75a183071c4ed2e407c930edfdf721779662b3ee/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L274
It's reading `bufVars`.

The relevant line in `doProduceWithoutKeys` is:

https://github.com/apache/spark/blob/75a183071c4ed2e407c930edfdf721779662b3ee/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L237
It's calling `child.produce()`, and that's before the nulling at line 241.


---




[GitHub] spark pull request #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec....

2018-04-11 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21039#discussion_r180721843
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -236,6 +236,8 @@ case class HashAggregateExec(
  | }
""".stripMargin)
 
+bufVars = null  // explicitly null this field out to allow the 
referent to be GC'd sooner
--- End diff --

The workflow of whole-stage codegen ensures that `doConsumeWithoutKeys` can 
only be on the call stack when `doProduceWithoutKeys` is also on the call 
stack; the liveness of the former is strictly a subset of the latter.
That's because for a plan tree that looks like:
```
+- A
  +- B
+- C
```
The whole-stage codegen system (mostly) works like:
```
A.produce
 |--> B.produce
 | |--> C.produce
 | | |--> B.consume
 | | | |--> A.consume
 | | | ||
 | | | |<---o
 | | |<o
 | |<o
 |<o
 o
```


---




[GitHub] spark pull request #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec....

2018-04-11 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21039

[SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars as transient

## What changes were proposed in this pull request?

Mark `HashAggregateExec.bufVars` as transient to keep it from being serialized.
Also manually null out this field at the end of `doProduceWithoutKeys()` to 
shorten its lifecycle, because it'll no longer be used after that.
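Roughly, the shape of the change (a simplified sketch with a `Seq[String]` standing in for the real `Seq[ExprCode]`):
```scala
class HashAggLikeExec extends Serializable {
  // codegen-only state: not useful on executors, so keep it out of serialization
  @transient private var bufVars: Seq[String] = _

  def doProduceWithoutKeys(): String = {
    bufVars = Seq("agg_bufIsNull_0", "agg_bufValue_0")
    val produced = s"// initialize ${bufVars.mkString(", ")}"
    bufVars = null  // no longer needed once the produce-side code has been emitted
    produced
  }
}
```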

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark codegen-improve

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21039.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21039


commit 0b5a075b9de40d36379551720d7e0e07ad6deb40
Author: Kris Mok <kris.mok@...>
Date:   2018-04-11T09:35:31Z

SPARK-23960: Mark HashAggregateExec.bufVars as transient




---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180598651
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala
 ---
@@ -352,12 +346,10 @@ case class IsNotNull(child: Expression) extends 
UnaryExpression with Predicate {
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 val eval = child.genCode(ctx)
-val value = if (eval.isNull == TrueLiteral) {
-  FalseLiteral
-} else if (eval.isNull == FalseLiteral) {
-  TrueLiteral
-} else {
-  StatementValue(s"(!(${eval.isNull}))", 
CodeGenerator.javaType(dataType))
+val value = eval.isNull match {
+  case TrueLiteral => FalseLiteral
+  case FalseLiteral => TrueLiteral
+  case v => JavaCode.isNullExpression(s"!($v)")
--- End diff --

+1


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180575894
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
 ---
@@ -107,7 +110,7 @@ object GenerateSafeProjection extends 
CodeGenerator[Seq[Expression], Projection]
   final ArrayData $output = new $arrayClass($values);
 """
 
-ExprCode(code, FalseLiteral, VariableValue(output, "ArrayData"))
+ExprCode(code, FalseLiteral, VariableValue(output, classOf[ArrayData]))
--- End diff --

Use `JavaCode.variable(name, javaClass)` here?


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180583517
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import java.lang.{Boolean => JBool}
+
+import scala.language.{existentials, implicitConversions}
+
+import org.apache.spark.sql.types.{BooleanType, DataType}
+
+/**
++ * Trait representing an opaque fragments of java code.
++ */
+trait JavaCode {
+  def code: String
+  override def toString: String = code
+}
+
+/**
+ * Utility functions for creating [[JavaCode]] fragments.
+ */
+object JavaCode {
+  /**
+   * Create a java literal.
+   */
+  def literal(v: String, dataType: DataType): LiteralValue = dataType 
match {
+case BooleanType if v == "true" => TrueLiteral
+case BooleanType if v == "false" => FalseLiteral
+case _ => new LiteralValue(v, CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a default literal. This is null for reference types, false for 
boolean types and
+   * -1 for other primitive types.
+   */
+  def defaultLiteral(dataType: DataType): LiteralValue = {
+new LiteralValue(
+  CodeGenerator.defaultValue(dataType, typedNull = true),
+  CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a local java variable.
+   */
+  def variable(name: String, dataType: DataType): VariableValue = {
+VariableValue(name, CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a local isNull variable.
+   */
+  def isNullVariable(name: String): VariableValue = variable(name, 
BooleanType)
+
+  /**
+   * Create a global java variable.
+   */
+  def global(name: String, dataType: DataType): GlobalValue = {
+GlobalValue(name, CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a global isNull variable.
+   */
+  def isNullGlobal(name: String): GlobalValue = global(name, BooleanType)
+
+  /**
+   * Create an expression fragment.
+   */
+  def expression(code: String, dataType: DataType): SimpleExprValue = {
+SimpleExprValue(code, CodeGenerator.javaClass(dataType))
+  }
+
+  /**
+   * Create a isNull expression fragment.
+   */
+  def isNullExpression(code: String): SimpleExprValue = {
+expression(code, BooleanType)
+  }
+}
+
+/**
+ * A typed java fragment that must be a valid java expression.
+ */
+trait ExprValue extends JavaCode {
+  def javaType: Class[_]
+  def isPrimitive: Boolean = javaType.isPrimitive
--- End diff --

This PR got rid of the `canDirectAccess` property from @viirya 's base 
change. Should we add it back?
(and implement that property in concrete classes with an `override def` 
instead of an `override val`)
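i.e. something along these lines (a sketch of the suggestion with stand-in names):
```scala
trait ExprValueLike {
  def canDirectAccess: Boolean = false  // declared as a def on the trait
}

case class SimpleExprValueLike(code: String) extends ExprValueLike {
  override def canDirectAccess: Boolean = true  // overridden with a def, not a val
}
```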


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180575717
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
 ---
@@ -292,8 +292,7 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
  |$writeExpressions
""".stripMargin
 // `rowWriter` is declared as a class field, so we can access it 
directly in methods.
-ExprCode(code, FalseLiteral, StatementValue(s"$rowWriter.getRow()", 
"UnsafeRow",
-  canDirectAccess = true))
+ExprCode(code, FalseLiteral, SimpleExprValue(s"$rowWriter.getRow()", 
classOf[UnsafeRow]))
--- End diff --

Add a `JavaCode.expression(expr: String, javaClass: Class[_])` and use it 
here?


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180575964
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
 ---
@@ -76,7 +78,7 @@ object GenerateSafeProjection extends 
CodeGenerator[Seq[Expression], Projection]
  |final InternalRow $output = new $rowClass($values);
""".stripMargin
 
-ExprCode(code, FalseLiteral, VariableValue(output, "InternalRow"))
+ExprCode(code, FalseLiteral, VariableValue(output, 
classOf[InternalRow]))
--- End diff --

Use `JavaCode.variable(name, javaClass)` here?


---




[GitHub] spark pull request #21026: [SPARK-23951][SQL] Use actual java class instead ...

2018-04-10 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/21026#discussion_r180575803
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
 ---
@@ -128,7 +131,7 @@ object GenerateSafeProjection extends 
CodeGenerator[Seq[Expression], Projection]
   final MapData $output = new $mapClass(${keyConverter.value}, 
${valueConverter.value});
 """
 
-ExprCode(code, FalseLiteral, VariableValue(output, "MapData"))
+ExprCode(code, FalseLiteral, VariableValue(output, classOf[MapData]))
--- End diff --

Use `JavaCode.variable(name, javaClass)` here?


---




[GitHub] spark pull request #21016: [SPARK-23947][SQL] Add hashUTF8String convenience...

2018-04-09 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21016

[SPARK-23947][SQL] Add hashUTF8String convenience method to hasher classes

## What changes were proposed in this pull request?

Add `hashUTF8String()` to the hasher classes to allow Spark SQL codegen to 
generate cleaner code for hashing `UTF8String`s. No change in behavior 
otherwise.

Although with the introduction of SPARK-10399, the code size for hashing 
`UTF8String` is already smaller, it's still good to extract a separate function 
in the hasher classes so that the generated code can stay clean.
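The convenience method would look roughly like this (a sketch; the hasher and `UTF8String` accessor names are from memory and may not match the sources exactly):
```scala
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

def hashUTF8String(s: UTF8String, seed: Int): Int =
  Murmur3_x86_32.hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
```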

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark hashutf8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21016.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21016


commit ead91a453fb6a3f92255d14d37c1d4a6beac22fd
Author: Kris Mok <kris.mok@...>
Date:   2018-04-09T22:30:03Z

SPARK-23947 - Add hashUTF8String convenience method to hasher classes




---




[GitHub] spark pull request #19082: [SPARK-21870][SQL] Split aggregation code into sm...

2018-04-05 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/19082#discussion_r179367236
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -257,6 +259,78 @@ case class HashAggregateExec(
  """.stripMargin
   }
 
+  // Extracts all the input variable references for a given `aggExpr`. 
This result will be used
+  // to split aggregation into small functions.
+  private def getInputVariableReferences(
+  context: CodegenContext,
+  aggregateExpression: Expression,
+  subExprs: Map[Expression, SubExprEliminationState]): Set[(String, 
String)] = {
+// `argSet` collects all the pairs of variable names and their types, 
the first in the pair is
+// a type name and the second is a variable name.
+val argSet = mutable.Set[(String, String)]()
+val stack = mutable.Stack[Expression](aggregateExpression)
+while (stack.nonEmpty) {
+  stack.pop() match {
+case e if subExprs.contains(e) =>
+  val exprCode = subExprs(e)
+  if (CodegenContext.isJavaIdentifier(exprCode.value)) {
--- End diff --

Once we have @viirya 's https://github.com/apache/spark/pull/20043 merged 
we won't need the ugly `CodegenContext.isJavaIdentifier` hack any more >_<|||


---




[GitHub] spark issue #20779: [SPARK-23598][SQL] Make methods in BufferedRowIterator p...

2018-03-24 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20779
  
Sorry for the late comment. This PR itself is LGTM. I'd just like to make 
some side comments on why this bug is happening.

Janino uses a peculiar encoding to implement the bridge methods that extend 
accessibility from an inner class to members of its enclosing class. Here we're 
actually hitting a bug in Janino where it fails to create bridge methods on the 
enclosing class (`GeneratedClass...`) for `protected` members inherited 
from a base class (`BufferedRowIterator`).
I've seen this bug in Janino before, and I plan to fix it in Janino soon. 
Once it's fixed in Janino, we can safely use `protected` members such as 
`append` and `stopEarly` in nested classes within `GeneratedClass...` again. 
Would anybody be interested in switching these methods back to `protected` once 
it's fixed in Janino and Spark bumps the Janino dependency to that new version?

Thanks!


---




[GitHub] spark issue #20870: [SPARK-23760][SQL] CodegenContext.withSubExprElimination...

2018-03-21 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20870
  
jenkins retest this please


---




[GitHub] spark pull request #20870: [SPARK-23760][SQL] CodegenContext.withSubExprElim...

2018-03-20 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20870#discussion_r175986986
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -942,7 +940,7 @@ class CodegenContext {
   def subexpressionEliminationForWholeStageCodegen(expressions: 
Seq[Expression]): SubExprCodes = {
 // Create a clear EquivalentExpressions and SubExprEliminationState 
mapping
 val equivalentExpressions: EquivalentExpressions = new 
EquivalentExpressions
-val subExprEliminationExprs = mutable.HashMap.empty[Expression, 
SubExprEliminationState]
+val localSubExprEliminationExprs = mutable.HashMap.empty[Expression, 
SubExprEliminationState]
--- End diff --

This renaming isn't necessary for the fix per-se, but I'd like to piggyback 
it on this change so that it's clearer that we're not interfering with the 
current CSE state of this CodegenContext here.


---




[GitHub] spark pull request #20870: [SPARK-23760][SQL]: CodegenContext.withSubExprEli...

2018-03-20 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/20870

[SPARK-23760][SQL]: CodegenContext.withSubExprEliminationExprs should 
save/restore CSE state correctly

## What changes were proposed in this pull request?

Fixed `CodegenContext.withSubExprEliminationExprs()` so that it 
saves/restores CSE state correctly.

## How was this patch tested?

Added new unit test to verify that the old CSE state is indeed saved and 
restored around the `withSubExprEliminationExprs()` call. Manually verified 
that this test fails without this patch.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark codegen-subexpr-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20870.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20870






---




[GitHub] spark issue #20753: [SPARK-23582][SQL] StaticInvoke should support interpret...

2018-03-07 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20753
  
@hvanhovell @kiszk Yes, old-school JDK reflection has used bytecode 
generation to speed up invocation since the JDK 1.4 era. But the advantage of a fully 
optimized `MethodHandle` over reflective `Method.invoke` is that the former 
performs access checks at `MethodHandle` creation time, whereas the latter 
has to perform access checks on every invocation (and those aren't 
covered by the generated bytecode). So theoretically the `MethodHandle` version 
can be faster.

But note that even in OpenJDK8, the HotSpot VM is still exploring ways to 
optimize `MethodHandle` performance, and it's somewhat picky about the shape of 
the `MethodHandle`s, so it's not like we will always get better performance 
just by sticking it in place of reflection. It'll need some tuning / 
trial-and-error. We don't have to get it right in the first cut.
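A small standalone illustration of the difference (sketch only; as noted, whether it's actually faster in Spark's generated code still needs measurement):
```scala
import java.lang.invoke.{MethodHandles, MethodType}

// Access is checked once, when the handle is looked up.
val parseInt = MethodHandles.lookup().findStatic(
  classOf[java.lang.Integer], "parseInt",
  MethodType.methodType(classOf[Int], classOf[String]))
val viaHandle = parseInt.invokeWithArguments("123") // 123

// Core reflection re-checks accessibility on each invoke() call (unless setAccessible is used).
val method = classOf[java.lang.Integer].getMethod("parseInt", classOf[String])
val viaReflection = method.invoke(null, "123") // 123
```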


---




[GitHub] spark pull request #20753: [SPARK-23582][SQL] StaticInvoke should support in...

2018-03-07 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20753#discussion_r173075682
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -133,8 +134,21 @@ case class StaticInvoke(
   override def nullable: Boolean = needNullCheck || returnNullable
   override def children: Seq[Expression] = arguments
 
-  override def eval(input: InternalRow): Any =
-throw new UnsupportedOperationException("Only code-generated 
evaluation is supported.")
+  override def eval(input: InternalRow): Any = {
+if (staticObject == null) {
--- End diff --

We don't need this check, for sure. See line 132,
```
val objectName = staticObject.getName.stripSuffix("$")
```
^^ this runs as part of the constructor and will throw an NPE if 
`staticObject` is `null`, so null-checking it here is redundant.


---




[GitHub] spark pull request #20757: [SPARK-23595][SQL] ValidateExternalType should su...

2018-03-07 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20757#discussion_r173010942
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -1408,11 +1409,37 @@ case class ValidateExternalType(child: Expression, 
expected: DataType)
 
   override def dataType: DataType = 
RowEncoder.externalDataTypeForInput(expected)
 
-  override def eval(input: InternalRow): Any =
-throw new UnsupportedOperationException("Only code-generated 
evaluation is supported")
-
   private val errMsg = s" is not a valid external type for schema of 
${expected.simpleString}"
 
+  private lazy val checkType = expected match {
+case _: DecimalType =>
+  (value: Any) => {
+Seq(classOf[java.math.BigDecimal], classOf[scala.math.BigDecimal], 
classOf[Decimal])
+  .exists { x => value.getClass.isAssignableFrom(x) }
+  }
+case _: ArrayType =>
+  (value: Any) => {
+value.getClass.isAssignableFrom(classOf[Seq[_]]) || 
value.getClass.isArray
--- End diff --

For those curious:

In HotSpot, the straightforward interpreter/C1 implementation of 
`xxx.getClass().isArray()` path is actually something like:
```
// for getClass()
klazz = xxx._klass; // read the hidden klass pointer field from the object 
header
clazz = klazz._java_mirror; // read the java.lang.Class reference from the 
Klass
// for clazz.isArray(): go through JNI and call the native 
JVM_IsArrayClass() inside HotSpot
klazz1 = clazz->_klass;
result = klazz1->oop_is_array();
```
So a JNI native method call is involved and that's not really fast. But C2 
will optimize this into something similar to:
```
klazz = xxx._klass;
result = inlined klazz->oop_is_array();
```
So that's pretty fast. There's no need to load the `java.lang.Class` (aka the 
"Java mirror") reference anymore.

In the `xxx.isInstanceOf[Seq[_]]` case, the interpreted version again goes 
through a JNI native method call, whereas the C1/C2 versions inline fast-path 
logic that does a quick comparison against a per-type cache. That fast-path 
check has overhead similar to the C2-compiled `isArray()`.


---




[GitHub] spark pull request #20757: [SPARK-23595][SQL] ValidateExternalType should su...

2018-03-07 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20757#discussion_r173006281
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -1408,11 +1409,37 @@ case class ValidateExternalType(child: Expression, 
expected: DataType)
 
   override def dataType: DataType = 
RowEncoder.externalDataTypeForInput(expected)
 
-  override def eval(input: InternalRow): Any =
-throw new UnsupportedOperationException("Only code-generated 
evaluation is supported")
-
   private val errMsg = s" is not a valid external type for schema of 
${expected.simpleString}"
 
+  private lazy val checkType = expected match {
+case _: DecimalType =>
+  (value: Any) => {
+Seq(classOf[java.math.BigDecimal], classOf[scala.math.BigDecimal], 
classOf[Decimal])
+  .exists { x => value.getClass.isAssignableFrom(x) }
+  }
+case _: ArrayType =>
+  (value: Any) => {
+value.getClass.isAssignableFrom(classOf[Seq[_]]) || 
value.getClass.isArray
--- End diff --

Hi guys, sorry I'm late.

In your new code you're doing:
```diff
+case _: ArrayType =>
+  (value: Any) => {
+value.getClass.isArray || value.isInstanceOf[Seq[_]]
+  }
```
which is good. `xxx.getClass().isAssignableFrom(some_class_literal)` in the 
old version of this PR is actually backwards; it should have been 
`some_class_literal.isAssignableFrom(xxx.getClass())`, e.g.
```
scala> classOf[String].isAssignableFrom(classOf[Object])
res0: Boolean = false

scala> classOf[Object].isAssignableFrom(classOf[String])
res1: Boolean = true
```
and the latter is semantically the same as `xxx.isInstanceOf[some_class]`. 
`isInstanceOf[]` is guaranteed to be at least as fast as 
`some_class_literal.isAssignableFrom(xxx.getClass())`, and in general 
`isInstanceOf[]` is faster.
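
For example (illustration only), with the `Seq` case from this PR:
```scala
val v: Any = Vector(1, 2, 3)
v.isInstanceOf[Seq[_]]                          // true
classOf[Seq[_]].isAssignableFrom(v.getClass)    // true: the equivalent class-object form
v.getClass.isAssignableFrom(classOf[Seq[_]])    // false: the backwards form from the old code
```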

`xxx.getClass().isArray()` has a fixed overhead, whereas `isInstanceOf[]` 
can have a fast path slightly faster than `isArray` and a slow path that 
can be much slower than `isArray`. So putting the `isArray` check first in your 
new code makes more sense to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20700: [SPARK-23546][SQL] Refactor stateless methods/values in ...

2018-03-02 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20700
  
Aha! Thanks @kiszk san for working on this! I really wanted the stateless 
methods to be extracted so that I can use more utils without having to pass 
around a `CodegenContext` for no good reason.

I was actually working on the exact same thing last week but got pulled 
away by other tasks, so my patch is still WIP.
For those curious, here's my WIP patch: 
https://github.com/rednaxelafx/apache-spark/commit/249fb93c5cacc35d53e6b0f972b139d27ef5d720

I was hoping we can cover the ones I extracted here: 
https://github.com/rednaxelafx/apache-spark/commit/249fb93c5cacc35d53e6b0f972b139d27ef5d720#diff-8bcc5aea39c73d4bf38aef6f6951d42cR1201
Your PR and my version extracted mostly the same fields/methods, which is 
great; I'll just help get this PR in instead of having to send mine out.

The approach is different, though. I'd like to have a discussion on the 
tradeoffs you had in mind when you picked your approach.
I initially did something similar, which is to extract the stateless methods 
into companion object methods and change the original ones to delegate to the 
companion versions. But the old instance methods were a bad smell to me (passing 
in an instance when it shouldn't need the instance), so since we're at it, I 
thought why not completely remove that bad smell once and for all.
So I deleted the original instance methods, and yes, that made the diff huge 
because all use sites of those fields/methods had to be touched to switch to 
the new version.
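
To make the trade-off concrete, here's a rough sketch of the two styles (the 
method body is simplified for illustration and is not the actual `CodeGenerator` 
code):
```scala
object CodeGenerator {
  // stateless util extracted to the companion object
  def boxedType(javaType: String): String = javaType match {
    case "int"  => "Integer"
    case "long" => "Long"
    case other  => other
  }
}

class CodegenContext {
  // style A (this PR): keep a thin delegating instance method, so existing call sites compile as-is
  def boxedType(javaType: String): String = CodeGenerator.boxedType(javaType)
  // style B (my WIP): no instance method at all; every call site switches to CodeGenerator.boxedType
}
```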

@kiszk WDYT? Were you intending to minimize the diff while still being 
able to access the stateless util methods as standalone functions?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...

2018-02-22 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20645
  
A quick question: after this change, `extraJavaOptions` is still able to 
cleanly override whatever's set in `defaultJavaOptions`, is that right?

^^ Just making sure I understood the intent correctly and not the other way 
around. There may well be an administrative need for the opposite, i.e. to 
"force options", e.g. forcing `-XX:-DisableExplicitGC` so that NIO direct memory 
doesn't get into trouble, but that'd be out of scope for this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20043
  
Hi @viirya, this PR is definitely in the right direction. I'd like to 
review this PR as well and hopefully help polish it in for post-2.3. Thanks for 
your work!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
@hvanhovell Thanks a lot! You're absolutely right. Updated the PR 
accordingly. It's passing tests locally and hopefully it'll pass 
the Jenkins tests as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
Ah, I see: there are more places that statically reference some 
variable even though, dynamically, those variables would always be null. I'll 
update the PR later to fix those places as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
cc @cloud-fan @hvanhovell 
Note: this is for master and branch-2.3 post 2.3.0 release.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
So I was able to find quite a few cases where the `DUMMY` placeholder 
caught uses of the `value` field outside of appropriate null-checked regions. 
I'll check the individual cases and then update this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal

2018-02-16 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20626
  
jenkins retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20626: [SPARK-23447][SQL] Cleanup codegen template for L...

2018-02-15 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/20626

[SPARK-23447][SQL] Cleanup codegen template for Literal

## What changes were proposed in this pull request?

Cleaned up the codegen templates for `Literal`s, to make sure that the 
`ExprCode` returned from `Literal.doGenCode()` has:
1. an empty `code` field;
2. an `isNull` field of either literal `true` or `false`;
3. a `value` field that is just a simple literal/constant.

Before this PR, there were a couple of paths that would return a non-trivial 
`code`, and all of them are actually unnecessary. The `NaN` and `Infinity` 
constants for `double` and `float` can be accessed through directly available 
constants, so there's no need to add a reference object for them.

Also took the opportunity to add a new util method to make it easier to create 
`ExprCode` for inline-able non-null values.
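
In other words, the post-cleanup invariant looks roughly like this (a sketch 
only; the exact helper name in the PR may differ):
```scala
// what Literal.doGenCode() should hand back for e.g. Literal(1.5, DoubleType):
ExprCode(code = "", isNull = "false", value = "1.5D")
// nothing to evaluate, statically non-null, and a value string that can be inlined verbatim
```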

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark codegen-literal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20626.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20626


commit 68edf0f3463daed3bb7042becb333788b22b23b0
Author: Kris Mok <kris.mok@...>
Date:   2018-02-16T07:44:43Z

Cleanup codegen templates for Literals: make sure the `code` field is empty 
and the `value` field is a simple literal.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20612: [SPARK-23424][SQL]Add codegenStageId in comment

2018-02-15 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20612
  
LGTM. Let's wait for one more LGTM from @gatorsmile / @cloud-fan .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20599: [SPARK-23407][SQL] add a config to try to inline ...

2018-02-14 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20599#discussion_r168101849
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -675,6 +675,16 @@ object SQLConf {
   "disable logging or -1 to apply no limit.")
 .createWithDefault(1000)
 
+  val CODEGEN_TRY_INLINE_ALL_STATES =
+buildConf("spark.sql.codegen.tryInlineAllStates")
+.internal()
+.doc("When adding mutable states during code generation, whether or 
not we should try to " +
+  "inline all the states. If this config is false, we only try to 
inline primitive stats, " +
+  "so that primitive states are more likely to be inlined. Set this 
config to true to make " +
+  "the behavior same as Spark 2.2.")
--- End diff --

Also watch out for a typo `s/stats/states/`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId i...

2018-02-12 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20419#discussion_r167775177
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1226,14 +1226,21 @@ class CodegenContext {
 
   /**
* Register a comment and return the corresponding place holder
+   *
+   * @param placeholderId a string for a place holder
--- End diff --

Nit: can we rephrase this ScalaDoc a bit, maybe like:
```scala
/**
 * ...
 * @param placeholderId an optionally specified identifier for the comment's placeholder.
 *                      The caller should make sure this identifier is unique within the
 *                      compilation unit. If this argument is not specified, a fresh
 *                      identifier will be automatically created and used as the placeholder.
 * ...
 */
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-02-01 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
@kiszk For this specific kind of usage, I don't think using a hardcoded 
stable ID will be a problem.
The comment we're talking about is the kind that can only appear once in a 
single whole-stage codegen unit -- there's only one `GeneratedIterator` class 
with the `processNext()` method as the main entry point. Providing such a 
mechanism allows codegen developers to mark important information without 
risking any effect on the codegen cache behavior.

For safety, we can add a runtime check. Let's assume this new method is 
called `ctx.registerCommentWithId()`; inside it we can implement collision 
detection very similar to `ctx.freshName()`'s, except that on a collision, 
instead of producing a new name with an integer suffix appended, we either log 
a warning message or throw an exception. That way the developer would have an 
easy way to catch duplicate IDs.
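
Roughly something like the following (just a sketch; `registerCommentWithId` is 
only the name proposed here, and I'm assuming we reuse the context's existing 
placeholder-to-comment map):
```scala
def registerCommentWithId(text: String, placeholderId: String): String = {
  // unlike freshName(), don't silently rename on a collision -- surface it instead
  if (placeHolderToComments.contains(placeholderId)) {
    throw new IllegalStateException(s"Duplicate comment placeholder id: $placeholderId")
    // (or log a warning and fall back to a freshName()-style suffix)
  }
  placeHolderToComments += placeholderId -> s"/* $text */"
  s"/*$placeholderId*/"
}
```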


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
I gave it a bit more thought. Here's an alternative proposal: instead of 
using a "force comment" mechanism in the current form (which still gets a 
`ctx.freshName("c")` that the caller has no control over), why don't we provide 
an alternative API that also forces the comment, but allows the caller to 
specify an ID for the placeholder comment that is unique within the 
`CodegenContext`?

In this case, instead of using `/*wholestagecodegen_c*/` as the 
placeholder, which can be brittle if someone adds new calls to 
`ctx.registerComment()` in `WholeStageCodegenExec`, we can call 
`ctx.registerComment(comment, placeholderId = "wsc_codegenStageId")` and be 
sure that it ends up creating a placeholder comment of 
`/*wsc_codegenStageId*/`, stable across all whole-stage generated main classes, 
which would never affect the codegen cache behavior.

@kiszk WDYT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-31 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
I did the experiment that I asked about at 
https://github.com/apache/spark/pull/20419#issuecomment-361359825 , and 
verified that under the current implementation, this PR will not affect the 
codegen cache hit behavior.

But the way it's working is a bit brittle. The `body` field of the 
`CodeAndComment` returned from `WholeStageCodegenExec.doCodeGen()` looks like 
the following, when `spark.sql.codegen.comments=false && 
spark.sql.codegen.useIdInClassName=false`:
```java
/*wholestagecodegen_c*/
final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
```
Notice the `/*wholestagecodegen_c*/` placeholder comment. It's the artifact 
produced by the `ctx.registerComment()` call, where `wholestagecodegen_` is a 
prefix automatically derived from the calling class, and `c` is from a 
`ctx.freshName("c")`, subject to an integer suffix when names collide.

The way it's used right now, `WholeStageCodegenExec` makes only a few calls to 
`ctx.registerComment()` and all of them happen in a deterministic order, which 
is why the identifier embedded in the placeholder comment is the same no matter 
which codegen stage it is.

But suppose someone doesn't know about this and accidentally adds code in 
`WholeStageCodegenExec` that conditionally calls `ctx.registerComment()`; then 
this placeholder comment is NOT guaranteed to be the same, and that will affect 
the codegen cache hit behavior.

I'd say functionality-wise this PR is good to go; maybe we want to add a 
comment somewhere in `WholeStageCodegenExec` to note the subtlety, or maybe 
think about an alternative sometime in the future.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-30 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
@gatorsmile Not directly. The `CodeAndComment` case class is just a 
"container"; it doesn't control what gets into the `body` field. When we 
force-embed a comment, it leaves a placeholder comment in the generated code, 
which is part of the `body` (the actual comment contents are in the `comment` 
map). I was just curious whether any reviewers know off the top of their heads 
whether or not this placeholder may affect equality.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20434: [SPARK-23267] [SQL] Increase spark.sql.codegen.hu...

2018-01-30 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20434#discussion_r164687283
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -660,12 +660,10 @@ object SQLConf {
   val WHOLESTAGE_HUGE_METHOD_LIMIT = 
buildConf("spark.sql.codegen.hugeMethodLimit")
 .internal()
 .doc("The maximum bytecode size of a single compiled Java function 
generated by whole-stage " +
-  "codegen. When the compiled function exceeds this threshold, " +
-  "the whole-stage codegen is deactivated for this subtree of the 
current query plan. " +
-  s"The default value is 
${CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT} and " +
-  "this is a limit in the OpenJDK JVM implementation.")
--- End diff --

The 8000 byte limit is a HotSpot-specific thing, but the 64KB limit is 
imposed by the Java Class File format, as a part of the JVM spec.

We may want to wordsmith a bit here to explain that:
1. 65535 is the largest bytecode size possible for a valid Java method; 
setting the default value to 65535 effectively turns the limit off for 
whole-stage codegen;
2. For those who wish to turn this limit on when running on HotSpot, it 
may be preferable to set the value to 
`CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT` to match HotSpot's implementation.

I don't have a good concrete suggestion as to how to concisely express 
these two points in the doc string, though.
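One rough attempt (wording only, not a concrete proposal) that tries to capture 
both points:
```scala
.doc("The maximum bytecode size of a single compiled Java function generated by whole-stage " +
  "codegen. When the compiled function exceeds this threshold, whole-stage codegen is " +
  "deactivated for this subtree of the current query plan. The default value of 65535 is " +
  "the largest bytecode size a valid Java method can have, which effectively disables the " +
  "limit for whole-stage codegen. On HotSpot, it may be preferable to set the value to " +
  s"${CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT} to match HotSpot's implementation limit.")
```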


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-29 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
@kiszk SGTM and LGTM. Let's ship it!

One more question on the side: with the `forceComment = true`, are we fully 
sure that won't affect the equality of `CodeAndComment`?
The whole point of this PR is to have a way to embed the ID into the 
non-executable code of the generated code so that it won't affect the codegen 
cache hit, right?

Could you please post an example of either the generated code or the 
metrics, e.g. a `SortMergeJoin` on two identical `spark.range(3)`s, and confirm 
that even when the two `range()`s are codegen'd into different 
`codegenStageId`s, with `spark.sql.codegen.useIdInClassName` turned off, 
the two stages would still hit the codegen cache? Basically, to verify that the 
example I gave in 
https://github.com/apache/spark/pull/20224#issuecomment-357091842 still hits 
the codegen cache when `spark.sql.codegen.useIdInClassName=false`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-28 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
LGTM, and +1 on @viirya's idea. I like it better for the comment to be on 
top of the class declaration instead of inside it, but I'm okay either way if 
others have strong opinions otherwise. As long as the comment line is right next 
to the class declaration line, that's good enough for easy pattern matching.

BTW, to save a few characters in the generated code, maybe we don't need 
to generate this comment when `spark.sql.codegen.useIdInClassName` is 
turned on, since the `codegenStageId` is in the class name already.
But for the ease of having a unified way to access the ID from the comment, 
I like having this comment always there. I guess @kiszk and @viirya are on the 
same side on this one, for always having the comment, right? In that case I'd 
agree; let's get this in :-)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r164019336
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -471,19 +531,21 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 }
""", inlineToOuterClass = true)
 
+val className = generatedClassName()
+
 val source = s"""
   public Object generate(Object[] references) {
-return new GeneratedIterator(references);
+return new $className(references);
   }
 
   ${ctx.registerComment(s"""Codegend pipeline 
for\n${child.treeString.trim}""")}
-  final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
+  final class $className extends 
${classOf[BufferedRowIterator].getName} {
--- End diff --

Yes, I like that idea too. Please ping me on the follow-up PR as well. 
Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


