[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84371/ Test PASSed.
[GitHub] spark pull request #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Ref...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19758#discussion_r154284858 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreeSplitUtilsSuite.scala --- @@ -0,0 +1,280 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.tree.{CategoricalSplit, ContinuousSplit, Split} +import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.mllib.tree.impurity.{Entropy, Impurity} +import org.apache.spark.mllib.tree.model.ImpurityStats +import org.apache.spark.mllib.util.MLlibTestSparkContext + +/** Suite exercising helper methods for making split decisions during decision tree training. */ +class TreeSplitUtilsSuite + extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + /** + * Get a DTStatsAggregator for sufficient stat collection/impurity calculation populated + * with the data from the specified training points. Assumes a feature index of 0 and that + * all training points have the same weights (1.0). + */ + private def getAggregator( + metadata: DecisionTreeMetadata, + values: Array[Int], + labels: Array[Double], + featureSplits: Array[Split]): DTStatsAggregator = { +// Create stats aggregator +val statsAggregator = new DTStatsAggregator(metadata, featureSubset = None) +// Update parent impurity stats +val featureIndex = 0 +val instanceWeights = Array.fill[Double](values.length)(1.0) +AggUpdateUtils.updateParentImpurity(statsAggregator, indices = values.indices.toArray, + from = 0, to = values.length, instanceWeights, labels) +// Update current aggregator's impurity stats +values.zip(labels).foreach { case (value: Int, label: Double) => + if (metadata.isUnordered(featureIndex)) { +AggUpdateUtils.updateUnorderedFeature(statsAggregator, value, label, + featureIndex = featureIndex, featureIndexIdx = 0, featureSplits, instanceWeight = 1.0) + } else { +AggUpdateUtils.updateOrderedFeature(statsAggregator, value, label, featureIndexIdx = 0, + instanceWeight = 1.0) + } +} +statsAggregator + } + + /** + * Check that left/right impurities match what we'd expect for a split. 
+ * @param labels Labels whose impurity information should be reflected in stats + * @param stats ImpurityStats object containing impurity info for the left/right sides of a split + */ + private def validateImpurityStats( + impurity: Impurity, + labels: Array[Double], + stats: ImpurityStats, + expectedLeftStats: Array[Double], + expectedRightStats: Array[Double]): Unit = { +// Compute impurity for our data points manually +val numClasses = (labels.max + 1).toInt +val fullImpurityStatsArray + = Array.tabulate[Double](numClasses)((label: Int) => labels.count(_ == label).toDouble) +val fullImpurity = Entropy.calculate(fullImpurityStatsArray, labels.length) +// Verify that impurity stats were computed correctly for split +assert(stats.impurityCalculator.stats === fullImpurityStatsArray) +assert(stats.impurity === fullImpurity) +assert(stats.leftImpurityCalculator.stats === expectedLeftStats) +assert(stats.rightImpurityCalculator.stats === expectedRightStats) +assert(stats.valid) + } + + /* * * * * * * * * * * Choosing Splits * * * * * * * * * * */ + + test("chooseSplit: choose correct type of split (continuous split)") { +// Construct (binned) continuous data +val labels = Array(0.0, 0.0, 1.0) +val values = Array(1, 2, 3) +val featureIndex = 0 +// Get an array of continuous splits corresponding to values in our binned data +val splits =
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84371 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84371/testReport)** for PR 19813 at commit [`aa3db2e`](https://github.com/apache/spark/commit/aa3db2edca66ab04ecb8fbd54750cbd46544eb1d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19840 I'm a little concerned about such a change: it could be misconfigured and introduce a discrepancy between the driver's Python and the executors' Python. At the very least we should honor the "spark.executorEnv.PYSPARK_PYTHON" configuration unless the `PYSPARK_PYTHON` executable shipped by the driver does not exist. (Just my two cents.) Ping @zjffdu, any thoughts?
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 Yes, you are right: we should use the same Python executables. But **same** should mean the same binary, not just the same path.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19857 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84373/ Test PASSed.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19857 Merged build finished. Test PASSed.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19857 **[Test build #84373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84373/testReport)** for PR 19857 at commit [`c6f2250`](https://github.com/apache/spark/commit/c6f225025a1ba002b6aa4ce83fb67dbe742395b1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764 @gczsjdy Added a simple example.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19840 Oh, I see. You're running in client mode, so `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python` is useless. So I guess the behavior is expected: the driver will honor the `PYSPARK_PYTHON` env and ship it to the executors, so the cluster will use the same Python executables. With your test above, `/path/to/python` is different for the driver and the executors; will that bring in issues? The driver uses `PYSPARK_PYTHON` and the executors use `spark.executorEnv.PYSPARK_PYTHON`, which point to different paths. Normally I think we don't need to set PYSPARK_PYTHON on the executor side. Please correct me if I'm wrong.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19857 **[Test build #84373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84373/testReport)** for PR 19857 at commit [`c6f2250`](https://github.com/apache/spark/commit/c6f225025a1ba002b6aa4ce83fb67dbe742395b1).
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test FAILed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84370/ Test FAILed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84370/testReport)** for PR 19813 at commit [`98850a6`](https://github.com/apache/spark/commit/98850a67c99b02b9ba98ec544f7d0d5671142716). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19824: [SPARK][STREAMING] Invoke onBatchCompletion() only when ...
Github user CodingCat commented on the issue: https://github.com/apache/spark/pull/19824 `What I want to say is that if a Job is failed, we should consider the Batch as not completed.` Isn't #16542 doing the same thing?
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19857 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84372/ Test FAILed.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19857 Merged build finished. Test FAILed.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19857 **[Test build #84372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84372/testReport)** for PR 19857 at commit [`980c8ec`](https://github.com/apache/spark/commit/980c8ec87ddbc9f938942e78bb4cfe9753722bd2). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19857 **[Test build #84372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84372/testReport)** for PR 19857 at commit [`980c8ec`](https://github.com/apache/spark/commit/980c8ec87ddbc9f938942e78bb4cfe9753722bd2).
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 (screenshot: https://user-images.githubusercontent.com/8326978/33471349-e570953e-d6a7-11e7-9fec-74963efe37d2.png) @jerryshao The envs are set correctly by YARN, but the `pythonExec` is generated in `PythonRDD` on the driver side and delivered to the executor side within a closure.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19857 @MrBago @jkbradley I think this PR needs to be reviewed and merged first, before reviewing #19627, because this PR changes some critical code paths.
[GitHub] spark pull request #19857: [SPARK-22667][ML] Fix model-specific optimization...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19857 [SPARK-22667][ML] Fix model-specific optimization support for ML tuning: Python API

## What changes were proposed in this pull request?

Python CrossValidator/TrainValidationSplit:
- With the base Estimator implemented in Scala/Java → convert the base Estimator to a Scala/Java object and call the JVM fit() (as in Weichen's comment).
- With the base Estimator implemented in Python → Python needs the same machinery for multi-model fitting and parallelism as Scala, and we can call directly into it.

New API added:
```
class Estimator:
    def parallelFit(self, dataset, paramMaps, threadPool, modelCallback):
```

**Note** This PR also fixes the `# TODO: persist average metrics as well` in CV/TVS, because the test suite needs to check the consistency of `avgMetrics`. If this needs backporting to an older Spark version, I can split it out into a separate PR.

## How was this patch tested?

Existing unit tests already cover each code path that needs testing.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark fix_model_spec_optim_py Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19857.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19857

commit 980c8ec87ddbc9f938942e78bb4cfe9753722bd2 Author: WeichenXu Date: 2017-11-30T10:08:55Z init pr
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19717 @vanzin @mridulm @jerryshao would love your comment.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19717 will do a pass on the latest over the weekend
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/19717 /cc @felixcheung
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19054 Merged build finished. Test PASSed.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19054 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84368/ Test PASSed.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19054 **[Test build #84368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84368/testReport)** for PR 19054 at commit [`69e288e`](https://github.com/apache/spark/commit/69e288efda8dbaab667f0e9ed6c7f2cb811f79b1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19840 I think in YARN we have several different ways to set `PYSPARK_PYTHON`; I guess your issue is which one should take priority? Can you please:
1. Define a consistent ordering for such envs (which one should take priority, spark.yarn.appMasterEnv.XXX or XXX), and document it.
2. Check that it works as expected for `spark.yarn.appMasterEnv.PYSPARK_PYTHON` and `spark.executorEnv.PYSPARK_PYTHON`.
I guess other envs will also suffer from this problem, am I right?
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84367/ Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84367/testReport)** for PR 19813 at commit [`0d358d6`](https://github.com/apache/spark/commit/0d358d635494199582aa6e38fdbeec0f6446c029). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #18324: [SPARK-21045][PYSPARK]Fixed executor blocked beca...
Github user dataknocker commented on a diff in the pull request: https://github.com/apache/spark/pull/18324#discussion_r154274837 --- Diff: python/pyspark/worker.py --- @@ -177,8 +180,11 @@ def process(): process() except Exception: try: +exc_info = traceback.format_exc() +if isinstance(exc_info, unicode): +exc_info = exc_info.encode('utf-8') --- End diff -- @HyukjinKwon is this PR just going to be left hanging?
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19856
> I think the log can't reflect the behavior of consumer connection,because consumer.create doesn't do any connect,it only construct a ZookeeperConsumerConnector instance

That's not true: `ZookeeperConsumerConnector` also connects to ZK during initialization, please see here (https://github.com/apache/kafka/blob/c9f930e3fe25e5675e64550c8b396356c5349ca7/core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala#L126).
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19856 Actually there's no issue here; IMHO your understanding of this log is slightly different from its original purpose.
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user liu-zhaokun commented on the issue: https://github.com/apache/spark/pull/19856 @jerryshao I think the log can't reflect the behavior of the consumer connection, because consumer.create doesn't do any connecting; it only constructs a ZookeeperConsumerConnector instance, so the right position is where zkclient connects to ZK. ReliableKafkaReceiver is not marked deprecated, so users will still use it. To avoid being misled by this log when a problem connecting to ZK occurs, I think we should fix it.
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r154272738 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -152,3 +153,62 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType)) } } + +/** + * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints. + * + * The optimization is applicable only to CROSS joins. For other join types, adding inferred join --- End diff -- It sounds promising.
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19856 I guess the original purpose of this log is to reflect the behavior of the consumer connection. It is not super necessary to make such a trivial change. Also, `ReliableKafkaReceiver` is not recommended any more, so let's keep it as it is.
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user liu-zhaokun commented on the issue: https://github.com/apache/spark/pull/19856 @srowen Please help merge this PR as it has passed all tests.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 I guess specifying `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python` overwrites spark.executorEnv.PYSPARK_PYTHON via [context.py#L156](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L156), which should use `setIfMissing` instead.
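For context, the suggestion above is that an explicitly configured executor Python should win over the driver's environment variable. The actual fix would live in PySpark's context.py; the sketch below only illustrates the intended `setIfMissing` semantics using the Scala SparkConf API, with object and method names that are purely illustrative:

```scala
import org.apache.spark.SparkConf

object PythonExecConfExample {
  def configuredConf(): SparkConf = {
    val conf = new SparkConf()
    // Respect an explicitly configured executor Python; only fall back to the
    // driver's PYSPARK_PYTHON environment variable when the key is unset.
    sys.env.get("PYSPARK_PYTHON").foreach { driverPython =>
      conf.setIfMissing("spark.executorEnv.PYSPARK_PYTHON", driverPython)
    }
    conf
  }
}
```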
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154270654 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column value falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column value falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { --- End diff -- just move this condition after the length check: ``` if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1) && (value == histogram.bins(i + 1).lo)) ```
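To make the suggestion concrete, here is a compilable sketch of `findLastBinForValue` with the reviewer's merged condition. The `Histogram`/`HistogramBin` stubs stand in for the catalyst statistics types (assumed shape: each bin carries its lower and upper bound), and the body of the boundary branch, which is truncated in the quote above, is assumed to simply advance `binId`:

```scala
// Stand-ins for the catalyst statistics types (an assumption for a
// self-contained sketch).
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

object HistogramExample {
  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    var binId = 0
    for (i <- 0 until histogram.bins.length) {
      if (value > histogram.bins(i).hi) {
        // increment binId to point to the next bin
        binId += 1
      }
      // The reviewer's suggestion: fold the length guard and both boundary
      // checks into a single condition instead of nesting them.
      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1) &&
          (value == histogram.bins(i + 1).lo)) {
        binId += 1
      }
    }
    binId
  }
}
```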
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84365/ Test PASSed.
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/19831 Yes, I saw some of these tables in my cluster, but the user did not manually modify this parameter:
```
# Detailed Table Information
Database              dw
Table                 prod
Owner                 bi
Created Time          Tue Nov 03 16:33:52 CST 2015
Last Access           Thu Jan 01 08:00:00 CST 1970
Created By            Spark 2.2 or prior
Type                  EXTERNAL
Provider              hive
Comment               Product list
Table Properties      [transient_lastDdlTime=1508260780, last_modified_time=1473154014, last_modified_by=bi]
Statistics            26596461123 bytes, 0 rows
Location              viewfs://cluster9/user/hive/warehouse/dw.db/prod
Serde Library         org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat           org.apache.hadoop.mapred.TextInputFormat
OutputFormat          org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties    [serialization.format=1]
Partition Provider    Catalog

Time taken: 1.241 seconds, Fetched 70 row(s)
```
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Merged build finished. Test PASSed.
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783 **[Test build #84365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84365/testReport)** for PR 19783 at commit [`6e6c49b`](https://github.com/apache/spark/commit/6e6c49bdfef0071f18b6e9b607b7fc1bf5b49ff8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19845 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84369/ Test PASSed.
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19845 **[Test build #84369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84369/testReport)** for PR 19845 at commit [`0c7e537`](https://github.com/apache/spark/commit/0c7e5378a491f01b14df23fe56ae50a0f52971ee). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19845 Merged build finished. Test PASSed.
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19631 LGTM.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84371/testReport)** for PR 19813 at commit [`aa3db2e`](https://github.com/apache/spark/commit/aa3db2edca66ab04ecb8fbd54750cbd46544eb1d).
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19845 **[Test build #84369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84369/testReport)** for PR 19845 at commit [`0c7e537`](https://github.com/apache/spark/commit/0c7e5378a491f01b14df23fe56ae50a0f52971ee).
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84370/testReport)** for PR 19813 at commit [`98850a6`](https://github.com/apache/spark/commit/98850a67c99b02b9ba98ec544f7d0d5671142716).
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154267815 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala --- @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.codegen + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions._ + +/** + * Defines APIs used in expression code generation. + */ +object ExpressionCodegen { + + /** + * Given an expression, returns the all necessary parameters to evaluate it, so the generated + * code of this expression can be split in a function. + * The 1st string in returned tuple is the parameter strings used to call the function. + * The 2nd string in returned tuple is the parameter strings used to declare the function. + * + * Returns `None` if it can't produce valid parameters. + * + * Params to include: + * 1. Evaluated columns referred by this, children or deferred expressions. + * 2. Rows referred by this, children or deferred expressions. + * 3. Eliminated subexpressions referred bu children expressions. + */ + def getExpressionInputParams( + ctx: CodegenContext, + expr: Expression): Option[(Seq[String], Seq[String])] = { +val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr) +val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars) + +val subExprs = getSubExprInChildren(ctx, expr) +val paramsFromSubExprs = getParamsForSubExprs(ctx, subExprs) + +val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr) +val paramsFromRows = inputRows.distinct.filter(_ != null).map { row => + (row, s"InternalRow $row") +} + +val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length +// Maximum allowed parameter number for Java's method descriptor. +if (paramsLength > 255) { + None +} else { + val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip + val callParams = allParams._1.distinct + val declParams = allParams._2.distinct + Some((callParams, declParams)) +} + } + + /** + * Returns the eliminated subexpressions in the children expressions. + */ + def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = { +expr.children.flatMap { child => + child.collect { +case e if ctx.subExprEliminationExprs.contains(e) => e + } +}.distinct + } + + /** + * Given the list of eliminated subexpressions used in the children expressions, returns the + * strings of funtion parameters. The first is the variable names used to call the function, + * the second is the parameters used to declare the function in generated code. 
+ */ + def getParamsForSubExprs( + ctx: CodegenContext, + subExprs: Seq[Expression]): Seq[(String, String)] = { +subExprs.flatMap { subExpr => + val argType = ctx.javaType(subExpr.dataType) + + val subExprState = ctx.subExprEliminationExprs(subExpr) + + if (!subExpr.nullable || subExprState.isNull == "true" || subExprState.isNull == "false") { +Seq((subExprState.value, s"$argType ${subExprState.value}")) + } else { +Seq((subExprState.value, s"$argType ${subExprState.value}"), + (subExprState.isNull, s"boolean ${subExprState.isNull}")) + } +}.distinct + } + + /** + * Retrieves previous input rows referred by children and deferred expressions. + */ + def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = { +expr.children.flatMap(getInputRows(ctx, _)).distinct + } + + /**
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154267761 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1028,12 +1053,18 @@ class CodegenContext { // 2. Less code. // Currently, we will do this for all non-leaf only expression trees (i.e. expr trees with // at least two nodes) as the cost of doing it is expected to be low. - addMutableState(JAVA_BOOLEAN, isNull, s"$isNull = false;") + if (expr.nullable) { +addMutableState(JAVA_BOOLEAN, isNull) + } addMutableState(javaType(expr.dataType), value, s"$value = ${defaultValue(expr.dataType)};") --- End diff -- Yeah, I think so.
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19823
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19823 Thanks! Merged to master/2.2
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154266910 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1028,12 +1053,18 @@ class CodegenContext { // 2. Less code. // Currently, we will do this for all non-leaf only expression trees (i.e. expr trees with // at least two nodes) as the cost of doing it is expected to be low. - addMutableState(JAVA_BOOLEAN, isNull, s"$isNull = false;") + if (expr.nullable) { +addMutableState(JAVA_BOOLEAN, isNull) + } addMutableState(javaType(expr.dataType), value, s"$value = ${defaultValue(expr.dataType)};") --- End diff -- Is this initialization `s"$value = ${defaultValue(expr.dataType)};"` necessary? I think `$value` will always be referred to after `$value` is defined by the assignment generated at line 1039.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154266575 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala --- @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.codegen + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions._ + +/** + * Defines APIs used in expression code generation. + */ +object ExpressionCodegen { + + /** + * Given an expression, returns the all necessary parameters to evaluate it, so the generated + * code of this expression can be split in a function. + * The 1st string in returned tuple is the parameter strings used to call the function. + * The 2nd string in returned tuple is the parameter strings used to declare the function. + * + * Returns `None` if it can't produce valid parameters. + * + * Params to include: + * 1. Evaluated columns referred by this, children or deferred expressions. + * 2. Rows referred by this, children or deferred expressions. + * 3. Eliminated subexpressions referred bu children expressions. + */ + def getExpressionInputParams( + ctx: CodegenContext, + expr: Expression): Option[(Seq[String], Seq[String])] = { +val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr) +val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars) + +val subExprs = getSubExprInChildren(ctx, expr) +val paramsFromSubExprs = getParamsForSubExprs(ctx, subExprs) + +val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr) +val paramsFromRows = inputRows.distinct.filter(_ != null).map { row => + (row, s"InternalRow $row") +} + +val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length +// Maximum allowed parameter number for Java's method descriptor. +if (paramsLength > 255) { + None +} else { + val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip + val callParams = allParams._1.distinct + val declParams = allParams._2.distinct + Some((callParams, declParams)) +} + } + + /** + * Returns the eliminated subexpressions in the children expressions. + */ + def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = { +expr.children.flatMap { child => + child.collect { +case e if ctx.subExprEliminationExprs.contains(e) => e + } +}.distinct + } + + /** + * Given the list of eliminated subexpressions used in the children expressions, returns the + * strings of funtion parameters. The first is the variable names used to call the function, + * the second is the parameters used to declare the function in generated code. 
+ */ + def getParamsForSubExprs( + ctx: CodegenContext, + subExprs: Seq[Expression]): Seq[(String, String)] = { +subExprs.flatMap { subExpr => + val argType = ctx.javaType(subExpr.dataType) + + val subExprState = ctx.subExprEliminationExprs(subExpr) + + if (!subExpr.nullable || subExprState.isNull == "true" || subExprState.isNull == "false") { +Seq((subExprState.value, s"$argType ${subExprState.value}")) + } else { +Seq((subExprState.value, s"$argType ${subExprState.value}"), + (subExprState.isNull, s"boolean ${subExprState.isNull}")) + } +}.distinct + } + + /** + * Retrieves previous input rows referred by children and deferred expressions. + */ + def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = { +expr.children.flatMap(getInputRows(ctx, _)).distinct + } + + /** +
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154266377 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala --- @@ -424,7 +426,9 @@ case class HiveScriptIOSchema ( recordReaderClass.map { klass => val instance = Utils.classForName(klass).newInstance().asInstanceOf[RecordReader] val props = new Properties() - props.putAll(outputSerdeProps.toMap.asJava) + // props.putAll(outputSerdeProps.toMap.asJava) + // see https://github.com/apache/kafka/pull/3647 --- End diff -- Ditto. I think it is better to remove the commented-out code and refer to https://github.com/scala/bug/issues/10418.
[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19845#discussion_r154266335 --- Diff: python/pyspark/ml/tests.py --- @@ -1837,6 +1837,29 @@ def test_read_images(self): self.assertEqual(ImageSchema.undefinedImageType, "Undefined") +class ImageReaderTest2(PySparkTestCase): + +@classmethod +def setUpClass(cls): +PySparkTestCase.setUpClass() +try: +cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf() +except py4j.protocol.Py4JError: +cls.tearDownClass() +raise unittest.SkipTest("Hive is not available") +except TypeError: +cls.tearDownClass() +raise unittest.SkipTest("Hive is not available") +cls.spark = HiveContext._createForTesting(cls.sc) + --- End diff -- Sure, that should be safer.
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154266308 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala --- @@ -412,7 +412,9 @@ case class HiveScriptIOSchema ( propsMap = propsMap + (serdeConstants.LIST_COLUMN_TYPES -> columnTypesNames) val properties = new Properties() -properties.putAll(propsMap.asJava) +// properties.putAll(propsMap.asJava) +// see https://github.com/apache/kafka/pull/3647 +propsMap.foreach{ case (k, v) => properties.put(k, v) } --- End diff -- And we should not leave commented-out code.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84364/ Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Merged build finished. Test FAILed.
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154266187 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -296,7 +296,8 @@ class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends L props.put("replica.socket.timeout.ms", "1500") props.put("delete.topic.enable", "true") props.put("offsets.topic.num.partitions", "1") -props.putAll(withBrokerProps.asJava) +// props.putAll(withBrokerProps.asJava) +withBrokerProps.foreach{ case (k, v) => props.put(k, v) } --- End diff -- https://github.com/scala/bug/issues/10418. I see.
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154266242 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala --- @@ -412,7 +412,9 @@ case class HiveScriptIOSchema ( propsMap = propsMap + (serdeConstants.LIST_COLUMN_TYPES -> columnTypesNames) val properties = new Properties() -properties.putAll(propsMap.asJava) +// properties.putAll(propsMap.asJava) +// see https://github.com/apache/kafka/pull/3647 +propsMap.foreach{ case (k, v) => properties.put(k, v) } --- End diff -- I think the direct reference should be https://github.com/scala/bug/issues/10418.
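For readers without the bug report handy: scala/bug#10418 describes how, when compiling against JDK 9+, `Properties` overrides `Hashtable`'s `putAll` with a slightly different generic signature, and scalac reports calls like `properties.putAll(propsMap.asJava)` as an ambiguous overloaded reference. Copying the entries one by one, as this PR does, sidesteps overload resolution entirely. A self-contained sketch of the pattern (object and method names here are illustrative):

```scala
import java.util.Properties

object PutAllWorkaroundExample {
  def toProperties(propsMap: Map[String, String]): Properties = {
    val properties = new Properties()
    // Avoids the JDK 9+ `putAll` overload ambiguity (scala/bug#10418) by
    // inserting each entry individually instead of converting the whole map.
    propsMap.foreach { case (k, v) => properties.put(k, v) }
    properties
  }
}
```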
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84364 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84364/testReport)** for PR 19811 at commit [`38fd079`](https://github.com/apache/spark/commit/38fd079c8257457eda741f808e0d00cedc6fdf02). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154265720 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -296,7 +296,8 @@ class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends L props.put("replica.socket.timeout.ms", "1500") props.put("delete.topic.enable", "true") props.put("offsets.topic.num.partitions", "1") -props.putAll(withBrokerProps.asJava) +// props.putAll(withBrokerProps.asJava) +withBrokerProps.foreach{ case (k, v) => props.put(k, v) } --- End diff -- Is this related?
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user sujith71955 commented on the issue: https://github.com/apache/spark/pull/19823 Thanks all for the review and guidance.
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19845 @HyukjinKwon Thanks. I forgot that Hive support is needed to test it. The added test looks good.
[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19845#discussion_r154262931 --- Diff: python/pyspark/ml/tests.py --- @@ -1837,6 +1837,29 @@ def test_read_images(self): self.assertEqual(ImageSchema.undefinedImageType, "Undefined") +class ImageReaderTest2(PySparkTestCase): + +@classmethod +def setUpClass(cls): +PySparkTestCase.setUpClass() +try: +cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf() +except py4j.protocol.Py4JError: +cls.tearDownClass() +raise unittest.SkipTest("Hive is not available") +except TypeError: +cls.tearDownClass() +raise unittest.SkipTest("Hive is not available") +cls.spark = HiveContext._createForTesting(cls.sc) + --- End diff -- Add a `tearDownClass` classmethod to stop `cls.spark`? I don't see `HiveContextSQLTests` closing it either; maybe we can fix that too.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19054 **[Test build #84368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84368/testReport)** for PR 19054 at commit [`69e288e`](https://github.com/apache/spark/commit/69e288efda8dbaab667f0e9ed6c7f2cb811f79b1).
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19631 Merged build finished. Test PASSed.
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19631 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84363/ Test PASSed.
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19631 **[Test build #84363 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84363/testReport)** for PR 19631 at commit [`c752453`](https://github.com/apache/spark/commit/c752453b2a379f301d52692bfc639bb631520069). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84367/testReport)** for PR 19813 at commit [`0d358d6`](https://github.com/apache/spark/commit/0d358d635494199582aa6e38fdbeec0f6446c029).
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 @mgaido91 Thanks for the review.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154261964

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns all the necessary parameters to evaluate it, so the generated
+   * code of this expression can be split into a function.
+   * The first seq in the returned tuple contains the parameter strings used to call the
+   * function; the second contains the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred to by this, children or deferred expressions.
+   * 2. Rows referred to by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred to by children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)

--- End diff --

Moved.
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19848 Merged build finished. Test PASSed.
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19848 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84362/
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19054

**[Test build #84366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84366/testReport)** for PR 19054 at commit [`b0db6aa`](https://github.com/apache/spark/commit/b0db6aafd3cb172062220aeeae087be9ad5b99e5).

* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19054 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84366/
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19848

**[Test build #84362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84362/testReport)** for PR 19848 at commit [`92f9180`](https://github.com/apache/spark/commit/92f9180b0fea71ff3ae4aa3049e04d6f1e3167be).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19054 Merged build finished. Test FAILed.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19054 **[Test build #84366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84366/testReport)** for PR 19054 at commit [`b0db6aa`](https://github.com/apache/spark/commit/b0db6aafd3cb172062220aeeae087be9ad5b99e5).
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 @ueshin Cluster mode is working, but client mode is not.
[GitHub] spark pull request #19850: [SPARK-22653] executorAddress registered in Coars...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19850
[GitHub] spark issue #19850: [SPARK-22653] executorAddress registered in CoarseGraine...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19850 good catch! merging to master!
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154257537

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -115,9 +116,35 @@ abstract class Expression extends TreeNode[Expression] {
     }
   }
 
+  /**
+   * Records the current input row and variables for this expression into the created `ExprCode`.
+   */
+  private def populateInputs(ctx: CodegenContext, eval: ExprCode): Unit = {
+    if (ctx.INPUT_ROW != null) {
+      eval.inputRow = ctx.INPUT_ROW
+    }
+    if (ctx.currentVars != null) {
+      val boundRefs = this.collect {
+        case b @ BoundReference(ordinal, _, _) if ctx.currentVars(ordinal) != null => (ordinal, b)
+      }.toMap
+
+      ctx.currentVars.zipWithIndex.filter(_._1 != null).foreach { case (currentVar, idx) =>

--- End diff --

ctx.INPUT_ROW can possibly be null.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154257472

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns all the necessary parameters to evaluate it, so the generated
+   * code of this expression can be split into a function.
+   * The first seq in the returned tuple contains the parameter strings used to call the
+   * function; the second contains the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred to by this, children or deferred expressions.
+   * 2. Rows referred to by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred to by children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val subExprs = getSubExprInChildren(ctx, expr)
+
+    val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+      (row, s"InternalRow $row")
+    }
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+    val paramsFromSubExprs = getParamsForSubExprs(ctx, subExprs)
+    val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length
+
+    // Maximum allowed parameter number for Java's method descriptor.
+    if (paramsLength > 255) {
+      None
+    } else {
+      val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip
+      val callParams = allParams._1.distinct
+      val declParams = allParams._2.distinct
+      Some((callParams, declParams))
+    }
+  }
+
+  /**
+   * Returns the eliminated subexpressions in the children expressions.
+   */
+  def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = {
+    expr.children.flatMap { child =>
+      child.collect {
+        case e if ctx.subExprEliminationExprs.contains(e) => e
+      }
+    }.distinct
+  }
+
+  /**
+   * Given the list of eliminated subexpressions used in the children expressions, returns the
+   * strings of function parameters. The first is the variable names used to call the function,
+   * the second is the parameters used to declare the function in generated code.
+   */
+  def getParamsForSubExprs(
+      ctx: CodegenContext,
+      subExprs: Seq[Expression]): Seq[(String, String)] = {
+    subExprs.flatMap { subExpr =>
+      val argType = ctx.javaType(subExpr.dataType)
+
+      val subExprState = ctx.subExprEliminationExprs(subExpr)
+      (subExprState.value, subExprState.isNull)
+
+      if (!subExpr.nullable || subExprState.isNull == "true" || subExprState.isNull == "false") {
+        Seq((subExprState.value, s"$argType ${subExprState.value}"))
+      } else {
+        Seq((subExprState.value, s"$argType ${subExprState.value}"),
+          (subExprState.isNull, s"boolean ${subExprState.isNull}"))
+      }
+    }.distinct
+  }
+
+  /**
+   * Retrieves previous input rows referred to by children and deferred expressions.
+   */
+  def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = {
+
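For context on the `paramsLength > 255` check in the hunk above: the JVM caps a method descriptor at 255 parameter slots, and `long`/`double` arguments occupy two slots each. A minimal sketch of that slot arithmetic (the helper name and standalone setting are illustrative, not Spark's actual `getParamLength`):

```scala
object ParamSlotSketch {
  // Count JVM parameter slots for a list of Java type names; long and
  // double take two slots each, everything else (including references) one.
  def paramSlots(javaTypes: Seq[String]): Int =
    javaTypes.map(t => if (t == "long" || t == "double") 2 else 1).sum

  def main(args: Array[String]): Unit = {
    val types = Seq("InternalRow", "long", "boolean", "double")
    // 1 + 2 + 1 + 2 = 6 slots; splitting is abandoned once the total exceeds 255.
    println(paramSlots(types))
  }
}
```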
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r154256590

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -152,3 +153,62 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints.
+ *
+ * The optimization is applicable only to CROSS joins. For other join types, adding inferred join

--- End diff --

Can we apply this optimization to all joins after https://github.com/apache/spark/pull/19054?
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783 **[Test build #84365 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84365/testReport)** for PR 19783 at commit [`6e6c49b`](https://github.com/apache/spark/commit/6e6c49bdfef0071f18b6e9b607b7fc1bf5b49ff8).
[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...
Github user rezasafi commented on a diff in the pull request: https://github.com/apache/spark/pull/19848#discussion_r154256205

--- Diff: core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala ---
@@ -70,7 +70,8 @@ object SparkHadoopMapRedUtil extends Logging {
     if (shouldCoordinateWithDriver) {
       val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator
       val taskAttemptNumber = TaskContext.get().attemptNumber()
-      val canCommit = outputCommitCoordinator.canCommit(jobId, splitId, taskAttemptNumber)
+      val stageId = TaskContext.get().stageId()
+      val canCommit = outputCommitCoordinator.canCommit(stageId, splitId, taskAttemptNumber)

--- End diff --

Removing jobId from the signature of commitTask would cause a binary-incompatibility error, since commitTask is a public method. So we would end up with a parameter that stays unused.
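The shape of the compatibility concern can be sketched as follows (names and bodies are illustrative stand-ins, not Spark's actual code): the public method keeps its original parameter list so already-compiled callers still link, and the now-unneeded argument is simply ignored.

```scala
object CommitSketch {
  // Hypothetical stand-ins for TaskContext.get().stageId() and the
  // output commit coordinator; only the signatures matter here.
  private def currentStageId(): Int = 0
  private def canCommit(stageId: Int, splitId: Int, attempt: Int): Boolean = true

  // The public signature still accepts jobId so the change stays
  // binary-compatible, but coordination is now keyed on stageId.
  def commitTask(jobId: Int, splitId: Int, attemptNumber: Int): Boolean = {
    val stageId = currentStageId() // jobId is intentionally unused
    canCommit(stageId, splitId, attemptNumber)
  }
}
```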
[GitHub] spark issue #19850: [SPARK-22653] executorAddress registered in CoarseGraine...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19850

Please update the PR title:
```
[SPARK-22653][CORE] executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap should not be null
```
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154255068

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 5)
   }
 
+  // The following test cases have histogram information collected for the test column
+  test("Not(cintHgm < 3 AND null)") {
+    val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType)))
+    validateEstimatedStats(
+      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)),
+      expectedRowCount = 9)
+  }
+
+  test("cintHgm = 5") {
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 4)
+  }
+
+  test("cintHgm = 0") {
+    // This is an out-of-range case since 0 is outside the range [min, max]
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm < 3") {
+    validateEstimatedStats(
+      Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm < 0") {
+    // This is a corner case since literal 0 is smaller than min.
+    validateEstimatedStats(
+      Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm <= 3") {
+    validateEstimatedStats(
+      Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm > 6") {
+    validateEstimatedStats(
+      Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm > 10") {
+    // This is a corner case since max value is 10.
+    validateEstimatedStats(
+      Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm >= 6") {
+    validateEstimatedStats(
+      Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 4)
+  }
+
+  test("cintHgm IS NULL") {
+    validateEstimatedStats(
+      Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm IS NOT NULL") {
+    validateEstimatedStats(
+      Filter(IsNotNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 6, min = Some(1), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 10)
+  }
+
+  test("cintHgm > 3 AND cintHgm <= 6") {
+    val condition = And(GreaterThan(attrIntHgm,
+      Literal(3)), LessThanOrEqual(attrIntHgm, Literal(6)))
+    validateEstimatedStats(
+      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 5, min = Some(3), max = Some(6),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 8)
+  }
+
+  test("cintHgm = 3 OR cintHgm = 6") {

--- End diff --

We have added histogram test cases for a skewed distribution. I will add more histogram test cases for a non-skewed distribution.
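For readers following along, the intuition behind these expected row counts can be sketched with a toy equi-height histogram. This is illustrative only; Spark's real logic lives in FilterEstimation and also handles nulls, ndv, and bound updates as discussed elsewhere in this thread.

```scala
object HistogramSketch {
  // One equi-height bin: each bin covers roughly the same number of rows.
  final case class Bin(lo: Double, hi: Double)

  // Estimated selectivity of "col < v": full bins entirely below v, plus a
  // linear fraction of the bin that straddles v, over the total bin count.
  def lessThanSelectivity(bins: Seq[Bin], v: Double): Double = {
    val full = bins.count(_.hi < v)
    val partial = bins.find(b => b.lo <= v && v <= b.hi)
      .map(b => if (b.hi == b.lo) 0.0 else (v - b.lo) / (b.hi - b.lo))
      .getOrElse(0.0)
    (full + partial) / bins.length
  }

  def main(args: Array[String]): Unit = {
    val bins = Seq(Bin(1, 2), Bin(2, 3), Bin(3, 6), Bin(6, 10))
    println(lessThanSelectivity(bins, 3.0)) // 0.5 with these toy bins
  }
}
```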
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154254419

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),

--- End diff --

My earlier comment mentioned this test case.
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154254145

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt)
     : AttributeMap[ColumnStat] = {
     val newColumnStats = originalMap.map { case (attr, oriColStat) =>
-      // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows
-      // decreases; otherwise keep it unchanged.
-      val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter,
-        newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount)
       val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat)
+      val newNdv = if (colStat.distinctCount > 1) {

--- End diff --

The old code does not work well for a couple of the new skewed-distribution tests. For example, test("cintHgm < 3") would fail: it still recomputes newNdv via the updateNdv() method, even though the histogram-based estimation has already scaled the distinct count down to 1.
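The scaling being discussed can be summarized in a small sketch (illustrative; updateNdv in EstimationUtils is the real implementation): the distinct count shrinks proportionally with the filter's selectivity but never below 1, and a histogram-based estimate may pin it down directly, which is why re-running the generic scaling afterwards would be wrong.

```scala
object NdvSketch {
  // Scale oldNdv down with the row-count reduction, never below 1 and
  // never above the original value (a filter cannot create new values).
  def scaledNdv(rowsBefore: BigInt, rowsAfter: BigInt, oldNdv: BigInt): BigInt =
    if (rowsBefore <= 0 || rowsAfter >= rowsBefore) oldNdv
    else (oldNdv * rowsAfter / rowsBefore).max(BigInt(1))
}
```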
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84364 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84364/testReport)** for PR 19811 at commit [`38fd079`](https://github.com/apache/spark/commit/38fd079c8257457eda741f808e0d00cedc6fdf02).
[GitHub] spark issue #19824: [SPARK][STREAMING] Invoke onBatchCompletion() only when ...
Github user victor-wong commented on the issue: https://github.com/apache/spark/pull/19824

@CodingCat Yes, this PR addresses the same issue as https://github.com/apache/spark/pull/16542, but I think it solves it in a better way. If a job fails, we should not remove it from its JobSet, so `jobSet.hasCompleted` will return false and, as a result, no StreamingListenerBatchCompleted event will be posted. In short: if any job fails, the batch should be considered not completed. I'm not confident in my English, so if anything above is unclear, please let me know.
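A minimal sketch of the proposed semantics (illustrative, not the actual JobSet class): only successful jobs count toward completion, so a batch with a failed job never reports itself as completed and no completion event is posted.

```scala
object JobSetSketch {
  final class JobSet(totalJobs: Int) {
    private var succeeded = 0
    // Called once per job; failed jobs are deliberately not counted as done.
    def handleJobCompletion(success: Boolean): Unit =
      if (success) succeeded += 1
    def hasCompleted: Boolean = succeeded == totalJobs
  }

  def main(args: Array[String]): Unit = {
    val jobSet = new JobSet(totalJobs = 2)
    jobSet.handleJobCompletion(success = true)
    jobSet.handleJobCompletion(success = false)
    println(jobSet.hasCompleted) // false: no BatchCompleted event would fire
  }
}
```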
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154252063

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging {
     op match {
       case _: GreaterThan | _: GreaterThanOrEqual =>
-        // If new ndv is 1, then new max must be equal to new min.
-        newMin = if (newNdv == 1) newMax else newValue
+        newMin = newValue
       case _: LessThan | _: LessThanOrEqual =>
-        newMax = if (newNdv == 1) newMin else newValue
+        newMax = newValue

--- End diff --

Previously I coded it that way because of a corner test case, test("cbool > false"): there I set newMin to newMax since newNdv = 1. However, that logic does not work well for the skewed-distribution test case test("cintHgm < 3"), where newMin = 1 and newMax = 3. I think the revised code makes better sense.
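The revised rule can be stated compactly (an illustrative sketch, not the actual FilterEstimation code): for a range predicate, only the bound on the filtered side moves to the literal, and the opposite bound is left alone even when the new ndv collapses to 1.

```scala
object BoundSketch {
  sealed trait RangeOp
  case object GreaterThanish extends RangeOp // covers > and >=
  case object LessThanish extends RangeOp    // covers < and <=

  // Returns (newMin, newMax) after applying "col op value".
  def updateBounds(op: RangeOp, oldMin: Double, oldMax: Double, value: Double): (Double, Double) =
    op match {
      case GreaterThanish => (value, oldMax) // newMin = value; max unchanged
      case LessThanish    => (oldMin, value) // newMax = value; min unchanged
    }

  def main(args: Array[String]): Unit = {
    // "cintHgm < 3" over [1, 10] yields bounds (1.0, 3.0), as in the test above.
    println(updateBounds(LessThanish, 1, 10, 3))
  }
}
```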