[GitHub] spark pull request #19920: [SPARK-21672][CORE] Remove SHS-specific applicati...

2017-12-07 Thread smurakozi
Github user smurakozi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19920#discussion_r155718836
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala ---
@@ -88,4 +90,8 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("")
   private def makePageLink(showIncomplete: Boolean): String = {
     UIUtils.prependBaseUri("/?" + "showIncomplete=" + showIncomplete)
   }
+
+  private def isApplicationCompleted(appInfo: ApplicationInfo): Boolean = {
--- End diff --

My first commit had this method in `ApplicationInfo`, called `completed`.

@vanzin:
```
+case class ApplicationInfo private[spark](
...
+def completed: Boolean = {
```
Is this used anywhere?

I answered that we have it in `HistoryPage` (I forgot to mention the usage in the test).

@vanzin: Can we move this logic there instead? That avoids adding a new 
visible property in a public object.

I moved it, noting the code duplication in a comment on the PR.
The question is which is better: a bit of duplication (the helper is used 3 
times and defined twice, once in application code and once in test code) or 
extending the public API?
My gut feeling is that the small duplication is preferable in this case, but 
I'm a newbie in Spark development, so I'm easy to convince if others have 
different preferences :)
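
For context, a minimal sketch of the two options being weighed (fields are 
simplified, and the `attempts`-based body is an assumption about how 
completion is derived, not a quote of the patch):

```
// Simplified stand-in for org.apache.spark.status.api.v1.ApplicationAttemptInfo.
case class ApplicationAttemptInfo(attemptId: Option[String], completed: Boolean)

// Option A (extend the public API): expose completion on the case class itself.
case class ApplicationInfo(
    id: String,
    name: String,
    attempts: Seq[ApplicationAttemptInfo]) {
  // Assumes "completed" means the application's latest attempt has finished.
  def isCompleted: Boolean = attempts.nonEmpty && attempts.head.completed
}

// Option B (what the patch does): duplicate a small private helper at each use site.
object HistoryPageSketch {
  private def isApplicationCompleted(appInfo: ApplicationInfo): Boolean =
    appInfo.attempts.nonEmpty && appInfo.attempts.head.completed
}
```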


---




[GitHub] spark issue #19894: [SPARK-22700][ML] Bucketizer.transform incorrectly drops...

2017-12-07 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/19894
  
LGTM. Good fix.


---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19904
  
@sethah To verify the memory issue, you can add one line of test code against 
the current master, here:

```
  val modelFutures = ...
  // Unpersist training data only when all models have trained
  Future.sequence[Model[_], Iterable](modelFutures)(implicitly, executionContext)
    .onComplete { _ => trainingDataset.unpersist() } (executionContext)
  // Evaluate models in a Future that will calculate a metric and allow the
  // model to be cleaned up
  val foldMetricFutures = ...
  // Wait for metrics to be calculated before unpersisting validation dataset
  val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
  validationDataset.unpersist()

  // Add test code here: fetch all models.
  val models = modelFutures.map(_.value.get.get)

  foldMetrics
```
The test code added here is **`val models = modelFutures.map(_.value.get.get)`**, 
which proves that these models are still in memory: we can still get them.
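
For anyone unfamiliar with the pattern, here is a minimal, Spark-free sketch 
of why holding the future references keeps the models reachable, and of how 
`Future.sequence(...).onComplete` fires only after every future has finished; 
all names are illustrative:

```
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object SequenceDemo extends App {
  // Three "model fits" finishing at different times.
  val modelFutures = Seq(100, 50, 150).map { ms =>
    Future { Thread.sleep(ms); s"model-$ms" }
  }

  // Fires exactly once, after *all* futures have completed -- the same
  // pattern used above to unpersist the training data.
  val allDone = Future.sequence(modelFutures)
  allDone.onComplete { _ => println("all fits done -> unpersist would run here") }

  // Holding the future references keeps the fitted "models" reachable,
  // which is what the test line above demonstrates.
  val models = modelFutures.map(Await.result(_, Duration.Inf))
  println(s"still reachable: $models")
  Thread.sleep(200)  // give the daemon-threaded callback a moment to print
}
```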


---




[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...

2017-12-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19904#discussion_r155715665
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -146,25 +147,18 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
 
+  val completeFitCount = new AtomicInteger(0)
--- End diff --

We want to unpersist the training dataset as soon as all fitting has 
finished, but your idea here postpones the unpersist until all fitting & 
evaluation are done. Your code should have the same effect as:
```
val modelFutures = ...
val foldMetrics = modelFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
trainingDataset.unpersist()
validationDataset.unpersist()
```

And about what you said:
> "Now, the unpersist operation will happen in one of the training threads, 
instead of asynchronously in its own thread."

What's the possible effect or impact? `trainingDataset.unpersist()` itself is 
an async method and won't block, so will it have any effect? I think it can 
be put in any thread safely.
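
A quick way to check the non-blocking claim, assuming a spark-shell session 
with the usual `spark` handle (`Dataset.unpersist()` delegates to 
`unpersist(blocking = false)`):

```
val ds = spark.range(10000000L).cache()
ds.count()  // materialize the cached blocks

val t0 = System.nanoTime
ds.unpersist()  // returns immediately; block removal happens asynchronously
println(s"unpersist() returned after ${(System.nanoTime - t0) / 1e6} ms")
```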


---




[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19852
  
**[Test build #84647 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84647/testReport)**
 for PR 19852 at commit 
[`f606951`](https://github.com/apache/spark/commit/f606951b2fe2f74a0aa95055e17e479499c82e03).


---




[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...

2017-12-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19852
  
retest this please


---




[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options

2017-12-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19921
  
thanks, merging to master!


---




[GitHub] spark pull request #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2O...

2017-12-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19921


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19813
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19813
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84641/
Test PASSed.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19813
  
**[Test build #84641 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84641/testReport)**
 for PR 19813 at commit 
[`7fc3e04`](https://github.com/apache/spark/commit/7fc3e04b9894214e67227beeb14165610e4887d2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class ExprInputVar(exprCode: ExprCode, dataType: DataType, 
nullable: Boolean)`


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19813
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84642/
Test PASSed.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19813
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19813
  
**[Test build #84642 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84642/testReport)**
 for PR 19813 at commit 
[`9443011`](https://github.com/apache/spark/commit/9443011978c32c611e950a6193f05aa666437f50).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class ExprInputVar(exprCode: ExprCode, dataType: DataType, 
nullable: Boolean)`


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-07 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r155711631
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

@jason-dai Can you follow up on 
https://github.com/apache-spark-on-k8s/spark/issues/326 directly?


---




[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-12-07 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19621
  
maybe we could also change the test itself to make it more deterministic?
we could first create a new test dataset that avoids duplicate frequency 
values, run it through the original implementation, then run it through this 
new change to make sure the results match.

@actuaryzhang do you have any thought on this?


---




[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...

2017-12-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19904#discussion_r155710913
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -146,25 +147,18 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
 
+  val completeFitCount = new AtomicInteger(0)
--- End diff --

My understanding of Scala futures may be off here, but this seems to change 
the behavior to me. Now, the unpersist operation will happen in one of the 
training threads, instead of asynchronously in its own thread. I'm not sure how 
much of an effect that will have. 

Why can't you just put all the logic in one map statement like below:

```scala
  // Fit models in a Future for training in parallel
  val modelFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) =>
    Future[Model[_]] {
      val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]]

      if (collectSubModelsParam) {
        subModels.get(splitIndex)(paramIndex) = model
      }
      // TODO: duplicate evaluator to take extra params from input
      val metric = eval.evaluate(model.transform(validationDataset, paramMap))
      logDebug(s"Got metric $metric for model trained with $paramMap.")
      metric
    } (executionContext)
  }

  // Unpersist training data only when all models have trained
  Future.sequence[Model[_], Iterable](modelFutures)(implicitly, executionContext)
    .onComplete { _ => trainingDataset.unpersist() } (executionContext)
```



---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-07 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r155710752
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

The docker files in this PR are for references only for now, and are by no 
means the official docker images moving forward. We will address concerns 
around the base image as part of SPARK-22647.


---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/19904
  
Can you share your test/results with us?


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-07 Thread jason-dai
Github user jason-dai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r155710219
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

I wonder if there are any follow-ups on this particular issue 
(apache-spark-on-k8s#326).


---




[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19525
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84644/
Test PASSed.


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-07 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/19717
  
> Yes, I agree that we eventually should allow client mode including use 
cases that directly create SparkContext

Sorry, in your previous comment it seemed like you were suggesting disabling 
it entirely, and that had me concerned. As a separate issue, we may in the 
future want a nicer error message for an inaccessible in-cluster service 
(`kubernetes.default.svc`). That actually has no dependency on the driver 
service or its creation; we can check for it, and it should exist anyway. 
Disabling client mode as you did in the latest commit works for this 
particular scenario, I think.


---




[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19525
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19525
  
**[Test build #84644 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84644/testReport)**
 for PR 19525 at commit 
[`33e08ee`](https://github.com/apache/spark/commit/33e08ee22f2e37f47da246ce197f699351a3bc1b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19920: [SPARK-21672][CORE] Remove SHS-specific applicati...

2017-12-07 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19920#discussion_r155708117
  
--- Diff: core/src/main/scala/org/apache/spark/status/api/v1/api.scala ---
@@ -24,27 +24,29 @@ import com.fasterxml.jackson.databind.annotation.JsonDeserialize
 
 import org.apache.spark.JobExecutionStatus
 
-class ApplicationInfo private[spark](
-    val id: String,
-    val name: String,
-    val coresGranted: Option[Int],
-    val maxCores: Option[Int],
-    val coresPerExecutor: Option[Int],
-    val memoryPerExecutorMB: Option[Int],
-    val attempts: Seq[ApplicationAttemptInfo])
+case class ApplicationInfo private[spark](
+    id: String,
+    name: String,
+    coresGranted: Option[Int],
+    maxCores: Option[Int],
+    coresPerExecutor: Option[Int],
+    memoryPerExecutorMB: Option[Int],
+    attempts: Seq[ApplicationAttemptInfo]) {
+
--- End diff --

The empty braces can be removed here. But I think they should be kept if a 
new method `isCompleted` is added here.


---




[GitHub] spark pull request #19920: [SPARK-21672][CORE] Remove SHS-specific applicati...

2017-12-07 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19920#discussion_r155708779
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala ---
@@ -88,4 +90,8 @@ private[history] class HistoryPage(parent: HistoryServer) 
extends WebUIPage("")
   private def makePageLink(showIncomplete: Boolean): String = {
 UIUtils.prependBaseUri("/?" + "showIncomplete=" + showIncomplete)
   }
+
+  private def isApplicationCompleted(appInfo: ApplicationInfo): Boolean = {
--- End diff --

The function `isApplicationCompleted` is defined 3 times in this patch.
I suggest adding a method `isCompleted` to class `ApplicationInfo` instead.


---




[GitHub] spark pull request #19924: [SPARK-22187][SS][REVERT] Revert change in state ...

2017-12-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19924


---




[GitHub] spark issue #19924: [SPARK-22187][SS][REVERT] Revert change in state row for...

2017-12-07 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19924
  
Thanks. Merging to master.


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-07 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/19717
  
> I think we should move the headless service creation into the backend 
code - anything essential for the backend to run shouldn't depend on the 
submission client/steps.

I agree service creation should be done by the scheduler backend code, so as 
to get rid of the dependency on the submission client.

> Keeping that possible will let us plug Jupyter/spark-shell (running 
in-cluster for now). Disabling it completely will create an unnecessary 
dependency on spark-submit which IMO is undesirable. We do want people to be 
able to programmatically construct a spark context that can point at a k8s 
cluster in their code I think. kubernetes.default.svc is the in-cluster 
kube-dns provisioned DNS address that should point to the API server - its 
availability is a good indicator that we're in a place that can address other 
pods - so, that can be used to detect when we don't want to let users try and 
fail client mode.

Yes, I agree that we eventually should allow client mode including use 
cases that directly create `SparkContext`. But until we have a solid 
well-tested solution, I think we should disable it for now. We can always 
revisit this once we have a good solution. Regarding `kubernetes.default.svc`: 
yes, it's a good indicator, but again, the driver service must exist. Unless 
we change to have the backend create that service, this still won't work.


---




[GitHub] spark pull request #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to ...

2017-12-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19884#discussion_r155708588
  
--- Diff: pom.xml ---
@@ -185,7 +185,7 @@
 2.8
 1.8
 1.0.0
-0.4.0
+0.8.0-SNAPSHOT
--- End diff --

Can we download the snapshot from somewhere for our local tests?


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19717
  
**[Test build #84646 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84646/testReport)**
 for PR 19717 at commit 
[`44c40b1`](https://github.com/apache/spark/commit/44c40b1b7d546e4d214aff78c282c768aa0e3d42).


---




[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19843
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19843
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84637/
Test PASSed.


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-07 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/19717
  
@liyinan926:

> Actually given that in our implementations the executors talk to the 
driver through a Kubernetes headless service that is only created by the 
submission client. None of the two cases above would work without careful and 
tricky configuration hacks. 

I think we should move the headless service creation into the backend code 
- anything essential for the backend to run shouldn't depend on the submission 
client/steps.  cc/ @mccheah for comment

> So creating a SparkContext directly with a k8s url should not be allowed 
until we find a way to make it work. 

Keeping that possible will let us plug Jupyter/spark-shell (running 
in-cluster for now). Disabling it completely will create an unnecessary 
dependency on spark-submit which IMO is undesirable. We do want people to be 
able to programmatically construct a spark context that can point at a k8s 
cluster in their code I think. `kubernetes.default.svc` is the in-cluster 
kube-dns provisioned DNS address that should point to the API server - its 
availability is a good indicator that we're in a place that can address other 
pods - so, that can be used to detect when we don't want to let users try and 
fail client mode. 




---




[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19843
  
**[Test build #84637 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84637/testReport)**
 for PR 19843 at commit 
[`9629b31`](https://github.com/apache/spark/commit/9629b31e577a40f71d660d22f48b0b59489a3471).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-07 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r155707713
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

The base image **openjdk:8-alpine** uses **musl libc** instead of 
**glibc**, so certain libraries (like Intel MKL for native BLAS support in 
MLlib) cannot run in the container.
There is also a report of an OpenJDK compatibility issue on Alpine 
(https://github.com/apache-spark-on-k8s/spark/issues/326).
Should we change to a base image that supports glibc?


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-07 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/19717
  
@vanzin I implemented the enforcement in 
https://github.com/apache/spark/pull/19717/commits/44c40b1b7d546e4d214aff78c282c768aa0e3d42.
 PTAL. Thanks!


---




[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19852
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19852
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84633/
Test FAILed.


---




[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19852
  
**[Test build #84633 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84633/testReport)**
 for PR 19852 at commit 
[`f606951`](https://github.com/apache/spark/commit/f606951b2fe2f74a0aa95055e17e479499c82e03).
 * This patch **fails from timeout after a configured wait of `250m`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14151
  
**[Test build #84645 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84645/testReport)**
 for PR 14151 at commit 
[`7e91020`](https://github.com/apache/spark/commit/7e9102040ee9d6817288b5712e4c3c353390fba1).


---




[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-12-07 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r155706338
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -313,11 +313,16 @@ def text(self, paths):
 Each line in the text file is a new row in the resulting DataFrame.
 
 :param paths: string, or list of strings, for input path(s).
:param wholetext: if true, read each file from input path(s) as a single row.
 
 >>> df = spark.read.text('python/test_support/sql/text-test.txt')
 >>> df.collect()
 [Row(value=u'hello'), Row(value=u'this')]
+>>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True)
+>>> df.collect()
+[Row(value=u'hello\nthis')]
--- End diff --

That would fail the test, I suppose. I can give that a try though.


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-12-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r155706151
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -363,9 +370,25 @@ class ParquetFileFormat
   fileSplit.getLocations,
   null)
 
+  val sharedConf = broadcastedHadoopConf.value.value
+  // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps'
+  // *only* if the file was created by something other than "parquet-mr", so check the actual
+  // writer here for this file.  We have to do this per-file, as each file in the table may
+  // have different writers.
+  def isCreatedByParquetMr(): Boolean = {
--- End diff --

Ah, that makes sense. Thanks!


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-12-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r155705887
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -363,9 +370,25 @@ class ParquetFileFormat
   fileSplit.getLocations,
   null)
 
+  val sharedConf = broadcastedHadoopConf.value.value
+  // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps'
+  // *only* if the file was created by something other than "parquet-mr", so check the actual
+  // writer here for this file.  We have to do this per-file, as each file in the table may
+  // have different writers.
+  def isCreatedByParquetMr(): Boolean = {
--- End diff --

I was thinking the same thing, but then I realised this is intentionally a 
`def`, for short-circuiting `timestampConversion && !isCreatedByParquetMr()` 
at L383.
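
A Spark-free illustration of that short-circuit: `&&` evaluates its 
right-hand side only when the left-hand side is true, so a `def` avoids the 
footer read entirely when the flag is false, whereas a `val` would pay that 
cost eagerly:

```
def isCreatedByParquetMr(): Boolean = {
  println("reading Parquet footer...")  // stands in for the readFooter call
  true
}

val timestampConversion = false
if (timestampConversion && !isCreatedByParquetMr()) {
  println("would apply timezone conversion")
}
// Prints nothing: isCreatedByParquetMr() is never called, so the footer is never read.
```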


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-07 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/19717
  
Checking the presence of `kubernetes.default.svc` is not sufficient, as the 
driver service must also exist. When the submission client is not involved, 
the driver itself must create that service before requesting the executor 
pods. I think we can go with an env var flag as @vanzin suggested.
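
A minimal sketch of that idea (the env var name here is hypothetical, not 
necessarily what the PR uses): the submission client sets a flag in the 
driver's environment, and the scheduler backend refuses to start if a 
`SparkContext` is created without it:

```
// Hypothetical variable name, for illustration only.
val launchedBySubmissionClient =
  sys.env.get("SPARK_K8S_LAUNCHED_BY_SUBMIT").contains("true")

if (!launchedBySubmissionClient) {
  throw new UnsupportedOperationException(
    "Client mode is currently not supported for Spark on Kubernetes; " +
      "please submit the application via spark-submit in cluster mode.")
}
```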


---




[GitHub] spark issue #19926: [SPARK-22733] Split StreamExecution into MicroBatchExecu...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19926
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84635/
Test PASSed.


---




[GitHub] spark issue #19926: [SPARK-22733] Split StreamExecution into MicroBatchExecu...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19926
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19769
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84634/
Test FAILed.


---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19769
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19769
  
**[Test build #84634 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84634/testReport)**
 for PR 19769 at commit 
[`1185280`](https://github.com/apache/spark/commit/11852808e9fab0efa0ca7a4548e8ea68c42b6266).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19926: [SPARK-22733] Split StreamExecution into MicroBatchExecu...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19926
  
**[Test build #84635 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84635/testReport)**
 for PR 19926 at commit 
[`9f4daf6`](https://github.com/apache/spark/commit/9f4daf6b3eaeb413b803c3d0ec2ce0f43dc556a5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-12-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r155704230
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ---
@@ -105,10 +112,19 @@
*/
   private final MemoryMode MEMORY_MODE;
 
-  public VectorizedParquetRecordReader(boolean useOffHeap) {
+  public VectorizedParquetRecordReader(TimeZone convertTz, boolean useOffHeap) {
+this.convertTz = convertTz;
 MEMORY_MODE = useOffHeap ? MemoryMode.OFF_HEAP : MemoryMode.ON_HEAP;
   }
 
+  public VectorizedParquetRecordReader(boolean useOffHeap) {
+this(null, useOffHeap);
+  }
+
+  VectorizedParquetRecordReader() {
--- End diff --

What is this constructor for?


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-12-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r155703789
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -363,9 +370,25 @@ class ParquetFileFormat
   fileSplit.getLocations,
   null)
 
+  val sharedConf = broadcastedHadoopConf.value.value
+  // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps'
+  // *only* if the file was created by something other than "parquet-mr", so check the actual
+  // writer here for this file.  We have to do this per-file, as each file in the table may
+  // have different writers.
+  def isCreatedByParquetMr(): Boolean = {
--- End diff --

nit: Do we need to define this as `def` instead of `val`?


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-12-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r155703855
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -363,9 +370,25 @@ class ParquetFileFormat
   fileSplit.getLocations,
   null)
 
+  val sharedConf = broadcastedHadoopConf.value.value
+  // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps'
+  // *only* if the file was created by something other than "parquet-mr", so check the actual
+  // writer here for this file.  We have to do this per-file, as each file in the table may
+  // have different writers.
+  def isCreatedByParquetMr(): Boolean = {
+    val footer = ParquetFileReader.readFooter(sharedConf, fileSplit.getPath, SKIP_ROW_GROUPS)
+    footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
+  }
+  val convertTz =
+    if (timestampConversion && !isCreatedByParquetMr()) {
+      Some(DateTimeUtils.getTimeZone(sharedConf.get(SQLConf.SESSION_LOCAL_TIMEZONE.key)))
+    } else {
+      None
+    }
+
   val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
   val hadoopAttemptContext =
-    new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId)
+    new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId);
--- End diff --

nit: we don't need `;` at the end of this line.


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-12-07 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r155704317
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -307,6 +308,10 @@ class ParquetFileFormat
 hadoopConf.set(
   ParquetWriteSupport.SPARK_ROW_SCHEMA,
   requiredSchema.json)
+hadoopConf.set(
+  SQLConf.SESSION_LOCAL_TIMEZONE.key,
+  sparkSession.sessionState.conf.sessionLocalTimeZone
+)
--- End diff --

nit: we can join this line to the previous line.


---




[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19525
  
**[Test build #84644 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84644/testReport)**
 for PR 19525 at commit 
[`33e08ee`](https://github.com/apache/spark/commit/33e08ee22f2e37f47da246ce197f699351a3bc1b).


---




[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19811
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84638/
Test FAILed.


---




[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19811
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19811
  
**[Test build #84638 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84638/testReport)**
 for PR 19811 at commit 
[`2ea1d35`](https://github.com/apache/spark/commit/2ea1d35972b8b6a5b6ca7a74bf7e8ceccc857536).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19852
  
**[Test build #84643 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84643/testReport)**
 for PR 19852 at commit 
[`f606951`](https://github.com/apache/spark/commit/f606951b2fe2f74a0aa95055e17e479499c82e03).


---




[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...

2017-12-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19852
  
retest this please


---




[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...

2017-12-07 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/19872
  
@icexelloss I'm sorry for the late response.
Actually, I tried to implement prototypes of Pandas UDAF with partial 
aggregation and by combining existing aggregate functions, but they are still 
quite complicated (https://github.com/ueshin/apache-spark/pull/2, 
https://github.com/ueshin/apache-spark/pull/3, 
https://github.com/ueshin/apache-spark/pull/4). I've been thinking about an 
easier way to achieve that, but haven't found one yet.
I haven't looked into this PR yet, but I guess we can start with this PR and 
pick some functionality from my prototypes if needed.


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19927
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84640/
Test PASSed.


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19927
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19927
  
**[Test build #84640 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84640/testReport)**
 for PR 19927 at commit 
[`b95795f`](https://github.com/apache/spark/commit/b95795fefce3dcb0bc3b91cbf70f09d7747affd1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19852: [SPARK-22655][PySpark] Throw exception rather tha...

2017-12-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19852#discussion_r155701675
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -317,10 +317,6 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
             logDebug("Exception thrown after task interruption", e)
             throw new TaskKilledException(context.getKillReason().getOrElse("unknown reason"))
 
-  case e: Exception if env.isStopped =>
--- End diff --

Note to myself: this was added in https://github.com/apache/spark/pull/2838


---




[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19591
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19591
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84629/
Test PASSed.


---




[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19591
  
**[Test build #84629 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84629/testReport)**
 for PR 19591 at commit 
[`8496024`](https://github.com/apache/spark/commit/84960244458045810f0066256fbbee2446a3071c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/19927
  
test code:

```
import org.apache.spark.ml.classification._
// Needed for rand(); not imported by default in spark-shell.
import org.apache.spark.sql.functions.rand

val df = spark.read.format("libsvm")
  .load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_multiclass_classification_data.txt")

val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
val ovr = new OneVsRest().setClassifier(classifier)

// Double the dataset 10 times (2^10 copies) so the transform cost is measurable.
var df2 = df
(0 until 10).foreach { i => df2 = df2.union(df2) }
df2 = df2.repartition(64)
df2.persist()
df2.count

Seq(3, 5, 7, 10, 20).foreach { k =>
  val dfk = df2.withColumn("label", (rand() * k).cast("int"))
  val ovrModelK = ovr.fit(dfk)
  val start = System.nanoTime
  (0 until 10).foreach { i => ovrModelK.transform(dfk).count }
  val end = System.nanoTime
  val duration = (end - start) / 1e9
  println(s"numClasses $k, duration $duration")
}
```


---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84639/
Test PASSed.


---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #84639 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84639/testReport)**
 for PR 19904 at commit 
[`2cc7c28`](https://github.com/apache/spark/commit/2cc7c28f385009570536690d686f2843485942b2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19813
  
**[Test build #84642 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84642/testReport)**
 for PR 19813 at commit 
[`9443011`](https://github.com/apache/spark/commit/9443011978c32c611e950a6193f05aa666437f50).


---




[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...

2017-12-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19852
  
LGTM


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19813
  
**[Test build #84641 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84641/testReport)**
 for PR 19813 at commit 
[`7fc3e04`](https://github.com/apache/spark/commit/7fc3e04b9894214e67227beeb14165610e4887d2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...

2017-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19813#discussion_r155698248
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines util methods used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns all the necessary parameters to evaluate 
it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to 
call the function.
+   * The 2nd string in returned tuple is the parameter strings used to 
declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred 
expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred by children expressions.
+   */
+  def getExpressionInputParams(
+  ctx: CodegenContext,
+  expr: Expression): Option[(Seq[String], Seq[String])] = {
+val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, 
inputVars)
+
+val subExprs = getSubExprInChildren(ctx, expr)
+val subExprCodes = getSubExprCodes(ctx, subExprs)
+val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, 
subExprCodes)
+
+val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+  (row, s"InternalRow $row")
+}
+
+val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + 
paramsFromRows.length
+// Maximum allowed parameter number for Java's method descriptor.
+if (paramsLength > 255) {
--- End diff --

Since we split functions out whenever the generated code grows large enough, I 
don't think we can realistically hit this limit; it's more of a safety guard.
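
For reference, the 255 comes from the JVM's method descriptor rules: a descriptor may use at most 255 parameter slots (including `this` for instance methods), and `long`/`double` each occupy two slots. A minimal self-contained sketch of that slot arithmetic (hypothetical names, plain Scala, not the Spark code):

```scala
object ParamSlotLimit {
  // JVM limit on parameter slots in a single method descriptor.
  val MaxSlots = 255

  // long and double occupy two slots each; every other type one.
  def slotCount(javaTypes: Seq[String]): Int =
    javaTypes.map(t => if (t == "long" || t == "double") 2 else 1).sum

  // Mirrors the guard in the diff: give up when over the limit.
  def fitsInOneMethod(javaTypes: Seq[String]): Boolean =
    slotCount(javaTypes) <= MaxSlots

  def main(args: Array[String]): Unit = {
    println(fitsInOneMethod(Seq.fill(200)("int")))    // true: 200 slots
    println(fitsInOneMethod(Seq.fill(130)("double"))) // false: 260 slots
  }
}
```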


---




[GitHub] spark issue #19923: [SPARK-22721] BytesToBytesMap peak memory not updated.

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19923
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19923: [SPARK-22721] BytesToBytesMap peak memory not updated.

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19923
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84627/
Test PASSed.


---




[GitHub] spark issue #19923: [SPARK-22721] BytesToBytesMap peak memory not updated.

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19923
  
**[Test build #84627 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84627/testReport)**
 for PR 19923 at commit 
[`12de708`](https://github.com/apache/spark/commit/12de708a3ba894b9568be12402f561b767355acc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19927
  
**[Test build #84640 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84640/testReport)**
 for PR 19927 at commit 
[`b95795f`](https://github.com/apache/spark/commit/b95795fefce3dcb0bc3b91cbf70f09d7747affd1).


---




[GitHub] spark issue #18995: [SPARK-21787][SQL] Support for pushing down filters for ...

2017-12-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18995
  
Thank you so much, @cloud-fan!


---




[GitHub] spark issue #19499: [SPARK-22279][SQL] Turn on spark.sql.hive.convertMetasto...

2017-12-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19499
  
Thank you so much, @gatorsmile!


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19927
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19927
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84636/
Test FAILed.


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19927
  
**[Test build #84636 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84636/testReport)**
 for PR 19927 at commit 
[`45f70c8`](https://github.com/apache/spark/commit/45f70c87822b7c4bd872c0729509eeed0a2f9441).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...

2017-12-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19904#discussion_r155695280
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -146,31 +146,34 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") 
override val uid: String)
   val validationDataset = sparkSession.createDataFrame(validation, 
schema).cache()
   logDebug(s"Train split $splitIndex with multiple sets of 
parameters.")
 
+  var completeFitCount = 0
+  val signal = new Object
   // Fit models in a Future for training in parallel
-  val modelFutures = epm.zipWithIndex.map { case (paramMap, 
paramIndex) =>
-Future[Model[_]] {
+  val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, 
paramIndex) =>
+Future[Double] {
   val model = est.fit(trainingDataset, 
paramMap).asInstanceOf[Model[_]]
+  signal.synchronized {
+completeFitCount += 1
+signal.notify()
+  }
 
   if (collectSubModelsParam) {
 subModels.get(splitIndex)(paramIndex) = model
   }
-  model
-} (executionContext)
-  }
-
-  // Unpersist training data only when all models have trained
-  Future.sequence[Model[_], Iterable](modelFutures)(implicitly, 
executionContext)
-.onComplete { _ => trainingDataset.unpersist() } (executionContext)
-
-  // Evaluate models in a Future that will calulate a metric and allow 
model to be cleaned up
-  val foldMetricFutures = modelFutures.zip(epm).map { case 
(modelFuture, paramMap) =>
-modelFuture.map { model =>
   // TODO: duplicate evaluator to take extra params from input
   val metric = eval.evaluate(model.transform(validationDataset, 
paramMap))
   logDebug(s"Got metric $metric for model trained with $paramMap.")
   metric
 } (executionContext)
   }
+  Future {
+signal.synchronized {
+  while (completeFitCount < epm.length) {
--- End diff --

Ah! Your idea is really good. We don't need a separate "wait thread" to do the 
unpersist, so your solution will be simpler.


---




[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #84639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84639/testReport)**
 for PR 19904 at commit 
[`2cc7c28`](https://github.com/apache/spark/commit/2cc7c28f385009570536690d686f2843485942b2).


---




[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19921
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19921
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84626/
Test PASSed.


---




[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19921
  
**[Test build #84626 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84626/testReport)**
 for PR 19921 at commit 
[`2d4c2b0`](https://github.com/apache/spark/commit/2d4c2b0bdf89badf25f2d1d98903125e48e7cd5c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19919: [SPARK-22727] spark.executor.instances's default value s...

2017-12-07 Thread liu-zhaokun
Github user liu-zhaokun commented on the issue:

https://github.com/apache/spark/pull/19919
  
@srowen 
If the default is not correct, I can fix it in this PR as well.


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r155693977
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that 
matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that 
should be propagated to
+   * the data source.
+   */
+  def withSessionConfig(
+  cs: ConfigSupport,
+  source: String,
+  conf: SQLConf): immutable.Map[String, String] = {
+val prefixes = cs.getConfigPrefixes
+require(prefixes != null, "The config key-prefixes cann't be null.")
+val mapping = cs.getConfigMapping.asScala
+val validOptions = cs.getValidOptions
+require(validOptions != null, "The valid options list cann't be null.")
--- End diff --

double n
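
For illustration, a minimal sketch of the prefix-matching behavior this helper's doc comment describes (hypothetical names, plain Scala rather than the actual DataSourceV2 API): only session configs whose key starts with a declared prefix are propagated.

```scala
object SessionConfPropagation {
  // Keep only the session configs whose key starts with one of the
  // prefixes a data source declares (names here are hypothetical).
  def propagate(
      sessionConf: Map[String, String],
      prefixes: Seq[String]): Map[String, String] =
    sessionConf.filter { case (key, _) => prefixes.exists(key.startsWith) }

  def main(args: Array[String]): Unit = {
    val conf = Map(
      "spark.myformat.compression" -> "snappy",
      "spark.sql.shuffle.partitions" -> "200")
    // Only the key matching the declared prefix survives.
    println(propagate(conf, Seq("spark.myformat.")))
  }
}
```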


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r155693966
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that 
matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that 
should be propagated to
+   * the data source.
+   */
+  def withSessionConfig(
+  cs: ConfigSupport,
+  source: String,
+  conf: SQLConf): immutable.Map[String, String] = {
+val prefixes = cs.getConfigPrefixes
+require(prefixes != null, "The config key-prefixes cann't be null.")
--- End diff --

double n


---




[GitHub] spark issue #19919: [SPARK-22727] spark.executor.instances's default value s...

2017-12-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19919
  
Hm, I thought the default would appear in `.../config/package.scala`, but I 
don't see it. I'm not actually sure it defaults to 2 anywhere (?). (@vanzin, 
does this ring a bell?) I don't think these changes should be made, but I'm 
also wondering whether the docs are correct that the default is 2.


---




[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19811
  
**[Test build #84638 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84638/testReport)**
 for PR 19811 at commit 
[`2ea1d35`](https://github.com/apache/spark/commit/2ea1d35972b8b6a5b6ca7a74bf7e8ceccc857536).


---




[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...

2017-12-07 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19904#discussion_r155693144
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -146,31 +146,34 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") 
override val uid: String)
   val validationDataset = sparkSession.createDataFrame(validation, 
schema).cache()
   logDebug(s"Train split $splitIndex with multiple sets of 
parameters.")
 
+  var completeFitCount = 0
+  val signal = new Object
   // Fit models in a Future for training in parallel
-  val modelFutures = epm.zipWithIndex.map { case (paramMap, 
paramIndex) =>
-Future[Model[_]] {
+  val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, 
paramIndex) =>
+Future[Double] {
   val model = est.fit(trainingDataset, 
paramMap).asInstanceOf[Model[_]]
+  signal.synchronized {
+completeFitCount += 1
+signal.notify()
+  }
 
   if (collectSubModelsParam) {
 subModels.get(splitIndex)(paramIndex) = model
   }
-  model
-} (executionContext)
-  }
-
-  // Unpersist training data only when all models have trained
-  Future.sequence[Model[_], Iterable](modelFutures)(implicitly, 
executionContext)
-.onComplete { _ => trainingDataset.unpersist() } (executionContext)
-
-  // Evaluate models in a Future that will calulate a metric and allow 
model to be cleaned up
-  val foldMetricFutures = modelFutures.zip(epm).map { case 
(modelFuture, paramMap) =>
-modelFuture.map { model =>
   // TODO: duplicate evaluator to take extra params from input
   val metric = eval.evaluate(model.transform(validationDataset, 
paramMap))
   logDebug(s"Got metric $metric for model trained with $paramMap.")
   metric
 } (executionContext)
   }
+  Future {
+signal.synchronized {
+  while (completeFitCount < epm.length) {
--- End diff --

Sorry, I'm not too familiar with Futures in Scala. Is it safe to create a 
blocking future like this, or do you risk starving the thread pool? Can we just 
use an if statement in the `synchronized` block above? Something like:

```
completeFitCount += 1
if (completeFitCount == epm.length) {
  trainingDataset.unpersist()
}
```
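
A runnable, self-contained sketch of this counter idea (all names here are hypothetical stand-ins, not the actual Spark classes): the future that finishes the last fit performs the cleanup, so no extra blocking thread is needed.

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object UnpersistOnLastFit {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(3)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    val numParamMaps = 8
    var completeFitCount = 0
    val lock = new Object

    // Stand-ins for est.fit(...) and trainingDataset.unpersist().
    def fitModel(i: Int): String = { Thread.sleep(50); s"model-$i" }
    def unpersistTrainingData(): Unit = println("unpersisting training data")

    val metricFutures = (0 until numParamMaps).map { i =>
      Future {
        val model = fitModel(i)
        lock.synchronized {
          completeFitCount += 1
          // Whichever future completes the last fit does the cleanup;
          // no dedicated waiting thread is required.
          if (completeFitCount == numParamMaps) unpersistTrainingData()
        }
        model.length.toDouble // stand-in for evaluating a metric
      }
    }

    val metrics = Await.result(Future.sequence(metricFutures), Duration.Inf)
    println(s"got ${metrics.size} metrics")
    pool.shutdown()
  }
}
```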


---




[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19843
  
**[Test build #84637 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84637/testReport)**
 for PR 19843 at commit 
[`9629b31`](https://github.com/apache/spark/commit/9629b31e577a40f71d660d22f48b0b59489a3471).


---




[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

2017-12-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19783#discussion_r155692778
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala
 ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends 
StatsEstimationTestBase {
   test("cbool > false") {
 validateEstimatedStats(
   Filter(GreaterThan(attrBool, Literal(false)), 
childStatsTestPlan(Seq(attrBool), 10L)),
-  Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max 
= Some(true),
+  Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max 
= Some(true),
--- End diff --

That may need a special code path for the boolean type, but IMHO I don't think 
it deserves the complexity.


---




[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19927
  
**[Test build #84636 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84636/testReport)**
 for PR 19927 at commit 
[`45f70c8`](https://github.com/apache/spark/commit/45f70c87822b7c4bd872c0729509eeed0a2f9441).


---



