[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...

2017-06-21 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/18322
  
@jiangxb1987 can you please help to review this PR? This is a simple code 
improvement to avoid some unnecessary code execution when the remaining cores 
are not enough for one executor. 

I don't have a strong inclination on this PR, since the previous code also behaves 
correctly; I'd like to hear your thoughts.
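As a rough illustration only (a hedged sketch with made-up names, not the actual Master.scala code), the kind of early exit this PR is about looks like:

    // Skip a worker entirely when its free cores cannot hold even one executor,
    // instead of running the allocation bookkeeping for it.
    case class WorkerInfo(coresFree: Int)

    def canLaunchExecutor(worker: WorkerInfo, coresPerExecutor: Int): Boolean =
      worker.coresFree >= coresPerExecutor

    val workers = Seq(WorkerInfo(coresFree = 3), WorkerInfo(coresFree = 8))
    val usableWorkers = workers.filter(w => canLaunchExecutor(w, coresPerExecutor = 4))
    // Only the worker with 8 free cores is considered further.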





[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17113
  
**[Test build #78431 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78431/testReport)**
 for PR 17113 at commit 
[`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39).





[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...

2017-06-21 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17113
  
Jenkins, retest this please.





[GitHub] spark pull request #9518: [SPARK-11574][Core] Add metrics StatsD sink

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/9518#discussion_r123420182
  
--- Diff: 
core/src/main/scala/org/apache/spark/metrics/sink/StatsdReporter.scala ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics.sink
+
+import java.io.IOException
+import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress}
+import java.nio.charset.StandardCharsets.UTF_8
+import java.util.SortedMap
+import java.util.concurrent.TimeUnit
+
+import scala.collection.JavaConverters._
+import scala.util.{Failure, Success, Try}
+
+import com.codahale.metrics._
+import org.apache.hadoop.net.NetUtils
+
+import org.apache.spark.Logging
+
+/**
+ * @see https://github.com/etsy/statsd/blob/master/docs/metric_types.md;>
+ *StatsD metric types
+ */
+private[spark] sealed trait StatsdMetricType {
--- End diff --

This trait looks like it is only used for defining some constant values; maybe we 
could use an `object StatsdReporter` instead.
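For illustration, a hedged sketch of that suggestion (constant values copied from the diff; the renamed `SET` and the wire-format comments are mine):

    private[spark] object StatsdReporter {
      // StatsD wire format is "name:value|type"; these are the type suffixes.
      val COUNTER = "c"   // e.g. "spark.jobs:1|c"
      val GAUGE = "g"
      val TIMER = "ms"
      val SET = "s"
    }

The reporter class would then reference `StatsdReporter.GAUGE` etc. instead of mixing in the trait.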





[GitHub] spark pull request #9518: [SPARK-11574][Core] Add metrics StatsD sink

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/9518#discussion_r123425292
  
--- Diff: 
core/src/main/scala/org/apache/spark/metrics/sink/StatsdReporter.scala ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics.sink
+
+import java.io.IOException
+import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress}
+import java.nio.charset.StandardCharsets.UTF_8
+import java.util.SortedMap
+import java.util.concurrent.TimeUnit
+
+import scala.collection.JavaConverters._
+import scala.util.{Failure, Success, Try}
+
+import com.codahale.metrics._
+import org.apache.hadoop.net.NetUtils
+
+import org.apache.spark.Logging
+
+/**
+ * @see https://github.com/etsy/statsd/blob/master/docs/metric_types.md;>
+ *StatsD metric types
+ */
+private[spark] sealed trait StatsdMetricType {
+  val COUNTER = "c"
+  val GAUGE = "g"
+  val TIMER = "ms"
+  val Set = "s"
+}
+
+private[spark] class StatsdReporter(
+registry: MetricRegistry,
+host: String = "127.0.0.1",
+port: Int = 8125,
+prefix: String = "",
+filter: MetricFilter = MetricFilter.ALL,
+rateUnit: TimeUnit = TimeUnit.SECONDS,
+durationUnit: TimeUnit = TimeUnit.MILLISECONDS)
+  extends ScheduledReporter(registry, "statsd-reporter", filter, rateUnit, durationUnit)
+  with StatsdMetricType with Logging {
+
+  private val address = new InetSocketAddress(host, port)
+  private val whitespace = "[\\s]+".r
+
+  override def report(
+  gauges: SortedMap[String, Gauge[_]],
+  counters: SortedMap[String, Counter],
+  histograms: SortedMap[String, Histogram],
+  meters: SortedMap[String, Meter],
+  timers: SortedMap[String, Timer]): Unit =
+Try(new DatagramSocket) match {
+  case Failure(ioe: IOException) => logWarning("StatsD datagram socket construction failed",
+NetUtils.wrapException(host, port, "0.0.0.0", 0, ioe))
+  case Failure(e) => logWarning("StatsD datagram socket construction failed", e)
+  case Success(s) =>
+implicit val socket = s
+val localAddress = Try(socket.getLocalAddress).map(_.getHostAddress).getOrElse(null)
+val localPort = socket.getLocalPort
+Try {
+  gauges.entrySet.asScala.foreach(e => reportGauge(e.getKey, e.getValue))
+  counters.entrySet.asScala.foreach(e => reportCounter(e.getKey, e.getValue))
+  histograms.entrySet.asScala.foreach(e => reportHistogram(e.getKey, e.getValue))
+  meters.entrySet.asScala.foreach(e => reportMetered(e.getKey, e.getValue))
+  timers.entrySet.asScala.foreach(e => reportTimer(e.getKey, e.getValue))
+} recover {
+  case ioe: IOException =>
+logDebug(s"Unable to send packets to StatsD", NetUtils.wrapException(
+  address.getHostString, address.getPort, localAddress, localPort, ioe))
+  case e: Throwable => logDebug(s"Unable to send packets to StatsD at '$host:$port'", e)
+}
+Try(socket.close()) recover {
+  case ioe: IOException =>
+logDebug("Error when close socket to StatsD", NetUtils.wrapException(
+  address.getHostString, address.getPort, localAddress, localPort, ioe))
+  case e: Throwable => logDebug("Error when close socket to StatsD", e)
+}
+}
+
+  private def reportGauge(name: String, gauge: Gauge[_])(implicit socket: DatagramSocket) =
+formatAny(gauge.getValue).foreach(v => send(fullName(name), v, GAUGE))
+
+  private def reportCounter(name: String, counter: Counter)(implicit socket: DatagramSocket) =
+send(fullName(name), format(counter.getCount), COUNTER)
+
+  private def reportHistogram(name: String, histogram: Histogram)
+ (implicit 

[GitHub] spark pull request #9518: [SPARK-11574][Core] Add metrics StatsD sink

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/9518#discussion_r123424723
  
--- Diff: 
core/src/main/scala/org/apache/spark/metrics/sink/StatsdReporter.scala ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics.sink
+
+import java.io.IOException
+import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress}
+import java.nio.charset.StandardCharsets.UTF_8
+import java.util.SortedMap
+import java.util.concurrent.TimeUnit
+
+import scala.collection.JavaConverters._
+import scala.util.{Failure, Success, Try}
+
+import com.codahale.metrics._
+import org.apache.hadoop.net.NetUtils
+
+import org.apache.spark.Logging
+
+/**
+ * @see https://github.com/etsy/statsd/blob/master/docs/metric_types.md;>
+ *StatsD metric types
+ */
+private[spark] sealed trait StatsdMetricType {
+  val COUNTER = "c"
+  val GAUGE = "g"
+  val TIMER = "ms"
+  val Set = "s"
+}
+
+private[spark] class StatsdReporter(
+registry: MetricRegistry,
+host: String = "127.0.0.1",
+port: Int = 8125,
+prefix: String = "",
+filter: MetricFilter = MetricFilter.ALL,
+rateUnit: TimeUnit = TimeUnit.SECONDS,
+durationUnit: TimeUnit = TimeUnit.MILLISECONDS)
+  extends ScheduledReporter(registry, "statsd-reporter", filter, rateUnit, durationUnit)
+  with StatsdMetricType with Logging {
+
+  private val address = new InetSocketAddress(host, port)
+  private val whitespace = "[\\s]+".r
+
+  override def report(
+  gauges: SortedMap[String, Gauge[_]],
+  counters: SortedMap[String, Counter],
+  histograms: SortedMap[String, Histogram],
+  meters: SortedMap[String, Meter],
+  timers: SortedMap[String, Timer]): Unit =
+Try(new DatagramSocket) match {
+  case Failure(ioe: IOException) => logWarning("StatsD datagram socket construction failed",
+NetUtils.wrapException(host, port, "0.0.0.0", 0, ioe))
+  case Failure(e) => logWarning("StatsD datagram socket construction failed", e)
+  case Success(s) =>
+implicit val socket = s
+val localAddress = Try(socket.getLocalAddress).map(_.getHostAddress).getOrElse(null)
+val localPort = socket.getLocalPort
+Try {
+  gauges.entrySet.asScala.foreach(e => reportGauge(e.getKey, e.getValue))
+  counters.entrySet.asScala.foreach(e => reportCounter(e.getKey, e.getValue))
+  histograms.entrySet.asScala.foreach(e => reportHistogram(e.getKey, e.getValue))
+  meters.entrySet.asScala.foreach(e => reportMetered(e.getKey, e.getValue))
+  timers.entrySet.asScala.foreach(e => reportTimer(e.getKey, e.getValue))
+} recover {
+  case ioe: IOException =>
+logDebug(s"Unable to send packets to StatsD", NetUtils.wrapException(
+  address.getHostString, address.getPort, localAddress, localPort, ioe))
+  case e: Throwable => logDebug(s"Unable to send packets to StatsD at '$host:$port'", e)
+}
+Try(socket.close()) recover {
+  case ioe: IOException =>
+logDebug("Error when close socket to StatsD", NetUtils.wrapException(
+  address.getHostString, address.getPort, localAddress, localPort, ioe))
+  case e: Throwable => logDebug("Error when close socket to StatsD", e)
+}
+}
+
+  private def reportGauge(name: String, gauge: Gauge[_])(implicit socket: DatagramSocket) =
+formatAny(gauge.getValue).foreach(v => send(fullName(name), v, GAUGE))
+
+  private def reportCounter(name: String, counter: Counter)(implicit socket: DatagramSocket) =
+send(fullName(name), format(counter.getCount), COUNTER)
+
+  private def reportHistogram(name: String, histogram: Histogram)
+ (implicit 

[GitHub] spark issue #18371: [SPARK-20889][SparkR] Grouped documentation for MATH col...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18371
  
**[Test build #78430 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78430/testReport)**
 for PR 18371 at commit 
[`707b871`](https://github.com/apache/spark/commit/707b871160574297ef8eb75859d05d9ab13df02c).





[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...

2017-06-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18378#discussion_r123425404
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1721,7 +1721,14 @@ def toPandas(self):
 15Bob
 """
 import pandas as pd
-return pd.DataFrame.from_records(self.collect(), columns=self.columns)
+
+dtype = {}
+for field in self.schema:
+pandas_type = _to_corrected_pandas_type(field.dataType)
+if (pandas_type):
+dtype[field.name] = pandas_type
+
+return pd.DataFrame.from_records(self.collect(), columns=self.columns).astype(dtype)
--- End diff --

The `copy` param of `astype` is true by default. It seems to me we don't need 
to copy the data?





[GitHub] spark pull request #18371: [SPARK-20889][SparkR] Grouped documentation for M...

2017-06-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18371#discussion_r123425200
  
--- Diff: R/pkg/R/functions.R ---
@@ -34,6 +34,30 @@ NULL
 #' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))}
 NULL
 
+#' Math functions for Column operations
+#'
+#' Math functions defined for \code{Column}.
+#'
+#' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} 
and \code{shiftRightUnsigned},
+#'  this is the number of bits to shift.
+#' @param y Column to compute on.
+#' @param ... additional argument(s).
+#' @name column_math_functions
+#' @rdname column_math_functions
+#' @family math functions
+#' @examples
+#' \dontrun{
+#' # Dataframe used throughout this doc
+#' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
+#' tmp <- mutate(df, v1 = log(df$mpg), v2 = cbrt(df$disp),
+#'   v3 = bround(df$wt, 1), v4 = bin(df$cyl),
+#'   v5 = hex(df$wt), v6 = toDegrees(df$gear),
+#'   v7 = atan2(df$cyl, df$am), v8 = hypot(df$cyl, df$am),
+#'   v9 = pmod(df$hp, df$cyl), v10 = shiftLeft(df$disp, 1),
+#'   v11 = conv(df$hp, 10, 16))
--- End diff --

Three more examples added.





[GitHub] spark issue #18371: [SPARK-20889][SparkR] Grouped documentation for MATH col...

2017-06-21 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/18371
  
Made another commit that addresses your comments. 





[GitHub] spark pull request #18371: [SPARK-20889][SparkR] Grouped documentation for M...

2017-06-21 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18371#discussion_r123425179
  
--- Diff: R/pkg/R/functions.R ---
@@ -1405,18 +1309,12 @@ setMethod("sha1",
 column(jc)
   })
 
-#' signum
-#'
-#' Computes the signum of the given value.
-#'
-#' @param x Column to compute on.
+#' @details
+#' \code{signum}: Computes the signum of the given value.
--- End diff --

OK, fixed this.





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18378
  
**[Test build #78429 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78429/testReport)**
 for PR 18378 at commit 
[`1e98c49`](https://github.com/apache/spark/commit/1e98c494e0c414ca218b029bfc1a9d9faf3c2960).





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18378
  
It sounds OK to me, except for the missing `_have_pandas = False` above the `try:`.





[GitHub] spark pull request #17519: [SPARK-15352][Doc] follow-up: add configuration d...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17519#discussion_r123424505
  
--- Diff: docs/configuration.md ---
@@ -1004,14 +1004,48 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
-  spark.storage.replication.proactive
+  spark.storage.replication.proactive
   false
   
 Enables proactive block replication for RDD blocks. Cached RDD block 
replicas lost due to
 executor failures are replenished if there are any existing available 
replicas. This tries
 to get the replication level of the block to the initial number.
   
 
+
+  spark.storage.replication.policy
+  
+org.apache.spark.storage.RandomBlockReplicationPolicy
+  
+  
+The policy to use for choosing peers when replicating blocks. The 
default policy would randomly
+choose the peers to replicate to. A more resilient replication policy 
is provided by
+org.apache.spark.storage.BasicBlockReplicationPolicy, 
which makes use of the
+topology information of the hosts to choose the peers, much like the 
HDFS blocks replication
+strategy: it would try to choose the first replica within the same 
rack, and a third replica on
+a different rack. See 
spark.storage.replication.topologyMapper below for how to
+provide the topology information for the hosts.
+  
+
+
+  spark.storage.replication.topologyMapper
+  
+org.apache.spark.storage.DefaultTopologyMapper
+  
+  
+The topology information of a host is determined by a topology mapping 
service defined by the
+abstract class org.apache.spark.storage.TopologyMapper, 
which can be configured by
+this property. A default implementation that assumes all hosts are in 
the same rack is provided
+by org.apache.spark.storage.DefaultTopologyMapper. A 
file-based implementation is
+provided by 
org.apache.spark.storage.FileBasedTopologyMapper, which reads the
+topology information from the file 
org.apache.spark.storage.topologyFile. Each line
+of this file is of the format of host1 = /rack1 and 
provides a mapping from a host
+name to its rack information. Note: This configuration only 
takes effect when
+spark.storage.replication.policy is set to a a policy 
that takes the topology
--- End diff --

nit: double `a`





[GitHub] spark pull request #17519: [SPARK-15352][Doc] follow-up: add configuration d...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17519#discussion_r123424493
  
--- Diff: docs/configuration.md ---
@@ -1004,14 +1004,48 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
-  spark.storage.replication.proactive
+  spark.storage.replication.proactive
   false
   
 Enables proactive block replication for RDD blocks. Cached RDD block 
replicas lost due to
 executor failures are replenished if there are any existing available 
replicas. This tries
 to get the replication level of the block to the initial number.
   
 
+
+  spark.storage.replication.policy
+  
+org.apache.spark.storage.RandomBlockReplicationPolicy
+  
+  
+The policy to use for choosing peers when replicating blocks. The 
default policy would randomly
+choose the peers to replicate to. A more resilient replication policy 
is provided by
+org.apache.spark.storage.BasicBlockReplicationPolicy, 
which makes use of the
+topology information of the hosts to choose the peers, much like the 
HDFS blocks replication
+strategy: it would try to choose the first replica within the same 
rack, and a third replica on
+a different rack. See 
spark.storage.replication.topologyMapper below for how to
+provide the topology information for the hosts.
+  
+
+
+  spark.storage.replication.topologyMapper
+  
+org.apache.spark.storage.DefaultTopologyMapper
+  
+  
+The topology information of a host is determined by a topology mapping 
service defined by the
+abstract class org.apache.spark.storage.TopologyMapper, 
which can be configured by
+this property. A default implementation that assumes all hosts are in 
the same rack is provided
+by org.apache.spark.storage.DefaultTopologyMapper. A 
file-based implementation is
+provided by 
org.apache.spark.storage.FileBasedTopologyMapper, which reads the
+topology information from the file 
org.apache.spark.storage.topologyFile. Each line
--- End diff --

shall we also add an entry for `org.apache.spark.storage.topologyFile`?
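For readers following along, a hedged sketch (illustrative only, not from the PR) of how the two documented properties above are set from user code; the keys are taken from the doc text in this diff:

    // Illustrative only: select the rack-aware policy and the file-based topology mapper.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.storage.replication.policy",
        "org.apache.spark.storage.BasicBlockReplicationPolicy")
      .set("spark.storage.replication.topologyMapper",
        "org.apache.spark.storage.FileBasedTopologyMapper")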





[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18382
  
**[Test build #78428 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78428/testReport)**
 for PR 18382 at commit 
[`9eaa6d6`](https://github.com/apache/spark/commit/9eaa6d689be820eeb02170ad2191b9547af3aa15).





[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18382
  
Thank you @felixcheung. I checked the docs and the same thing in the PR 
description.





[GitHub] spark pull request #18162: [SPARK-20923] turn tracking of TaskMetrics._updat...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18162#discussion_r123424121
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -295,4 +295,12 @@ package object config {
 "above this threshold. This is to avoid a giant request takes too 
much memory.")
   .bytesConf(ByteUnit.BYTE)
   .createWithDefaultString("200m")
+
+  private[spark] val TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES =
+ConfigBuilder("spark.taskMetrics.trackUpdatedBlockStatuses")
--- End diff --

you can document it in `configuration.md`.

Actually, can we turn it off by default? I think this feature is useless for most 
users. cc @JoshRosen 
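For reference, a hedged sketch of what a documented, default-off entry could look like (following the `ConfigBuilder` pattern already visible in this diff; the `doc` text is illustrative, not the final wording):

    private[spark] val TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES =
      ConfigBuilder("spark.taskMetrics.trackUpdatedBlockStatuses")
        .doc("Whether to track updated block statuses in TaskMetrics; only needed by " +
          "listeners that consume per-task block updates.")
        .booleanConf
        .createWithDefault(false)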





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18378
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78427/
Test FAILed.





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18378
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18378
  
**[Test build #78427 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78427/testReport)**
 for PR 18378 at commit 
[`36dc5e7`](https://github.com/apache/spark/commit/36dc5e7df4549270e66b33d4d171898e8b21faae).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18162: [SPARK-20923] turn tracking of TaskMetrics._updat...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18162#discussion_r123424029
  
--- Diff: 
core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ---
@@ -528,7 +528,13 @@ class JobProgressListener(conf: SparkConf) extends 
SparkListener with Logging {
 new StageUIData
   })
   val taskData = stageData.taskData.get(taskId)
-  val metrics = TaskMetrics.fromAccumulatorInfos(accumUpdates)
+  val accumsFiltered = if (conf.get(TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES)) {
+accumUpdates
+  } else {
+accumUpdates.filter(info => info.name.isDefined && info.update.isDefined && info.name !=
--- End diff --

that also works





[GitHub] spark issue #18377: [SPARK-18016][SQL][CATALYST][BRANCH-2.2] Code Generation...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18377
  
thanks, merging to 2.2!





[GitHub] spark issue #18354: [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18354
  
thanks, merging to 2.1!





[GitHub] spark pull request #18362: [SPARK-20832][Core] Standalone master should expl...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18362#discussion_r123423356
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -454,6 +464,12 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
 }(ThreadUtils.sameThread)
   }
 
+  protected def removeWorker(workerId: String, host: String, message: String): Unit = {
+driverEndpoint.ask[Boolean](RemoveWorker(workerId, host, message)).onFailure { case t =>
+logError(t.getMessage, t)
--- End diff --

style nit: `case t => `





[GitHub] spark pull request #18362: [SPARK-20832][Core] Standalone master should expl...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18362#discussion_r123423287
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -569,6 +569,12 @@ private[spark] class TaskSchedulerImpl 
private[scheduler](
 }
   }
 
+  override def workerRemoved(workerId: String, host: String, message: String): Unit = {
+logInfo(s"Handle removed worker $workerId: $message")
+dagScheduler.workerRemoved(workerId, host, message)
+backend.reviveOffers()
--- End diff --

why this line?





[GitHub] spark pull request #18362: [SPARK-20832][Core] Standalone master should expl...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18362#discussion_r123423207
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1432,6 +1439,26 @@ class DAGScheduler(
 }
   }
 
+  /**
+   * Responds to a worker being removed. This is called inside the event 
loop, so it assumes it can
+   * modify the scheduler's internal state. Use workerRemoved() to post a 
loss event from outside.
+   *
+   * We will assume that we've lost all shuffle blocks associated with the 
host if a worker is
--- End diff --

what if the worker is replaced? 
https://github.com/apache/spark/pull/18362/files#diff-29dffdccd5a7f4c8b496c293e87c8668R759





[GitHub] spark issue #18384: [SPARK-21170] [CORE] Utils.tryWithSafeFinallyAndFailureC...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18384
  
Can one of the admins verify this patch?





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18378
  
**[Test build #78427 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78427/testReport)**
 for PR 18378 at commit 
[`36dc5e7`](https://github.com/apache/spark/commit/36dc5e7df4549270e66b33d4d171898e8b21faae).





[GitHub] spark pull request #18362: [SPARK-20832][Core] Standalone master should expl...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18362#discussion_r123423005
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala 
---
@@ -158,6 +158,8 @@ private[deploy] object DeployMessages {
 
   case class ApplicationRemoved(message: String)
 
+  case class WorkerRemoved(id: String, host: String, message: String)
--- End diff --

do we still need the `ExecutorUpdated.workerLost`?





[GitHub] spark pull request #18384: [SPARK-21170] [CORE] Utils.tryWithSafeFinallyAndF...

2017-06-21 Thread devaraj-kavali
GitHub user devaraj-kavali opened a pull request:

https://github.com/apache/spark/pull/18384

[SPARK-21170] [CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws 
IllegalArgumentException: Self-suppression not permitted

## What changes were proposed in this pull request?

Do not add the exception to the suppressed exceptions if it is the same instance 
as originalThrowable.
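
In other words (a minimal hedged sketch, not the actual Utils.tryWithSafeFinallyAndFailureCallbacks code): Throwable.addSuppressed(t) throws IllegalArgumentException("Self-suppression not permitted") when t is the same instance as the throwable itself, so the fix is a reference-equality guard:

    def addSuppressedIfDistinct(originalThrowable: Throwable, t: Throwable): Unit = {
      // Only attach t when it is a different instance from the original throwable.
      if (originalThrowable ne t) {
        originalThrowable.addSuppressed(t)
      }
    }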

## How was this patch tested?

Added new tests to verify this; these tests fail without the source code changes 
and pass with the change.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/devaraj-kavali/spark SPARK-21170

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18384.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18384


commit d3f8846dbbccb52318d1f71cd4e5ecd352ae738e
Author: Devaraj K 
Date:   2017-06-22T05:04:08Z

[SPARK-21170] [CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws
IllegalArgumentException: Self-suppression not permitted







[GitHub] spark pull request #18118: [SPARK-20199][ML] : Provided featureSubsetStrateg...

2017-06-21 Thread pralabhkumar
Github user pralabhkumar commented on a diff in the pull request:

https://github.com/apache/spark/pull/18118#discussion_r123422892
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -192,6 +196,9 @@ object GBTClassifier extends 
DefaultParamsReadable[GBTClassifier] {
 
   @Since("2.0.0")
   override def load(path: String): GBTClassifier = super.load(path)
+
+  final val supportedFeatureSubsetStrategies: Array[String] =
--- End diff --

@sethah 
Done 





[GitHub] spark pull request #15417: [SPARK-17851][SQL][TESTS] Make sure all test sqls...

2017-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15417#discussion_r12342
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisTest.scala
 ---
@@ -55,6 +55,14 @@ trait AnalysisTest extends PlanTest {
 comparePlans(actualPlan, expectedPlan)
   }
 
+  protected override def comparePlans(
+  plan1: LogicalPlan,
+  plan2: LogicalPlan,
+  checkAnalysis: Boolean = false): Unit = {
+// Analysis tests may have not been fully resolved, so skip 
checkAnalysis.
--- End diff --

this comment explains why we set the default value of `checkAnalysis` to 
false, right?





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18378
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78426/
Test FAILed.





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18378
  
**[Test build #78426 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78426/testReport)**
 for PR 18378 at commit 
[`36f9cb6`](https://github.com/apache/spark/commit/36f9cb63f21600db4a95ce05d370a72245649100).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18378
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18378
  
**[Test build #78426 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78426/testReport)**
 for PR 18378 at commit 
[`36f9cb6`](https://github.com/apache/spark/commit/36f9cb63f21600db4a95ce05d370a72245649100).





[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18118
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18118
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78420/
Test PASSed.





[GitHub] spark pull request #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18382#discussion_r123420890
  
--- Diff: R/pkg/R/context.R ---
@@ -295,6 +295,23 @@ setCheckpointDirSC <- function(sc, dirName) {
   invisible(callJMethod(sc, "setCheckpointDir", 
suppressWarnings(normalizePath(dirName
 }
 
+#' Set a human readable description of the current job.
+#'
+#' Set a description that is shown as a job description in UI.
+#'
+#' @rdname setJobDescription
+#' @param value The job description of the current job.
+#' @export
+#' @examples
+#'\dontrun{
+#' setJobDescription("This is an example job.")
+#'}
+#' @note setJobDescription since 2.3.0
+setJobDescription <- function(value) {
--- End diff --

and this should probably go to 
https://github.com/apache/spark/blob/422aa67d1bb84f913b06e6d94615adb6557e2870/R/pkg/R/sparkR.R#L536





[GitHub] spark pull request #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18382#discussion_r123420666
  
--- Diff: R/pkg/NAMESPACE ---
@@ -403,6 +403,7 @@ export("as.DataFrame",
"refreshTable",
"setCheckpointDir",
"setCurrentDatabase",
+   "setJobDescription",
--- End diff --

I think this should go to 
https://github.com/HyukjinKwon/spark/blob/c591e63d5937403bb462bdf76795af8e3f324aca/R/pkg/NAMESPACE#L78





[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123420930
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -258,23 +256,7 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
 if (mainClass == null && SparkSubmit.isUserJar(primaryResource)) {
   SparkSubmit.printErrorAndExit("No main class set in JAR; please 
specify one with --class")
 }
-if (driverMemory != null
-&& Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) <= 0) {
-  SparkSubmit.printErrorAndExit("Driver Memory must be a positive number")
-}
-if (executorMemory != null
-&& Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) <= 0) {
-  SparkSubmit.printErrorAndExit("Executor Memory cores must be a positive number")
-}
-if (executorCores != null && Try(executorCores.toInt).getOrElse(-1) <= 0) {
-  SparkSubmit.printErrorAndExit("Executor cores must be a positive number")
-}
-if (totalExecutorCores != null && Try(totalExecutorCores.toInt).getOrElse(-1) <= 0) {
-  SparkSubmit.printErrorAndExit("Total executor cores must be a positive number")
-}
-if (numExecutors != null && Try(numExecutors.toInt).getOrElse(-1) <= 0) {
-  SparkSubmit.printErrorAndExit("Number of executors must be a positive number")
-}
--- End diff --

@jerryshao OK, I have moved them back, thanks.





[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18118
  
**[Test build #78420 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78420/testReport)**
 for PR 18118 at commit 
[`7970293`](https://github.com/apache/spark/commit/79702933d321051222073057b25305831df84c6d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123420644
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -543,6 +545,42 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
   }
 }
 
+if (contains("spark.driver.memory")) {
+  val driverMemory = get("spark.driver.memory")
+  if (Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) <= 0) {
+throw new IllegalArgumentException(s"spark.driver.memory " +
+  s"(was ${driverMemory}) can only be a positive number")
+  }
+}
+if (contains("spark.executor.memory")) {
+  val executorMemory = get("spark.executor.memory")
+  if (Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) <= 0) {
+throw new IllegalArgumentException(s"spark.executor.memory " +
+  s"(was ${executorMemory}) can only be a positive number")
+  }
--- End diff --

@jerryshao OK, I have moved them back, thanks.





[GitHub] spark issue #18300: [SPARK-21043][SQL] Add unionByName in Dataset

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18300
  
**[Test build #78425 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78425/testReport)**
 for PR 18300 at commit 
[`b17a14e`](https://github.com/apache/spark/commit/b17a14e3d8c752df09c6dbd32681a6352b185116).





[GitHub] spark issue #18347: [SPARK-20599][SS] ConsoleSink should work with (batch)

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18347
  
**[Test build #78424 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78424/testReport)**
 for PR 18347 at commit 
[`5851d22`](https://github.com/apache/spark/commit/5851d226fd1531447882b9dab62728e5ce36e486).





[GitHub] spark pull request #18300: [SPARK-21043][SQL] Add unionByName in Dataset

2017-06-21 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18300#discussion_r123420282
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1764,6 +1765,70 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new Dataset containing union of rows in this Dataset and 
another Dataset.
+   *
+   * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. 
To do a SQL-style set
+   * union (that does deduplication of elements), use this function 
followed by a [[distinct]].
+   *
+   * The difference between this function and [[union]] is that this 
function
+   * resolves columns by name (not by position):
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
+   *   df1.unionByName(df2).show
+   *
+   *   // output:
+   *   // ++++
+   *   // |col0|col1|col2|
+   *   // ++++
+   *   // |   1|   2|   3|
+   *   // |   6|   4|   5|
+   *   // ++++
+   * }}}
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+// Resolves children first to reorder output attributes in `other` by 
name
+val leftPlan = sparkSession.sessionState.executePlan(logicalPlan)
+val rightPlan = 
sparkSession.sessionState.executePlan(other.logicalPlan)
--- End diff --

yea, it seems we needn't. removed. Thanks!





[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread eatoncys
Github user eatoncys commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123420083
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala ---
@@ -704,6 +707,43 @@ class MasterSuite extends SparkFunSuite
   private def getState(master: Master): RecoveryState.Value = {
 master.invokePrivate(_state())
   }
+
+  test("Total cores is not divisible by cores per executor") {
--- End diff --

@jerryshao Ok, I have removed them, thanks





[GitHub] spark pull request #18320: [SPARK-21093][R] Terminate R's worker processes i...

2017-06-21 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18320#discussion_r123417186
  
--- Diff: R/pkg/inst/worker/daemon.R ---
@@ -47,9 +79,11 @@ while (TRUE) {
   close(inputCon)
--- End diff --

please add comment here to say "# reach here because this is a child 
process"





[GitHub] spark pull request #18320: [SPARK-21093][R] Terminate R's worker processes i...

2017-06-21 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18320#discussion_r123416154
  
--- Diff: R/pkg/inst/worker/daemon.R ---
@@ -30,8 +30,40 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT"))
 inputCon <- socketConnection(
 port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout)
 
+# Waits indefinitely for a socket connecion by default.
+selectTimeout <- NULL
+
 while (TRUE) {
-  ready <- socketSelect(list(inputCon))
+  ready <- socketSelect(list(inputCon), timeout = selectTimeout)
--- End diff --

hmm, this is pre-existing behavior, but generally I think waiting 
indefinitely for anything would be rather dangerous. probably would be good to 
follow up separately





[GitHub] spark pull request #18320: [SPARK-21093][R] Terminate R's worker processes i...

2017-06-21 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18320#discussion_r123416771
  
--- Diff: R/pkg/inst/worker/daemon.R ---
@@ -30,8 +30,40 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT"))
 inputCon <- socketConnection(
 port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout)
 
+# Waits indefinitely for a socket connecion by default.
+selectTimeout <- NULL
+
 while (TRUE) {
-  ready <- socketSelect(list(inputCon))
+  ready <- socketSelect(list(inputCon), timeout = selectTimeout)
+
+  # Note that the children should be terminated in the parent. If each 
child terminates
+  # itself, it appears that the resource is not released properly, that 
causes an unexpected
+  # termination of this daemon due to, for example, running out of file 
descriptors
+  # (see SPARK-21093). Therefore, the current implementation tries to 
retrieve children
+  # that are exited (but not terminated) and then sends a kill signal to 
terminate them properly
+  # in the parent.
+  #
+  # There are two paths that it attempts to send a signal to terminate the 
children in the parent.
+  #
+  #   1. Every second if any socket connection is not available and if 
there are child workers
+  # running.
+  #   2. Right after a socket connection is available.
+  #
+  # In other words, the parent attempts to send the signal to the children 
every second if
+  # any worker is running or right before launching other worker children 
from the following
+  # new socket connection.
+
+  # Only the process IDs of exited children are returned and the 
termination is attempted below.
+  children <- parallel:::selectChildren(timeout = 0)
+  if (is.integer(children)) {
+# If it is PIDs, there are workers exited but not terminated. Attempts 
to terminate them
--- End diff --

If I understand correctly, `children` (from `parallel:::selectChildren()`) as an integer only indicates the list of forked child processes. But it does not indicate that the child process has "exited"?

When we are in a loop checking every second (i.e. selectTimeout == 1), it sounds to me like socketSelect could return in one of the following ways:
- socket available for reading (ready == TRUE)
- socket not available for reading because it's dead or exiting (ready == FALSE)
- socket not available for reading, but only because things are running slow or the host is busy etc., so it has effectively just timed out after selectTimeout (ready == FALSE)






[GitHub] spark pull request #18320: [SPARK-21093][R] Terminate R's worker processes i...

2017-06-21 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18320#discussion_r123419616
  
--- Diff: R/pkg/inst/worker/daemon.R ---
@@ -30,8 +30,40 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT"))
 inputCon <- socketConnection(
 port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout)
 
+# Waits indefinitely for a socket connecion by default.
+selectTimeout <- NULL
+
 while (TRUE) {
-  ready <- socketSelect(list(inputCon))
+  ready <- socketSelect(list(inputCon), timeout = selectTimeout)
+
+  # Note that the children should be terminated in the parent. If each 
child terminates
+  # itself, it appears that the resource is not released properly, that 
causes an unexpected
+  # termination of this daemon due to, for example, running out of file 
descriptors
+  # (see SPARK-21093). Therefore, the current implementation tries to 
retrieve children
+  # that are exited (but not terminated) and then sends a kill signal to 
terminate them properly
+  # in the parent.
+  #
+  # There are two paths that it attempts to send a signal to terminate the 
children in the parent.
+  #
+  #   1. Every second if any socket connection is not available and if 
there are child workers
+  # running.
+  #   2. Right after a socket connection is available.
+  #
+  # In other words, the parent attempts to send the signal to the children 
every second if
+  # any worker is running or right before launching other worker children 
from the following
+  # new socket connection.
+
+  # Only the process IDs of exited children are returned and the 
termination is attempted below.
+  children <- parallel:::selectChildren(timeout = 0)
+  if (is.integer(children)) {
+# If it is PIDs, there are workers exited but not terminated. Attempts 
to terminate them
--- End diff --

right, I see your reference here 
https://github.com/apache/spark/pull/18320#discussion_r122639738
but I'm not 100% getting it when looking at the source code 
https://github.com/s-u/multicore/blob/e9d9bf21e6cf08e24cfe54d762379b4fa923765b/src/fork.c#L361





[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18114#discussion_r123419433
  
--- Diff: R/pkg/R/functions.R ---
@@ -2414,20 +2396,23 @@ setMethod("from_json", signature(x = "Column", 
schema = "structType"),
 column(jc)
   })
 
-#' from_utc_timestamp
-#'
-#' Given a timestamp, which corresponds to a certain time of day in UTC, 
returns another timestamp
-#' that corresponds to the same time of day in the given timezone.
+#' @details
+#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a 
certain time of day in UTC,
+#' returns another timestamp that corresponds to the same time of day in 
the given timezone.
 #'
-#' @param y Column to compute on.
-#' @param x time zone to use.
+#' @rdname column_datetime_diff_functions
 #'
-#' @family date time functions
-#' @rdname from_utc_timestamp
-#' @name from_utc_timestamp
-#' @aliases from_utc_timestamp,Column,character-method
+#' @aliases from_utc_timestamp from_utc_timestamp,Column,character-method
 #' @export
-#' @examples \dontrun{from_utc_timestamp(df$t, 'PST')}
+#' @examples
+#'
+#' \dontrun{
+#' tmp <- mutate(df, from_utc = from_utc_timestamp(df$time, 'PST'),
+#'  to_utc = to_utc_timestamp(df$time, 'PST'),
+#'  to_unix = unix_timestamp(df$time),
+#'  to_unix2 = unix_timestamp(df$time, 'yyyy-MM-dd HH'),
+#'  from_unix = from_unixtime(unix_timestamp(df$time)))
--- End diff --

Actually, what I found was that the documentation for `unix_timestamp` and `from_unixtime` was in `column_datetime_functions` but the examples were in `column_datetime_diff_functions`.





[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...

2017-06-21 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18114#discussion_r123419080
  
--- Diff: R/pkg/R/functions.R ---
@@ -2414,20 +2396,23 @@ setMethod("from_json", signature(x = "Column", 
schema = "structType"),
 column(jc)
   })
 
-#' from_utc_timestamp
-#'
-#' Given a timestamp, which corresponds to a certain time of day in UTC, 
returns another timestamp
-#' that corresponds to the same time of day in the given timezone.
+#' @details
+#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a 
certain time of day in UTC,
+#' returns another timestamp that corresponds to the same time of day in 
the given timezone.
 #'
-#' @param y Column to compute on.
-#' @param x time zone to use.
+#' @rdname column_datetime_diff_functions
 #'
-#' @family date time functions
-#' @rdname from_utc_timestamp
-#' @name from_utc_timestamp
-#' @aliases from_utc_timestamp,Column,character-method
+#' @aliases from_utc_timestamp from_utc_timestamp,Column,character-method
 #' @export
-#' @examples \dontrun{from_utc_timestamp(df$t, 'PST')}
+#' @examples
+#'
+#' \dontrun{
+#' tmp <- mutate(df, from_utc = from_utc_timestamp(df$time, 'PST'),
+#'  to_utc = to_utc_timestamp(df$time, 'PST'),
+#'  to_unix = unix_timestamp(df$time),
+#'  to_unix2 = unix_timestamp(df$time, 'yyyy-MM-dd HH'),
+#'  from_unix = from_unixtime(unix_timestamp(df$time)))
--- End diff --

This I don't get: why does `from_unixtime` belong to `column_datetime_diff_functions`?





[GitHub] spark pull request #18329: [SPARK-19909][SS] Disabling the usage of a tempor...

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18329#discussion_r123418793
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -264,12 +281,12 @@ final class DataStreamWriter[T] private[sql](ds: 
Dataset[T]) {
 df,
 sink,
 outputMode,
-useTempCheckpointLocation = true,
+useTempCheckpointLocation = isTempCheckpointLocationAvailable,
 trigger = trigger)
 } else {
   val (useTempCheckpointLocation, recoverFromCheckpointLocation) =
 if (source == "console") {
-  (true, false)
+  (isTempCheckpointLocationAvailable, false)
--- End diff --

@mgaido91 AFAIK whether `useTempCheckpointLocation` is `true` or `false` is based on the type of `Sink`. With your change, the semantics become whether `tmpFs` equals `defaultFs` or `defaultFs` is a local FS, so the semantics are different now. I'm not sure whether this is a valid fix.

@zsxwing would you please help to review this patch? Thanks!





[GitHub] spark issue #18327: [SPARK-21047] Add test suites for complicated cases in C...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18327
  
**[Test build #78423 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78423/testReport)**
 for PR 18327 at commit 
[`9319a2f`](https://github.com/apache/spark/commit/9319a2f73cca767b7ec3e4abb1d213bbe535d122).





[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18114#discussion_r123416302
  
--- Diff: R/pkg/R/functions.R ---
@@ -2414,20 +2396,23 @@ setMethod("from_json", signature(x = "Column", 
schema = "structType"),
 column(jc)
   })
 
-#' from_utc_timestamp
-#'
-#' Given a timestamp, which corresponds to a certain time of day in UTC, 
returns another timestamp
-#' that corresponds to the same time of day in the given timezone.
+#' @details
+#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a 
certain time of day in UTC,
+#' returns another timestamp that corresponds to the same time of day in 
the given timezone.
 #'
-#' @param y Column to compute on.
-#' @param x time zone to use.
+#' @rdname column_datetime_diff_functions
 #'
-#' @family date time functions
-#' @rdname from_utc_timestamp
-#' @name from_utc_timestamp
-#' @aliases from_utc_timestamp,Column,character-method
+#' @aliases from_utc_timestamp from_utc_timestamp,Column,character-method
 #' @export
-#' @examples \dontrun{from_utc_timestamp(df$t, 'PST')}
+#' @examples
+#'
+#' \dontrun{
+#' tmp <- mutate(df, from_utc = from_utc_timestamp(df$time, 'PST'),
+#'  to_utc = to_utc_timestamp(df$time, 'PST'),
+#'  to_unix = unix_timestamp(df$time),
+#'  to_unix2 = unix_timestamp(df$time, 'yyyy-MM-dd HH'),
+#'  from_unix = from_unixtime(unix_timestamp(df$time)))
--- End diff --

It looks like this one should go to `column_datetime_functions`, or `from_unixtime` should be in `column_datetime_diff_functions`.

`unix_timestamp` looks like a ditto.





[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18114#discussion_r123416414
  
--- Diff: R/pkg/R/functions.R ---
@@ -2414,20 +2396,23 @@ setMethod("from_json", signature(x = "Column", 
schema = "structType"),
 column(jc)
   })
 
-#' from_utc_timestamp
-#'
-#' Given a timestamp, which corresponds to a certain time of day in UTC, 
returns another timestamp
-#' that corresponds to the same time of day in the given timezone.
+#' @details
+#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a 
certain time of day in UTC,
+#' returns another timestamp that corresponds to the same time of day in 
the given timezone.
 #'
-#' @param y Column to compute on.
-#' @param x time zone to use.
+#' @rdname column_datetime_diff_functions
 #'
-#' @family date time functions
-#' @rdname from_utc_timestamp
-#' @name from_utc_timestamp
-#' @aliases from_utc_timestamp,Column,character-method
+#' @aliases from_utc_timestamp from_utc_timestamp,Column,character-method
 #' @export
-#' @examples \dontrun{from_utc_timestamp(df$t, 'PST')}
+#' @examples
+#'
+#' \dontrun{
+#' tmp <- mutate(df, from_utc = from_utc_timestamp(df$time, 'PST'),
+#'  to_utc = to_utc_timestamp(df$time, 'PST'),
+#'  to_unix = unix_timestamp(df$time),
+#'  to_unix2 = unix_timestamp(df$time, 'yyyy-MM-dd HH'),
+#'  from_unix = from_unixtime(unix_timestamp(df$time)))
--- End diff --

And ... `from_unixtime(df$t, 'yyyy/MM/dd HH')` looks like it was missed.





[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18114#discussion_r123417000
  
--- Diff: R/pkg/R/functions.R ---
@@ -34,6 +34,58 @@ NULL
 #' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))}
 NULL
 
+#' Date time functions for Column operations
+#'
+#' Date time functions defined for \code{Column}.
+#'
+#' @param x Column to compute on.
+#' @param format For \code{to_date} and \code{to_timestamp}, it is the 
string to use to parse
+#'   x Column to DateType or TimestampType. For \code{trunc}, 
it is the string used
+#'   for specifying the truncation method. For example, 
"year", "", "yy" for
+#'   truncate by year, or "month", "mon", "mm" for truncate by 
month.
+#' @param ... additional argument(s).
+#' @name column_datetime_functions
+#' @rdname column_datetime_functions
+#' @family data time functions
+#' @examples
+#' \dontrun{
+#' dts <- c("2005-01-02 18:47:22",
+#' "2005-12-24 16:30:58",
+#' "2005-10-28 07:30:05",
+#' "2005-12-28 07:01:05",
+#' "2006-01-24 00:01:10")
+#' y <- c(2.0, 2.2, 3.4, 2.5, 1.8)
+#' df <- createDataFrame(data.frame(time = as.POSIXct(dts), y = y))}
+NULL
+
+#' Date time arithmetic functions for Column operations
+#'
+#' Date time arithmetic functions defined for \code{Column}.
+#'
+#' @param y Column to compute on.
+#' @param x For class \code{Column}, it is the column used to perform 
arithmetic operations
+#'  with column \code{y}.For class \code{numeric}, it is the 
number of months or
+#'  days to be added to or subtracted from \code{y}. For class 
\code{character}, it is
+#'  \itemize{
+#'  \item \code{date_format}: date format specification.
+#'  \item \code{from_utc_timestamp, to_utc_timestamp}: time zone 
to use.
--- End diff --

little nit `\code{from_utc_timestamp}, \code{to_utc_timestamp}`





[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18114#discussion_r123416202
  
--- Diff: R/pkg/R/functions.R ---
@@ -34,6 +34,58 @@ NULL
 #' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))}
 NULL
 
+#' Date time functions for Column operations
+#'
+#' Date time functions defined for \code{Column}.
+#'
+#' @param x Column to compute on.
+#' @param format For \code{to_date} and \code{to_timestamp}, it is the 
string to use to parse
+#'   x Column to DateType or TimestampType. For \code{trunc}, 
it is the string used
+#'   for specifying the truncation method. For example, 
"year", "", "yy" for
+#'   truncate by year, or "month", "mon", "mm" for truncate by 
month.
+#' @param ... additional argument(s).
+#' @name column_datetime_functions
+#' @rdname column_datetime_functions
+#' @family data time functions
+#' @examples
+#' \dontrun{
+#' dts <- c("2005-01-02 18:47:22",
+#' "2005-12-24 16:30:58",
+#' "2005-10-28 07:30:05",
+#' "2005-12-28 07:01:05",
+#' "2006-01-24 00:01:10")
+#' y <- c(2.0, 2.2, 3.4, 2.5, 1.8)
+#' df <- createDataFrame(data.frame(time = as.POSIXct(dts), y = y))}
+NULL
+
+#' Date time arithmetic functions for Column operations
+#'
+#' Date time arithmetic functions defined for \code{Column}.
+#'
+#' @param y Column to compute on.
+#' @param x For class \code{Column}, it is the column used to perform 
arithmetic operations
+#'  with column \code{y}.For class \code{numeric}, it is the 
number of months or
--- End diff --

little nit `.F` ->`. F`





[GitHub] spark pull request #18329: [SPARK-19909][SS] Disabling the usage of a tempor...

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18329#discussion_r123417606
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -235,6 +237,21 @@ final class DataStreamWriter[T] private[sql](ds: 
Dataset[T]) {
 "write files of Hive data source directly.")
 }
 
+val hadoopConf = df.sparkSession.sessionState.newHadoopConf()
+val defaultFS = FileSystem.getDefaultUri(hadoopConf).getScheme
+val tmpFS = new URI(System.getProperty("java.io.tmpdir")).getScheme
+
+val isTempCheckpointLocationAvailable = tmpFS match {
+  case null | "file" =>
+if (defaultFS == null || defaultFS.equals("file")) {
+  true
+} else {
+  false
+}
+  case defaultFS => true
--- End diff --

@mgaido91, if defaultFS is HDFS, in which scenario would `tmpFS` ("java.io.tmpdir") be the same as `defaultFS`? From my understanding, unless we change the "java.io.tmpdir" property, `tmpFs` will always be a local FS.





[GitHub] spark pull request #18329: [SPARK-19909][SS] Disabling the usage of a tempor...

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18329#discussion_r123416837
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -17,8 +17,10 @@
 
 package org.apache.spark.sql.streaming
 
+import java.net.URI
 import java.util.Locale
 
+import org.apache.hadoop.fs.FileSystem
--- End diff --

The import ordering is not correct; third-party packages should be placed below the Scala packages. You can verify the style locally.
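As a reference, the grouping the style checks expect looks roughly like the sketch below (the individual imports are only placeholders):

```scala
// Java imports first.
import java.net.URI
import java.util.Locale

// Then scala.* imports.
import scala.util.Try

// Then third-party imports such as Hadoop.
import org.apache.hadoop.fs.FileSystem

// Spark's own packages come last.
import org.apache.spark.sql.Dataset
```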





[GitHub] spark pull request #18300: [SPARK-21043][SQL] Add unionByName in Dataset

2017-06-21 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18300#discussion_r123416779
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1764,6 +1765,70 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new Dataset containing union of rows in this Dataset and 
another Dataset.
+   *
+   * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. 
To do a SQL-style set
+   * union (that does deduplication of elements), use this function 
followed by a [[distinct]].
+   *
+   * The difference between this function and [[union]] is that this 
function
+   * resolves columns by name (not by position):
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
+   *   df1.unionByName(df2).show
+   *
+   *   // output:
+   *   // ++++
+   *   // |col0|col1|col2|
+   *   // ++++
+   *   // |   1|   2|   3|
+   *   // |   6|   4|   5|
+   *   // ++++
+   * }}}
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+// Resolves children first to reorder output attributes in `other` by 
name
+val leftPlan = sparkSession.sessionState.executePlan(logicalPlan)
+val rightPlan = 
sparkSession.sessionState.executePlan(other.logicalPlan)
+leftPlan.assertAnalyzed()
+rightPlan.assertAnalyzed()
+
+// Check column name duplication
+val resolver = sparkSession.sessionState.analyzer.resolver
+val leftOutputAttrs = leftPlan.analyzed.output
+val rightOutputAttrs = rightPlan.analyzed.output
+// SchemaUtils.checkColumnNameDuplication(
+//   leftOutputAttrs.map(_.name),
+//   "in the left attributes",
+//   sparkSession.sessionState.conf.caseSensitiveAnalysis)
+// SchemaUtils.checkColumnNameDuplication(
+//   rightOutputAttrs.map(_.name),
+//   "in the right attributes",
+//   sparkSession.sessionState.conf.caseSensitiveAnalysis)
--- End diff --

The function to check name duplication is discussed in #17758. I'm planning to use that function to check for duplication and then do the union-by-name. See the discussion: https://github.com/apache/spark/pull/18300#discussion_r122116812
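For readers following the thread, a rough sketch of the resolve-by-name idea (a hypothetical helper, not the PR's code; it assumes the duplication check from #17758 has already rejected duplicate names, and a real implementation would raise an `AnalysisException` rather than an `IllegalArgumentException`):

```scala
import org.apache.spark.sql.catalyst.analysis.Resolver
import org.apache.spark.sql.catalyst.expressions.NamedExpression
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Union}

object UnionByNameSketch {
  // Reorder `right`'s output to line up with `left`'s column names, then union positionally.
  def unionByName(left: LogicalPlan, right: LogicalPlan, resolver: Resolver): LogicalPlan = {
    val projectList: Seq[NamedExpression] = left.output.map { lattr =>
      right.output.find(rattr => resolver(lattr.name, rattr.name)).getOrElse(
        throw new IllegalArgumentException(
          s"""Cannot resolve column name "${lattr.name}" among (${right.output.map(_.name).mkString(", ")})"""))
    }
    Union(left, Project(projectList, right))
  }
}
```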





[GitHub] spark issue #17713: [SPARK-20417][SQL] Move subquery error handling to check...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17713
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78418/
Test PASSed.





[GitHub] spark issue #18228: [SPARK-21007][SQL]Add SQL function - RIGHT && LEFT

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18228
  
**[Test build #78422 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78422/testReport)**
 for PR 18228 at commit 
[`fca984d`](https://github.com/apache/spark/commit/fca984d762577f61561c5da5c983f61b22e7757a).





[GitHub] spark issue #17713: [SPARK-20417][SQL] Move subquery error handling to check...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17713
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17713: [SPARK-20417][SQL] Move subquery error handling to check...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17713
  
**[Test build #78418 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78418/testReport)**
 for PR 17713 at commit 
[`c2a7555`](https://github.com/apache/spark/commit/c2a7555703c254f524645c0136dd10c2827e1d83).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18114: [SPARK-20889][SparkR] Grouped documentation for DATETIME...

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18114
  
@felixcheung, would you give me a moment to double check? I am interested 
in this and want to help double check.





[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18382
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78421/
Test PASSed.





[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18382
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18382
  
**[Test build #78421 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78421/testReport)**
 for PR 18382 at commit 
[`c591e63`](https://github.com/apache/spark/commit/c591e63d5937403bb462bdf76795af8e3f324aca).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18383: [SPARK-21167][SS] Set kafka clientId while fetch message...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18383
  
Can one of the admins verify this patch?





[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123415327
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -543,6 +545,42 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
   }
 }
 
+if (contains("spark.driver.memory")) {
+  val driverMemory = get("spark.driver.memory")
+  if (Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) <= 
0) {
+throw new IllegalArgumentException(s"spark.driver.memory " +
+  s"(was ${driverMemory}) can only be a positive number")
+  }
+}
+if (contains("spark.executor.memory")) {
+  val executorMemory = get("spark.executor.memory")
+  if (Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) 
<= 0) {
+throw new IllegalArgumentException(s"spark.executor.memory " +
+  s"(was ${executorMemory}) can only be a positive number")
+  }
--- End diff --

These two checks above seem unnecessary; let's not change unrelated code.





[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123415688
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -258,23 +256,7 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
 if (mainClass == null && SparkSubmit.isUserJar(primaryResource)) {
   SparkSubmit.printErrorAndExit("No main class set in JAR; please 
specify one with --class")
 }
-if (driverMemory != null
-&& Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) 
<= 0) {
-  SparkSubmit.printErrorAndExit("Driver Memory must be a positive 
number")
-}
-if (executorMemory != null
-&& Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) 
<= 0) {
-  SparkSubmit.printErrorAndExit("Executor Memory cores must be a 
positive number")
-}
-if (executorCores != null && Try(executorCores.toInt).getOrElse(-1) <= 
0) {
-  SparkSubmit.printErrorAndExit("Executor cores must be a positive 
number")
-}
-if (totalExecutorCores != null && 
Try(totalExecutorCores.toInt).getOrElse(-1) <= 0) {
-  SparkSubmit.printErrorAndExit("Total executor cores must be a 
positive number")
-}
-if (numExecutors != null && Try(numExecutors.toInt).getOrElse(-1) <= 
0) {
-  SparkSubmit.printErrorAndExit("Number of executors must be a 
positive number")
-}
--- End diff --

The checks removed above are valid and useful; I'd suggest not removing them.





[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...

2017-06-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18128





[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...

2017-06-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/18128
  
merged to master.





[GitHub] spark pull request #18383: [SPARK-21167][SS] Set kafka clientId while fetch ...

2017-06-21 Thread dijingran
GitHub user dijingran opened a pull request:

https://github.com/apache/spark/pull/18383

[SPARK-21167][SS] Set kafka clientId while fetch messages

## What changes were proposed in this pull request?

Change KafkaRDD to set the Kafka clientId while fetching messages; in our case we use the clientId on the Kafka server side.

## How was this patch tested?

All tests still passed.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dijingran/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18383


commit 917eda2d98ea584e2a64fcc3b942a23b206c290f
Author: dixingxing 
Date:   2017-06-22T03:29:52Z

Set kafka clientId while fatch messages







[GitHub] spark issue #18114: [SPARK-20889][SparkR] Grouped documentation for DATETIME...

2017-06-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/18114
  
hmm, waiting for AppVeyor





[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...

2017-06-21 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18322#discussion_r123413523
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala ---
@@ -704,6 +707,43 @@ class MasterSuite extends SparkFunSuite
   private def getState(master: Master): RecoveryState.Value = {
 master.invokePrivate(_state())
   }
+
+  test("Total cores is not divisible by cores per executor") {
--- End diff --

I see; if there's no better way to verify it, I think it is not useful to add these two UTs.





[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18382
  
**[Test build #78421 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78421/testReport)**
 for PR 18382 at commit 
[`c591e63`](https://github.com/apache/spark/commit/c591e63d5937403bb462bdf76795af8e3f324aca).





[GitHub] spark pull request #18382: [SPARK-21149][R] Add job description API for R

2017-06-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18382#discussion_r123412816
  
--- Diff: R/pkg/R/context.R ---
@@ -295,6 +295,23 @@ setCheckpointDirSC <- function(sc, dirName) {
   invisible(callJMethod(sc, "setCheckpointDir", 
suppressWarnings(normalizePath(dirName
 }
 
+#' Set a human readable description of the current job.
+#'
+#' Set a description that is shown as a job description in UI.
+#'
+#' @rdname setJobDescription
+#' @param value The job description of the current job.
+#' @export
+#' @examples
+#'\dontrun{
+#' setJobDescription("This is an example job.")
+#'}
+#' @note setJobDescription since 2.3.0
--- End diff --

Just to be sure, 

![2017-06-22 12 06 
27](https://user-images.githubusercontent.com/6477701/27415694-54b2889c-5743-11e7-89c5-2d7e82e0babe.png)







[GitHub] spark issue #18362: [SPARK-20832][Core] Standalone master should explicitly ...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18362
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78415/
Test PASSed.





[GitHub] spark issue #18327: [SPARK-21047] Add test suites for complicated cases in C...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18327
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78416/
Test FAILed.





[GitHub] spark issue #18362: [SPARK-20832][Core] Standalone master should explicitly ...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18362
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18327: [SPARK-21047] Add test suites for complicated cases in C...

2017-06-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18327
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18362: [SPARK-20832][Core] Standalone master should explicitly ...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18362
  
**[Test build #78415 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78415/testReport)**
 for PR 18362 at commit 
[`fb72fb6`](https://github.com/apache/spark/commit/fb72fb6ea8dc02753d8c699f985afc70d1ccd971).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18327: [SPARK-21047] Add test suites for complicated cases in C...

2017-06-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18327
  
**[Test build #78416 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78416/testReport)**
 for PR 18327 at commit 
[`78102f3`](https://github.com/apache/spark/commit/78102f37c884b765596a7308563239acf45ed9b0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18300: [SPARK-21043][SQL] Add unionByName in Dataset

2017-06-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18300#discussion_r123412488
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1764,6 +1765,70 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new Dataset containing union of rows in this Dataset and 
another Dataset.
+   *
+   * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. 
To do a SQL-style set
+   * union (that does deduplication of elements), use this function 
followed by a [[distinct]].
+   *
+   * The difference between this function and [[union]] is that this 
function
+   * resolves columns by name (not by position):
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
+   *   df1.unionByName(df2).show
+   *
+   *   // output:
+   *   // ++++
+   *   // |col0|col1|col2|
+   *   // ++++
+   *   // |   1|   2|   3|
+   *   // |   6|   4|   5|
+   *   // ++++
+   * }}}
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+// Resolves children first to reorder output attributes in `other` by 
name
+val leftPlan = sparkSession.sessionState.executePlan(logicalPlan)
+val rightPlan = 
sparkSession.sessionState.executePlan(other.logicalPlan)
+leftPlan.assertAnalyzed()
+rightPlan.assertAnalyzed()
+
+// Check column name duplication
+val resolver = sparkSession.sessionState.analyzer.resolver
+val leftOutputAttrs = leftPlan.analyzed.output
+val rightOutputAttrs = rightPlan.analyzed.output
+// SchemaUtils.checkColumnNameDuplication(
+//   leftOutputAttrs.map(_.name),
+//   "in the left attributes",
+//   sparkSession.sessionState.conf.caseSensitiveAnalysis)
+// SchemaUtils.checkColumnNameDuplication(
+//   rightOutputAttrs.map(_.name),
+//   "in the right attributes",
+//   sparkSession.sessionState.conf.caseSensitiveAnalysis)
--- End diff --

Why are these checks commented out above?
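
For context, a minimal sketch (not from the PR; the frames and column names below are made up) of why a duplication check matters before matching columns by name:

```scala
import org.apache.spark.sql.SparkSession

object UnionByNameAmbiguitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Both left columns are named "id", so resolving right's "id" by name is ambiguous.
    val left = Seq((1, 2)).toDF("id", "id")
    val right = Seq((3, 4)).toDF("id", "value")

    // left.unionByName(right)  // which left "id" should match?
    // Failing fast on duplicated names (what the commented-out checks would do)
    // gives a clear analysis error instead of an arbitrary pick.

    spark.stop()
  }
}
```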


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123408655
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
 ---
@@ -1186,3 +1186,51 @@ case class BRound(child: Expression, scale: 
Expression)
 with Serializable with ImplicitCastInputTypes {
   def this(child: Expression) = this(child, Literal(0))
 }
+
+/**
+ *  The bucket number into which
--- End diff --

nit: Returns the bucket number


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123409892
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/operators.sql ---
@@ -92,3 +92,8 @@ select abs(-3.13), abs('-2.19');
 
 -- positive/negative
 select positive('-1.11'), positive(-1.11), negative('-1.11'), 
negative(-1.11);
+
+-- width_bucket
--- End diff --

Instead, we need to add a test for SQL queries using this function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123401825
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -631,3 +631,109 @@ abstract class TernaryExpression extends Expression {
 }
   }
 }
+
+/**
+ * An expression with four inputs and one output. The output is by default 
evaluated to null
+ * if any input is evaluated to null.
+ */
+abstract class QuaternaryExpression extends Expression {
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = children.exists(_.nullable)
+
+  /**
+   * Default behavior of evaluation according to the default nullability 
of TernaryExpression.
+   * If subclass of TernaryExpression override nullable, probably should 
also override this.
+   */
+  override def eval(input: InternalRow): Any = {
+val exprs = children
+val value1 = exprs(0).eval(input)
+if (value1 != null) {
+  val value2 = exprs(1).eval(input)
+  if (value2 != null) {
+val value3 = exprs(2).eval(input)
+if (value3 != null) {
+  val value4 = exprs(3).eval(input)
+  if (value4 != null) {
+return nullSafeEval(value1, value2, value3, value4)
+  }
+}
+  }
+}
+null
+  }
+
+  /**
+   * Called by default [[eval]] implementation.  If subclass of 
TernaryExpression keep the default
+   * nullability, they can override this method to save null-check code.  
If we need full control
+   * of evaluation process, we should override [[eval]].
+   */
+  protected def nullSafeEval(input1: Any, input2: Any, input3: Any, 
input4: Any): Any =
+sys.error(s"TernaryExpressions must override either eval or 
nullSafeEval")
+
+  /**
+   * Short hand for generating ternary evaluation code.
+   * If either of the sub-expressions is null, the result of this 
computation
+   * is assumed to be null.
+   *
+   * @param f accepts three variable names and returns Java code to 
compute the output.
+   */
+  protected def defineCodeGen(
+ctx: CodegenContext,
+ev: ExprCode,
+f: (String, String, String, String) => String): ExprCode = {
+nullSafeCodeGen(ctx, ev, (eval1, eval2, eval3, eval4) => {
+  s"${ev.value} = ${f(eval1, eval2, eval3, eval3)};"
--- End diff --

the last `eval3` -> `eval4`
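
For clarity, a one-line sketch of the suggested fix, passing the fourth result through:

```scala
nullSafeCodeGen(ctx, ev, (eval1, eval2, eval3, eval4) => {
  // Use eval4 for the fourth argument instead of repeating eval3.
  s"${ev.value} = ${f(eval1, eval2, eval3, eval4)};"
})
```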


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123410671
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.AnalysisException
+
+object MathUtils {
+
+  /**
+   *  Returns the bucket number into which
+   *  the value of this expression would fall after being evaluated.
+   *
+   * @param expr id the expression for which the histogram is being created
+   * @param minValue is an expression that resolves
+   * to the minimum end point of the acceptable range for 
expr
+   * @param maxValue is an expression that resolves
+   * to the maximum end point of the acceptable range for 
expr
+   * @param numBucket is an An expression that resolves to
+   *  a constant indicating the number of buckets
+   * @return Returns an long between 0 and numBucket+1 by mapping the expr 
into buckets defined by
+   * the range [minValue, maxValue]
+   */
+  def widthBucket(expr: Double, minValue: Double, maxValue: Double, 
numBucket: Long): Long = {
+
+if (numBucket <= 0) {
+  throw new AnalysisException(s"The num of bucket must be greater than 
0, but got ${numBucket}")
+}
+
+val lower: Double = Math.min(minValue, maxValue)
+val upper: Double = Math.max(minValue, maxValue)
+
+val preResult: Long = if (expr < lower) {
+  0
+} else if (expr >= upper) {
+  Math.addExact(numBucket, 1)
+} else {
+  (numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong
+}
+
+val result = if (minValue > maxValue) (numBucket - preResult) + 1 else 
preResult
+result
--- End diff --

nit: just return `if (minValue > maxValue) (numBucket - preResult) + 1 else preResult` directly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123411306
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.AnalysisException
+
+object MathUtils {
+
+  /**
+   *  Returns the bucket number into which
+   *  the value of this expression would fall after being evaluated.
+   *
+   * @param expr id the expression for which the histogram is being created
+   * @param minValue is an expression that resolves
+   * to the minimum end point of the acceptable range for 
expr
+   * @param maxValue is an expression that resolves
+   * to the maximum end point of the acceptable range for 
expr
+   * @param numBucket is an An expression that resolves to
+   *  a constant indicating the number of buckets
+   * @return Returns an long between 0 and numBucket+1 by mapping the expr 
into buckets defined by
+   * the range [minValue, maxValue]
+   */
+  def widthBucket(expr: Double, minValue: Double, maxValue: Double, 
numBucket: Long): Long = {
+
+if (numBucket <= 0) {
+  throw new AnalysisException(s"The num of bucket must be greater than 
0, but got ${numBucket}")
+}
+
+val lower: Double = Math.min(minValue, maxValue)
+val upper: Double = Math.max(minValue, maxValue)
+
+val preResult: Long = if (expr < lower) {
+  0
+} else if (expr >= upper) {
+  Math.addExact(numBucket, 1)
+} else {
+  (numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong
--- End diff --

what if upper == lower?
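
Tracing the quoted implementation (condensed below, with the `numBucket <= 0` check left out), `lower == upper` means every input hits one of the first two branches, so the division never runs but no value can land in a real bucket:

```scala
// Condensed copy of the quoted widthBucket logic, for tracing the degenerate range only.
def widthBucket(expr: Double, minValue: Double, maxValue: Double, numBucket: Long): Long = {
  val lower = math.min(minValue, maxValue)
  val upper = math.max(minValue, maxValue)
  val preResult =
    if (expr < lower) 0L
    else if (expr >= upper) numBucket + 1
    else (numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong
  if (minValue > maxValue) (numBucket - preResult) + 1 else preResult
}

widthBucket(4.9, 5.0, 5.0, 10)  // 0  -> below the degenerate range
widthBucket(5.0, 5.0, 5.0, 10)  // 11 -> >= upper, overflow bucket
widthBucket(5.1, 5.0, 5.0, 10)  // 11
```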


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123411090
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.AnalysisException
+
+object MathUtils {
+
+  /**
+   *  Returns the bucket number into which
+   *  the value of this expression would fall after being evaluated.
+   *
+   * @param expr id the expression for which the histogram is being created
+   * @param minValue is an expression that resolves
+   * to the minimum end point of the acceptable range for 
expr
+   * @param maxValue is an expression that resolves
+   * to the maximum end point of the acceptable range for 
expr
+   * @param numBucket is an An expression that resolves to
+   *  a constant indicating the number of buckets
+   * @return Returns an long between 0 and numBucket+1 by mapping the expr 
into buckets defined by
+   * the range [minValue, maxValue]
+   */
+  def widthBucket(expr: Double, minValue: Double, maxValue: Double, 
numBucket: Long): Long = {
+
+if (numBucket <= 0) {
+  throw new AnalysisException(s"The num of bucket must be greater than 
0, but got ${numBucket}")
+}
+
+val lower: Double = Math.min(minValue, maxValue)
+val upper: Double = Math.max(minValue, maxValue)
+
+val preResult: Long = if (expr < lower) {
+  0
+} else if (expr >= upper) {
+  Math.addExact(numBucket, 1)
--- End diff --

Do we really need to use `addExact`? In Oracle's doc, `numBucket` is an 
integer, so we can just use `numBucket + 1L` here.
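
For reference, the practical difference is only the overflow behavior (plain JVM semantics, independent of this PR):

```scala
// Math.addExact throws on overflow instead of wrapping silently.
Math.addExact(Long.MaxValue, 1L)  // throws ArithmeticException: long overflow
Long.MaxValue + 1L                // silently wraps to Long.MinValue

// If numBucket is bounded by an Int (as in Oracle), numBucket + 1L cannot
// overflow a Long, so the simpler form is safe here.
```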


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123410649
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.AnalysisException
+
+object MathUtils {
+
+  /**
+   *  Returns the bucket number into which
+   *  the value of this expression would fall after being evaluated.
+   *
+   * @param expr id the expression for which the histogram is being created
+   * @param minValue is an expression that resolves
+   * to the minimum end point of the acceptable range for 
expr
+   * @param maxValue is an expression that resolves
+   * to the maximum end point of the acceptable range for 
expr
+   * @param numBucket is an An expression that resolves to
+   *  a constant indicating the number of buckets
+   * @return Returns an long between 0 and numBucket+1 by mapping the expr 
into buckets defined by
+   * the range [minValue, maxValue]
--- End diff --

Both endpoints are included? Can you check with other databases and add a 
comment here?
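
Going by the quoted implementation and the test values used later in this thread, the lower endpoint maps to bucket 1 while the upper endpoint falls into the overflow bucket; a sketch against the PR's `MathUtils.widthBucket`:

```scala
import org.apache.spark.sql.catalyst.util.MathUtils.widthBucket

// Range [100, 5000] split into 10 buckets (values from the PR's test suite):
widthBucket(99, 100, 5000, 10)    // 0  -> below the range
widthBucket(100, 100, 5000, 10)   // 1  -> minValue is included in the first bucket
widthBucket(5000, 100, 5000, 10)  // 11 -> maxValue lands in the overflow bucket numBucket + 1
```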


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123402665
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -631,3 +631,109 @@ abstract class TernaryExpression extends Expression {
 }
   }
 }
+
+/**
+ * An expression with four inputs and one output. The output is by default 
evaluated to null
+ * if any input is evaluated to null.
+ */
+abstract class QuaternaryExpression extends Expression {
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = children.exists(_.nullable)
+
+  /**
+   * Default behavior of evaluation according to the default nullability 
of TernaryExpression.
+   * If subclass of TernaryExpression override nullable, probably should 
also override this.
+   */
+  override def eval(input: InternalRow): Any = {
+val exprs = children
+val value1 = exprs(0).eval(input)
+if (value1 != null) {
+  val value2 = exprs(1).eval(input)
+  if (value2 != null) {
+val value3 = exprs(2).eval(input)
+if (value3 != null) {
+  val value4 = exprs(3).eval(input)
+  if (value4 != null) {
+return nullSafeEval(value1, value2, value3, value4)
+  }
+}
+  }
+}
+null
+  }
+
+  /**
+   * Called by default [[eval]] implementation.  If subclass of 
TernaryExpression keep the default
+   * nullability, they can override this method to save null-check code.  
If we need full control
+   * of evaluation process, we should override [[eval]].
+   */
+  protected def nullSafeEval(input1: Any, input2: Any, input3: Any, 
input4: Any): Any =
+sys.error(s"TernaryExpressions must override either eval or 
nullSafeEval")
+
+  /**
+   * Short hand for generating ternary evaluation code.
+   * If either of the sub-expressions is null, the result of this 
computation
+   * is assumed to be null.
+   *
+   * @param f accepts three variable names and returns Java code to 
compute the output.
+   */
+  protected def defineCodeGen(
+ctx: CodegenContext,
+ev: ExprCode,
+f: (String, String, String, String) => String): ExprCode = {
+nullSafeCodeGen(ctx, ev, (eval1, eval2, eval3, eval4) => {
+  s"${ev.value} = ${f(eval1, eval2, eval3, eval3)};"
+})
+  }
+
+  /**
+   * Short hand for generating ternary evaluation code.
+   * If either of the sub-expressions is null, the result of this 
computation
+   * is assumed to be null.
+   *
+   * @param f function that accepts the 3 non-null evaluation result names 
of children
--- End diff --

3 -> 4


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123408856
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.AnalysisException
+
+object MathUtils {
+
+  /**
+   *  Returns the bucket number into which
+   *  the value of this expression would fall after being evaluated.
+   *
+   * @param expr id the expression for which the histogram is being created
--- End diff --

id -> is


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123409716
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/operators.sql ---
@@ -92,3 +92,8 @@ select abs(-3.13), abs('-2.19');
 
 -- positive/negative
 select positive('-1.11'), positive(-1.11), negative('-1.11'), 
negative(-1.11);
+
+-- width_bucket
--- End diff --

Do we need end-to-end tests here? I think we already cover these cases in 
other test suites.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123410449
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.AnalysisException
+
+object MathUtils {
+
+  /**
+   *  Returns the bucket number into which
+   *  the value of this expression would fall after being evaluated.
+   *
+   * @param expr id the expression for which the histogram is being created
+   * @param minValue is an expression that resolves
+   * to the minimum end point of the acceptable range for 
expr
+   * @param maxValue is an expression that resolves
+   * to the maximum end point of the acceptable range for 
expr
+   * @param numBucket is an An expression that resolves to
+   *  a constant indicating the number of buckets
+   * @return Returns an long between 0 and numBucket+1 by mapping the expr 
into buckets defined by
+   * the range [minValue, maxValue]
+   */
+  def widthBucket(expr: Double, minValue: Double, maxValue: Double, 
numBucket: Long): Long = {
+
+if (numBucket <= 0) {
+  throw new AnalysisException(s"The num of bucket must be greater than 
0, but got ${numBucket}")
+}
+
+val lower: Double = Math.min(minValue, maxValue)
+val upper: Double = Math.max(minValue, maxValue)
--- End diff --

Do other databases allow the max value to appear first? e.g. 
`widthBucket(3.14, 4, 0, 3, 1)`
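
For what it's worth, the quoted code itself does accept a reversed range and flips the bucket number; a worked trace with a hypothetical call:

```scala
import org.apache.spark.sql.catalyst.util.MathUtils.widthBucket

// Hypothetical call with minValue (4) greater than maxValue (0):
val b = widthBucket(3.14, 4, 0, 3)
// lower = 0, upper = 4
// preResult = (3 * (3.14 - 0) / 4 + 1).toLong = 3
// minValue > maxValue, so b = (3 - 3) + 1 = 1:
// 3.14 falls into the first bucket counted from 4 down towards 0.
```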


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2017-06-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18323#discussion_r123412204
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/MathUtilsSuite.scala
 ---
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.util.MathUtils._
+
+class MathUtilsSuite extends SparkFunSuite {
+
+  test("widthBucket") {
+assert(widthBucket(5.35, 0.024, 10.06, 5) === 3)
+assert(widthBucket(99, 100, 5000, 10) === 0)
+assert(widthBucket(100, 100, 5000, 10) === 1)
+assert(widthBucket(590, 100, 5000, 10) === 2)
+assert(widthBucket(5000, 100, 5000, 10) === 11)
+assert(widthBucket(6000, 100, 5000, 10) === 11)
--- End diff --

Can you use the same cases from Oracle's doc? Just to make sure we are 
getting the same results here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


