[GitHub] spark issue #18322: [SPARK-21115][Core]If the cores left is less than the co...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18322 @jiangxb1987 can you please help to review this PR? This is a simple code improvement to avoid some unnecessary code execution when left cores is not enough for one executor. I don't strong inclination on this PR since previous code also does correct behavior, I'd like to hear your thoughts. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17113 **[Test build #78431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78431/testReport)** for PR 17113 at commit [`3cf9cfd`](https://github.com/apache/spark/commit/3cf9cfd0ef78e0d0cc2780e563968ed2bd22ac39). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17113: [SPARK-13669][SPARK-20898][Core] Improve the blacklist m...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/17113 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9518: [SPARK-11574][Core] Add metrics StatsD sink
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/9518#discussion_r123420182 --- Diff: core/src/main/scala/org/apache/spark/metrics/sink/StatsdReporter.scala --- @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.sink + +import java.io.IOException +import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress} +import java.nio.charset.StandardCharsets.UTF_8 +import java.util.SortedMap +import java.util.concurrent.TimeUnit + +import scala.collection.JavaConverters._ +import scala.util.{Failure, Success, Try} + +import com.codahale.metrics._ +import org.apache.hadoop.net.NetUtils + +import org.apache.spark.Logging + +/** + * @see https://github.com/etsy/statsd/blob/master/docs/metric_types.md;> + *StatsD metric types + */ +private[spark] sealed trait StatsdMetricType { --- End diff -- This trait looks only used for defining some values, may be we could use `object StatsdReporter` instead. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #9518: [SPARK-11574][Core] Add metrics StatsD sink
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/9518#discussion_r123425292 --- Diff: core/src/main/scala/org/apache/spark/metrics/sink/StatsdReporter.scala --- @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.sink + +import java.io.IOException +import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress} +import java.nio.charset.StandardCharsets.UTF_8 +import java.util.SortedMap +import java.util.concurrent.TimeUnit + +import scala.collection.JavaConverters._ +import scala.util.{Failure, Success, Try} + +import com.codahale.metrics._ +import org.apache.hadoop.net.NetUtils + +import org.apache.spark.Logging + +/** + * @see https://github.com/etsy/statsd/blob/master/docs/metric_types.md;> + *StatsD metric types + */ +private[spark] sealed trait StatsdMetricType { + val COUNTER = "c" + val GAUGE = "g" + val TIMER = "ms" + val Set = "s" +} + +private[spark] class StatsdReporter( +registry: MetricRegistry, +host: String = "127.0.0.1", +port: Int = 8125, +prefix: String = "", +filter: MetricFilter = MetricFilter.ALL, +rateUnit: TimeUnit = TimeUnit.SECONDS, +durationUnit: TimeUnit = TimeUnit.MILLISECONDS) + extends ScheduledReporter(registry, "statsd-reporter", filter, rateUnit, durationUnit) + with StatsdMetricType with Logging { + + private val address = new InetSocketAddress(host, port) + private val whitespace = "[\\s]+".r + + override def report( + gauges: SortedMap[String, Gauge[_]], + counters: SortedMap[String, Counter], + histograms: SortedMap[String, Histogram], + meters: SortedMap[String, Meter], + timers: SortedMap[String, Timer]): Unit = +Try(new DatagramSocket) match { + case Failure(ioe: IOException) => logWarning("StatsD datagram socket construction failed", +NetUtils.wrapException(host, port, "0.0.0.0", 0, ioe)) + case Failure(e) => logWarning("StatsD datagram socket construction failed", e) + case Success(s) => +implicit val socket = s +val localAddress = Try(socket.getLocalAddress).map(_.getHostAddress).getOrElse(null) +val localPort = socket.getLocalPort +Try { + gauges.entrySet.asScala.foreach(e => reportGauge(e.getKey, e.getValue)) + counters.entrySet.asScala.foreach(e => reportCounter(e.getKey, e.getValue)) + histograms.entrySet.asScala.foreach(e => reportHistogram(e.getKey, e.getValue)) + meters.entrySet.asScala.foreach(e => reportMetered(e.getKey, e.getValue)) + timers.entrySet.asScala.foreach(e => reportTimer(e.getKey, e.getValue)) +} recover { + case ioe: IOException => +logDebug(s"Unable to send packets to StatsD", NetUtils.wrapException( + address.getHostString, address.getPort, localAddress, localPort, ioe)) + case e: Throwable => logDebug(s"Unable to send packets to StatsD at '$host:$port'", e) +} +Try(socket.close()) recover { + case ioe: IOException => +logDebug("Error when close socket to StatsD", NetUtils.wrapException( + address.getHostString, address.getPort, localAddress, localPort, ioe)) + case e: Throwable => logDebug("Error when close socket to StatsD", e) +} +} + + private def reportGauge(name: String, gauge: Gauge[_])(implicit socket: DatagramSocket) = +formatAny(gauge.getValue).foreach(v => send(fullName(name), v, GAUGE)) + + private def reportCounter(name: String, counter: Counter)(implicit socket: DatagramSocket) = +send(fullName(name), format(counter.getCount), COUNTER) + + private def reportHistogram(name: String, histogram: Histogram) + (implicit
[GitHub] spark pull request #9518: [SPARK-11574][Core] Add metrics StatsD sink
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/9518#discussion_r123424723 --- Diff: core/src/main/scala/org/apache/spark/metrics/sink/StatsdReporter.scala --- @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.sink + +import java.io.IOException +import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress} +import java.nio.charset.StandardCharsets.UTF_8 +import java.util.SortedMap +import java.util.concurrent.TimeUnit + +import scala.collection.JavaConverters._ +import scala.util.{Failure, Success, Try} + +import com.codahale.metrics._ +import org.apache.hadoop.net.NetUtils + +import org.apache.spark.Logging + +/** + * @see https://github.com/etsy/statsd/blob/master/docs/metric_types.md;> + *StatsD metric types + */ +private[spark] sealed trait StatsdMetricType { + val COUNTER = "c" + val GAUGE = "g" + val TIMER = "ms" + val Set = "s" +} + +private[spark] class StatsdReporter( +registry: MetricRegistry, +host: String = "127.0.0.1", +port: Int = 8125, +prefix: String = "", +filter: MetricFilter = MetricFilter.ALL, +rateUnit: TimeUnit = TimeUnit.SECONDS, +durationUnit: TimeUnit = TimeUnit.MILLISECONDS) + extends ScheduledReporter(registry, "statsd-reporter", filter, rateUnit, durationUnit) + with StatsdMetricType with Logging { + + private val address = new InetSocketAddress(host, port) + private val whitespace = "[\\s]+".r + + override def report( + gauges: SortedMap[String, Gauge[_]], + counters: SortedMap[String, Counter], + histograms: SortedMap[String, Histogram], + meters: SortedMap[String, Meter], + timers: SortedMap[String, Timer]): Unit = +Try(new DatagramSocket) match { + case Failure(ioe: IOException) => logWarning("StatsD datagram socket construction failed", +NetUtils.wrapException(host, port, "0.0.0.0", 0, ioe)) + case Failure(e) => logWarning("StatsD datagram socket construction failed", e) + case Success(s) => +implicit val socket = s +val localAddress = Try(socket.getLocalAddress).map(_.getHostAddress).getOrElse(null) +val localPort = socket.getLocalPort +Try { + gauges.entrySet.asScala.foreach(e => reportGauge(e.getKey, e.getValue)) + counters.entrySet.asScala.foreach(e => reportCounter(e.getKey, e.getValue)) + histograms.entrySet.asScala.foreach(e => reportHistogram(e.getKey, e.getValue)) + meters.entrySet.asScala.foreach(e => reportMetered(e.getKey, e.getValue)) + timers.entrySet.asScala.foreach(e => reportTimer(e.getKey, e.getValue)) +} recover { + case ioe: IOException => +logDebug(s"Unable to send packets to StatsD", NetUtils.wrapException( + address.getHostString, address.getPort, localAddress, localPort, ioe)) + case e: Throwable => logDebug(s"Unable to send packets to StatsD at '$host:$port'", e) +} +Try(socket.close()) recover { + case ioe: IOException => +logDebug("Error when close socket to StatsD", NetUtils.wrapException( + address.getHostString, address.getPort, localAddress, localPort, ioe)) + case e: Throwable => logDebug("Error when close socket to StatsD", e) +} +} + + private def reportGauge(name: String, gauge: Gauge[_])(implicit socket: DatagramSocket) = +formatAny(gauge.getValue).foreach(v => send(fullName(name), v, GAUGE)) + + private def reportCounter(name: String, counter: Counter)(implicit socket: DatagramSocket) = +send(fullName(name), format(counter.getCount), COUNTER) + + private def reportHistogram(name: String, histogram: Histogram) + (implicit
[GitHub] spark issue #18371: [SPARK-20889][SparkR] Grouped documentation for MATH col...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18371 **[Test build #78430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78430/testReport)** for PR 18371 at commit [`707b871`](https://github.com/apache/spark/commit/707b871160574297ef8eb75859d05d9ab13df02c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18378#discussion_r123425404 --- Diff: python/pyspark/sql/dataframe.py --- @@ -1721,7 +1721,14 @@ def toPandas(self): 15Bob """ import pandas as pd -return pd.DataFrame.from_records(self.collect(), columns=self.columns) + +dtype = {} +for field in self.schema: +pandas_type = _to_corrected_pandas_type(field.dataType) +if (pandas_type): +dtype[field.name] = pandas_type + +return pd.DataFrame.from_records(self.collect(), columns=self.columns).astype(dtype) --- End diff -- The param `copy` of `astype` is true by default. Seems to me we don't need copying the data? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18371: [SPARK-20889][SparkR] Grouped documentation for M...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/18371#discussion_r123425200 --- Diff: R/pkg/R/functions.R --- @@ -34,6 +34,30 @@ NULL #' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))} NULL +#' Math functions for Column operations +#' +#' Math functions defined for \code{Column}. +#' +#' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and \code{shiftRightUnsigned}, +#' this is the number of bits to shift. +#' @param y Column to compute on. +#' @param ... additional argument(s). +#' @name column_math_functions +#' @rdname column_math_functions +#' @family math functions +#' @examples +#' \dontrun{ +#' # Dataframe used throughout this doc +#' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars)) +#' tmp <- mutate(df, v1 = log(df$mpg), v2 = cbrt(df$disp), +#' v3 = bround(df$wt, 1), v4 = bin(df$cyl), +#' v5 = hex(df$wt), v6 = toDegrees(df$gear), +#' v7 = atan2(df$cyl, df$am), v8 = hypot(df$cyl, df$am), +#' v9 = pmod(df$hp, df$cyl), v10 = shiftLeft(df$disp, 1), +#' v11 = conv(df$hp, 10, 16)) --- End diff -- Three more examples added. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18371: [SPARK-20889][SparkR] Grouped documentation for MATH col...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/18371 Made another commit that addresses your comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18371: [SPARK-20889][SparkR] Grouped documentation for M...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/18371#discussion_r123425179 --- Diff: R/pkg/R/functions.R --- @@ -1405,18 +1309,12 @@ setMethod("sha1", column(jc) }) -#' signum -#' -#' Computes the signum of the given value. -#' -#' @param x Column to compute on. +#' @details +#' \code{signum}: Computes the signum of the given value. --- End diff -- OK. fixed this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18378 **[Test build #78429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78429/testReport)** for PR 18378 at commit [`1e98c49`](https://github.com/apache/spark/commit/1e98c494e0c414ca218b029bfc1a9d9faf3c2960). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18378 It sounds ok to me just except missing `_have_pandas = False` above `try:` . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17519: [SPARK-15352][Doc] follow-up: add configuration d...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17519#discussion_r123424505 --- Diff: docs/configuration.md --- @@ -1004,14 +1004,48 @@ Apart from these, the following properties are also available, and may be useful - spark.storage.replication.proactive + spark.storage.replication.proactive false Enables proactive block replication for RDD blocks. Cached RDD block replicas lost due to executor failures are replenished if there are any existing available replicas. This tries to get the replication level of the block to the initial number. + + spark.storage.replication.policy + +org.apache.spark.storage.RandomBlockReplicationPolicy + + +The policy to use for choosing peers when replicating blocks. The default policy would randomly +choose the peers to replicate to. A more resilient replication policy is provided by +org.apache.spark.storage.BasicBlockReplicationPolicy, which makes use of the +topology information of the hosts to choose the peers, much like the HDFS blocks replication +strategy: it would try to choose the first replica within the same rack, and a third replica on +a different rack. See spark.storage.replication.topologyMapper below for how to +provide the topology information for the hosts. + + + + spark.storage.replication.topologyMapper + +org.apache.spark.storage.DefaultTopologyMapper + + +The topology information of a host is determined by a topology mapping service defined by the +abstract class org.apache.spark.storage.TopologyMapper, which can be configured by +this property. A default implementation that assumes all hosts are in the same rack is provided +by org.apache.spark.storage.DefaultTopologyMapper. A file-based implementation is +provided by org.apache.spark.storage.FileBasedTopologyMapper, which reads the +topology information from the file org.apache.spark.storage.topologyFile. Each line +of this file is of the format of host1 = /rack1 and provides a mapping from a host +name to its rack information. Note: This configuration only takes effect when +spark.storage.replication.policy is set to a a policy that takes the topology --- End diff -- nit: double `a` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17519: [SPARK-15352][Doc] follow-up: add configuration d...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17519#discussion_r123424493 --- Diff: docs/configuration.md --- @@ -1004,14 +1004,48 @@ Apart from these, the following properties are also available, and may be useful - spark.storage.replication.proactive + spark.storage.replication.proactive false Enables proactive block replication for RDD blocks. Cached RDD block replicas lost due to executor failures are replenished if there are any existing available replicas. This tries to get the replication level of the block to the initial number. + + spark.storage.replication.policy + +org.apache.spark.storage.RandomBlockReplicationPolicy + + +The policy to use for choosing peers when replicating blocks. The default policy would randomly +choose the peers to replicate to. A more resilient replication policy is provided by +org.apache.spark.storage.BasicBlockReplicationPolicy, which makes use of the +topology information of the hosts to choose the peers, much like the HDFS blocks replication +strategy: it would try to choose the first replica within the same rack, and a third replica on +a different rack. See spark.storage.replication.topologyMapper below for how to +provide the topology information for the hosts. + + + + spark.storage.replication.topologyMapper + +org.apache.spark.storage.DefaultTopologyMapper + + +The topology information of a host is determined by a topology mapping service defined by the +abstract class org.apache.spark.storage.TopologyMapper, which can be configured by +this property. A default implementation that assumes all hosts are in the same rack is provided +by org.apache.spark.storage.DefaultTopologyMapper. A file-based implementation is +provided by org.apache.spark.storage.FileBasedTopologyMapper, which reads the +topology information from the file org.apache.spark.storage.topologyFile. Each line --- End diff -- shall we also add an entry for `org.apache.spark.storage.topologyFile`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18382 **[Test build #78428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78428/testReport)** for PR 18382 at commit [`9eaa6d6`](https://github.com/apache/spark/commit/9eaa6d689be820eeb02170ad2191b9547af3aa15). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18382 Thank you @felixcheung. I checked docs and the same thing in the PR description. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18162: [SPARK-20923] turn tracking of TaskMetrics._updat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18162#discussion_r123424121 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -295,4 +295,12 @@ package object config { "above this threshold. This is to avoid a giant request takes too much memory.") .bytesConf(ByteUnit.BYTE) .createWithDefaultString("200m") + + private[spark] val TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES = +ConfigBuilder("spark.taskMetrics.trackUpdatedBlockStatuses") --- End diff -- you can document it in `configuration.md`. Actually can we turn it off by default? I think this is feature is useless for most of users. cc @JoshRosen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18378 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78427/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18378 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18378 **[Test build #78427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78427/testReport)** for PR 18378 at commit [`36dc5e7`](https://github.com/apache/spark/commit/36dc5e7df4549270e66b33d4d171898e8b21faae). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18162: [SPARK-20923] turn tracking of TaskMetrics._updat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18162#discussion_r123424029 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala --- @@ -528,7 +528,13 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging { new StageUIData }) val taskData = stageData.taskData.get(taskId) - val metrics = TaskMetrics.fromAccumulatorInfos(accumUpdates) + val accumsFiltered = if (conf.get(TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES)) { +accumUpdates + } else { +accumUpdates.filter(info => info.name.isDefined && info.update.isDefined && info.name != --- End diff -- that also works --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18377: [SPARK-18016][SQL][CATALYST][BRANCH-2.2] Code Generation...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18377 thanks, merging to 2.2! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18354: [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18354 thanks, merging to 2.1! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18362: [SPARK-20832][Core] Standalone master should expl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18362#discussion_r123423356 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -454,6 +464,12 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp }(ThreadUtils.sameThread) } + protected def removeWorker(workerId: String, host: String, message: String): Unit = { +driverEndpoint.ask[Boolean](RemoveWorker(workerId, host, message)).onFailure { case t => +logError(t.getMessage, t) --- End diff -- style nit: `case t => ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18362: [SPARK-20832][Core] Standalone master should expl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18362#discussion_r123423287 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -569,6 +569,12 @@ private[spark] class TaskSchedulerImpl private[scheduler]( } } + override def workerRemoved(workerId: String, host: String, message: String): Unit = { +logInfo(s"Handle removed worker $workerId: $message") +dagScheduler.workerRemoved(workerId, host, message) +backend.reviveOffers() --- End diff -- why this line? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18362: [SPARK-20832][Core] Standalone master should expl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18362#discussion_r123423207 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1432,6 +1439,26 @@ class DAGScheduler( } } + /** + * Responds to a worker being removed. This is called inside the event loop, so it assumes it can + * modify the scheduler's internal state. Use workerRemoved() to post a loss event from outside. + * + * We will assume that we've lost all shuffle blocks associated with the host if a worker is --- End diff -- what if the worker is replaced? https://github.com/apache/spark/pull/18362/files#diff-29dffdccd5a7f4c8b496c293e87c8668R759 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18384: [SPARK-21170] [CORE] Utils.tryWithSafeFinallyAndFailureC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18384 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18378 **[Test build #78427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78427/testReport)** for PR 18378 at commit [`36dc5e7`](https://github.com/apache/spark/commit/36dc5e7df4549270e66b33d4d171898e8b21faae). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18362: [SPARK-20832][Core] Standalone master should expl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18362#discussion_r123423005 --- Diff: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala --- @@ -158,6 +158,8 @@ private[deploy] object DeployMessages { case class ApplicationRemoved(message: String) + case class WorkerRemoved(id: String, host: String, message: String) --- End diff -- do we still need the `ExecutorUpdated.workerLost`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18384: [SPARK-21170] [CORE] Utils.tryWithSafeFinallyAndF...
GitHub user devaraj-kavali opened a pull request: https://github.com/apache/spark/pull/18384 [SPARK-21170] [CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted ## What changes were proposed in this pull request? Not adding the exception to the suppressed if it is the same instance as originalThrowable. ## How was this patch tested? Added new tests to verify this, these tests fail without source code changes and passes with the change. You can merge this pull request into a Git repository by running: $ git pull https://github.com/devaraj-kavali/spark SPARK-21170 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18384.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18384 commit d3f8846dbbccb52318d1f71cd4e5ecd352ae738e Author: Devaraj KDate: 2017-06-22T05:04:08Z [SPARK-21170] [CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18118: [SPARK-20199][ML] : Provided featureSubsetStrateg...
Github user pralabhkumar commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123422892 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -192,6 +196,9 @@ object GBTClassifier extends DefaultParamsReadable[GBTClassifier] { @Since("2.0.0") override def load(path: String): GBTClassifier = super.load(path) + + final val supportedFeatureSubsetStrategies: Array[String] = --- End diff -- @sethah Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15417: [SPARK-17851][SQL][TESTS] Make sure all test sqls...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15417#discussion_r12342 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisTest.scala --- @@ -55,6 +55,14 @@ trait AnalysisTest extends PlanTest { comparePlans(actualPlan, expectedPlan) } + protected override def comparePlans( + plan1: LogicalPlan, + plan2: LogicalPlan, + checkAnalysis: Boolean = false): Unit = { +// Analysis tests may have not been fully resolved, so skip checkAnalysis. --- End diff -- this comment explains why we set the default value of `checkAnalysis` to false, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18378 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78426/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18378 **[Test build #78426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78426/testReport)** for PR 18378 at commit [`36f9cb6`](https://github.com/apache/spark/commit/36f9cb63f21600db4a95ce05d370a72245649100). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18378 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18378: [SPARK-21163][SQL] DataFrame.toPandas should respect the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18378 **[Test build #78426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78426/testReport)** for PR 18378 at commit [`36f9cb6`](https://github.com/apache/spark/commit/36f9cb63f21600db4a95ce05d370a72245649100). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18118 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18118 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78420/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18382: [SPARK-21149][R] Add job description API for R
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18382#discussion_r123420890 --- Diff: R/pkg/R/context.R --- @@ -295,6 +295,23 @@ setCheckpointDirSC <- function(sc, dirName) { invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName } +#' Set a human readable description of the current job. +#' +#' Set a description that is shown as a job description in UI. +#' +#' @rdname setJobDescription +#' @param value The job description of the current job. +#' @export +#' @examples +#'\dontrun{ +#' setJobDescription("This is an example job.") +#'} +#' @note setJobDescription since 2.3.0 +setJobDescription <- function(value) { --- End diff -- and this should probably go to https://github.com/apache/spark/blob/422aa67d1bb84f913b06e6d94615adb6557e2870/R/pkg/R/sparkR.R#L536 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18382: [SPARK-21149][R] Add job description API for R
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18382#discussion_r123420666 --- Diff: R/pkg/NAMESPACE --- @@ -403,6 +403,7 @@ export("as.DataFrame", "refreshTable", "setCheckpointDir", "setCurrentDatabase", + "setJobDescription", --- End diff -- I think this should go to https://github.com/HyukjinKwon/spark/blob/c591e63d5937403bb462bdf76795af8e3f324aca/R/pkg/NAMESPACE#L78 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...
Github user eatoncys commented on a diff in the pull request: https://github.com/apache/spark/pull/18322#discussion_r123420930 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -258,23 +256,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S if (mainClass == null && SparkSubmit.isUserJar(primaryResource)) { SparkSubmit.printErrorAndExit("No main class set in JAR; please specify one with --class") } -if (driverMemory != null -&& Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) <= 0) { - SparkSubmit.printErrorAndExit("Driver Memory must be a positive number") -} -if (executorMemory != null -&& Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) <= 0) { - SparkSubmit.printErrorAndExit("Executor Memory cores must be a positive number") -} -if (executorCores != null && Try(executorCores.toInt).getOrElse(-1) <= 0) { - SparkSubmit.printErrorAndExit("Executor cores must be a positive number") -} -if (totalExecutorCores != null && Try(totalExecutorCores.toInt).getOrElse(-1) <= 0) { - SparkSubmit.printErrorAndExit("Total executor cores must be a positive number") -} -if (numExecutors != null && Try(numExecutors.toInt).getOrElse(-1) <= 0) { - SparkSubmit.printErrorAndExit("Number of executors must be a positive number") -} --- End diff -- @jerryshao Ok ,I have moved them back, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18118 **[Test build #78420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78420/testReport)** for PR 18118 at commit [`7970293`](https://github.com/apache/spark/commit/79702933d321051222073057b25305831df84c6d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...
Github user eatoncys commented on a diff in the pull request: https://github.com/apache/spark/pull/18322#discussion_r123420644 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala --- @@ -543,6 +545,42 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria } } +if (contains("spark.driver.memory")) { + val driverMemory = get("spark.driver.memory") + if (Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) <= 0) { +throw new IllegalArgumentException(s"spark.driver.memory " + + s"(was ${driverMemory}) can only be a positive number") + } +} +if (contains("spark.executor.memory")) { + val executorMemory = get("spark.executor.memory") + if (Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) <= 0) { +throw new IllegalArgumentException(s"spark.executor.memory " + + s"(was ${executorMemory}) can only be a positive number") + } --- End diff -- @jerryshao Ok ,I have moved them back, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18300: [SPARK-21043][SQL] Add unionByName in Dataset
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18300 **[Test build #78425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78425/testReport)** for PR 18300 at commit [`b17a14e`](https://github.com/apache/spark/commit/b17a14e3d8c752df09c6dbd32681a6352b185116). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18347: [SPARK-20599][SS] ConsoleSink should work with (batch)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18347 **[Test build #78424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78424/testReport)** for PR 18347 at commit [`5851d22`](https://github.com/apache/spark/commit/5851d226fd1531447882b9dab62728e5ce36e486). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18300: [SPARK-21043][SQL] Add unionByName in Dataset
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/18300#discussion_r123420282 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -1764,6 +1765,70 @@ class Dataset[T] private[sql]( } /** + * Returns a new Dataset containing union of rows in this Dataset and another Dataset. + * + * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set + * union (that does deduplication of elements), use this function followed by a [[distinct]]. + * + * The difference between this function and [[union]] is that this function + * resolves columns by name (not by position): + * + * {{{ + * val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2") + * val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0") + * df1.unionByName(df2).show + * + * // output: + * // ++++ + * // |col0|col1|col2| + * // ++++ + * // | 1| 2| 3| + * // | 6| 4| 5| + * // ++++ + * }}} + * + * @group typedrel + * @since 2.3.0 + */ + def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator { +// Resolves children first to reorder output attributes in `other` by name +val leftPlan = sparkSession.sessionState.executePlan(logicalPlan) +val rightPlan = sparkSession.sessionState.executePlan(other.logicalPlan) --- End diff -- yea, it seems we needn't. removed. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...
Github user eatoncys commented on a diff in the pull request: https://github.com/apache/spark/pull/18322#discussion_r123420083 --- Diff: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala --- @@ -704,6 +707,43 @@ class MasterSuite extends SparkFunSuite private def getState(master: Master): RecoveryState.Value = { master.invokePrivate(_state()) } + + test("Total cores is not divisible by cores per executor") { --- End diff -- @jerryshao Ok, I have removed them out, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18320: [SPARK-21093][R] Terminate R's worker processes i...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18320#discussion_r123417186 --- Diff: R/pkg/inst/worker/daemon.R --- @@ -47,9 +79,11 @@ while (TRUE) { close(inputCon) --- End diff -- please add comment here to say "# reach here because this is a child process" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18320: [SPARK-21093][R] Terminate R's worker processes i...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18320#discussion_r123416154 --- Diff: R/pkg/inst/worker/daemon.R --- @@ -30,8 +30,40 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT")) inputCon <- socketConnection( port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout) +# Waits indefinitely for a socket connecion by default. +selectTimeout <- NULL + while (TRUE) { - ready <- socketSelect(list(inputCon)) + ready <- socketSelect(list(inputCon), timeout = selectTimeout) --- End diff -- hmm, this is pre-existing behavior, but generally I think waiting indefinitely for anything would be rather dangerous. probably would be good to follow up separately --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18320: [SPARK-21093][R] Terminate R's worker processes i...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18320#discussion_r123416771 --- Diff: R/pkg/inst/worker/daemon.R --- @@ -30,8 +30,40 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT")) inputCon <- socketConnection( port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout) +# Waits indefinitely for a socket connecion by default. +selectTimeout <- NULL + while (TRUE) { - ready <- socketSelect(list(inputCon)) + ready <- socketSelect(list(inputCon), timeout = selectTimeout) + + # Note that the children should be terminated in the parent. If each child terminates + # itself, it appears that the resource is not released properly, that causes an unexpected + # termination of this daemon due to, for example, running out of file descriptors + # (see SPARK-21093). Therefore, the current implementation tries to retrieve children + # that are exited (but not terminated) and then sends a kill signal to terminate them properly + # in the parent. + # + # There are two paths that it attempts to send a signal to terminate the children in the parent. + # + # 1. Every second if any socket connection is not available and if there are child workers + # running. + # 2. Right after a socket connection is available. + # + # In other words, the parent attempts to send the signal to the children every second if + # any worker is running or right before launching other worker children from the following + # new socket connection. + + # Only the process IDs of exited children are returned and the termination is attempted below. + children <- parallel:::selectChildren(timeout = 0) + if (is.integer(children)) { +# If it is PIDs, there are workers exited but not terminated. Attempts to terminate them --- End diff -- If I understanding correctly, children (from `parallel:::selectChildren()`) as integer only indicates the list of fork/child process. But it does not indicate the child process "exited"? When we are in a loop checking every second (ie. selectTimeout == 1), it sounds to me like socketSelect could return in one of the following ways: - socket available for reading (ready == TRUE) - socket not available for reading, it's dead or exiting (ready == FALSE) - socket not available for reading, but it's not ready to connect because say things are running slow or the host is busy etc, and it has effectively been just timed out since selectTimeout (ready == FALSE) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18320: [SPARK-21093][R] Terminate R's worker processes i...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18320#discussion_r123419616 --- Diff: R/pkg/inst/worker/daemon.R --- @@ -30,8 +30,40 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT")) inputCon <- socketConnection( port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout) +# Waits indefinitely for a socket connecion by default. +selectTimeout <- NULL + while (TRUE) { - ready <- socketSelect(list(inputCon)) + ready <- socketSelect(list(inputCon), timeout = selectTimeout) + + # Note that the children should be terminated in the parent. If each child terminates + # itself, it appears that the resource is not released properly, that causes an unexpected + # termination of this daemon due to, for example, running out of file descriptors + # (see SPARK-21093). Therefore, the current implementation tries to retrieve children + # that are exited (but not terminated) and then sends a kill signal to terminate them properly + # in the parent. + # + # There are two paths that it attempts to send a signal to terminate the children in the parent. + # + # 1. Every second if any socket connection is not available and if there are child workers + # running. + # 2. Right after a socket connection is available. + # + # In other words, the parent attempts to send the signal to the children every second if + # any worker is running or right before launching other worker children from the following + # new socket connection. + + # Only the process IDs of exited children are returned and the termination is attempted below. + children <- parallel:::selectChildren(timeout = 0) + if (is.integer(children)) { +# If it is PIDs, there are workers exited but not terminated. Attempts to terminate them --- End diff -- right, I see your reference here https://github.com/apache/spark/pull/18320#discussion_r122639738 but I'm not 100% getting it when looking at the source code https://github.com/s-u/multicore/blob/e9d9bf21e6cf08e24cfe54d762379b4fa923765b/src/fork.c#L361 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18114#discussion_r123419433 --- Diff: R/pkg/R/functions.R --- @@ -2414,20 +2396,23 @@ setMethod("from_json", signature(x = "Column", schema = "structType"), column(jc) }) -#' from_utc_timestamp -#' -#' Given a timestamp, which corresponds to a certain time of day in UTC, returns another timestamp -#' that corresponds to the same time of day in the given timezone. +#' @details +#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a certain time of day in UTC, +#' returns another timestamp that corresponds to the same time of day in the given timezone. #' -#' @param y Column to compute on. -#' @param x time zone to use. +#' @rdname column_datetime_diff_functions #' -#' @family date time functions -#' @rdname from_utc_timestamp -#' @name from_utc_timestamp -#' @aliases from_utc_timestamp,Column,character-method +#' @aliases from_utc_timestamp from_utc_timestamp,Column,character-method #' @export -#' @examples \dontrun{from_utc_timestamp(df$t, 'PST')} +#' @examples +#' +#' \dontrun{ +#' tmp <- mutate(df, from_utc = from_utc_timestamp(df$time, 'PST'), +#' to_utc = to_utc_timestamp(df$time, 'PST'), +#' to_unix = unix_timestamp(df$time), +#' to_unix2 = unix_timestamp(df$time, '-MM-dd HH'), +#' from_unix = from_unixtime(unix_timestamp(df$time))) --- End diff -- Actually, what I found was the documentation for `unix_timestamp` and `from_unixtime` was in `column_datetime_functions` but the examples in `column_datetime_diff_functions`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18114#discussion_r123419080 --- Diff: R/pkg/R/functions.R --- @@ -2414,20 +2396,23 @@ setMethod("from_json", signature(x = "Column", schema = "structType"), column(jc) }) -#' from_utc_timestamp -#' -#' Given a timestamp, which corresponds to a certain time of day in UTC, returns another timestamp -#' that corresponds to the same time of day in the given timezone. +#' @details +#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a certain time of day in UTC, +#' returns another timestamp that corresponds to the same time of day in the given timezone. #' -#' @param y Column to compute on. -#' @param x time zone to use. +#' @rdname column_datetime_diff_functions #' -#' @family date time functions -#' @rdname from_utc_timestamp -#' @name from_utc_timestamp -#' @aliases from_utc_timestamp,Column,character-method +#' @aliases from_utc_timestamp from_utc_timestamp,Column,character-method #' @export -#' @examples \dontrun{from_utc_timestamp(df$t, 'PST')} +#' @examples +#' +#' \dontrun{ +#' tmp <- mutate(df, from_utc = from_utc_timestamp(df$time, 'PST'), +#' to_utc = to_utc_timestamp(df$time, 'PST'), +#' to_unix = unix_timestamp(df$time), +#' to_unix2 = unix_timestamp(df$time, '-MM-dd HH'), +#' from_unix = from_unixtime(unix_timestamp(df$time))) --- End diff -- this I don't get - why `from_unixtime` belongs to `column_datetime_diff_functions`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18329: [SPARK-19909][SS] Disabling the usage of a tempor...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18329#discussion_r123418793 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala --- @@ -264,12 +281,12 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) { df, sink, outputMode, -useTempCheckpointLocation = true, +useTempCheckpointLocation = isTempCheckpointLocationAvailable, trigger = trigger) } else { val (useTempCheckpointLocation, recoverFromCheckpointLocation) = if (source == "console") { - (true, false) + (isTempCheckpointLocationAvailable, false) --- End diff -- @mgaido91 AFAIK whether `useTempCheckpointLocation` is `true` or `false` is based on the type of `Sink`, here with your change, now the semantics are changing to wether `tmpFs` equals `defaultFs` or `defaultFs` is local FS. So looks like the semantics are different now. I'm not if it is a valid fix. @zsxwing would you please help to review this patch? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18327: [SPARK-21047] Add test suites for complicated cases in C...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18327 **[Test build #78423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78423/testReport)** for PR 18327 at commit [`9319a2f`](https://github.com/apache/spark/commit/9319a2f73cca767b7ec3e4abb1d213bbe535d122). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18114#discussion_r123416302 --- Diff: R/pkg/R/functions.R --- @@ -2414,20 +2396,23 @@ setMethod("from_json", signature(x = "Column", schema = "structType"), column(jc) }) -#' from_utc_timestamp -#' -#' Given a timestamp, which corresponds to a certain time of day in UTC, returns another timestamp -#' that corresponds to the same time of day in the given timezone. +#' @details +#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a certain time of day in UTC, +#' returns another timestamp that corresponds to the same time of day in the given timezone. #' -#' @param y Column to compute on. -#' @param x time zone to use. +#' @rdname column_datetime_diff_functions #' -#' @family date time functions -#' @rdname from_utc_timestamp -#' @name from_utc_timestamp -#' @aliases from_utc_timestamp,Column,character-method +#' @aliases from_utc_timestamp from_utc_timestamp,Column,character-method #' @export -#' @examples \dontrun{from_utc_timestamp(df$t, 'PST')} +#' @examples +#' +#' \dontrun{ +#' tmp <- mutate(df, from_utc = from_utc_timestamp(df$time, 'PST'), +#' to_utc = to_utc_timestamp(df$time, 'PST'), +#' to_unix = unix_timestamp(df$time), +#' to_unix2 = unix_timestamp(df$time, '-MM-dd HH'), +#' from_unix = from_unixtime(unix_timestamp(df$time))) --- End diff -- It looks this one should go to `column_datetime_functions` or `from_unixtime` should be in `column_datetime_diff_functions`. `unix_timestamp` looks a ditto. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18114#discussion_r123416414 --- Diff: R/pkg/R/functions.R --- @@ -2414,20 +2396,23 @@ setMethod("from_json", signature(x = "Column", schema = "structType"), column(jc) }) -#' from_utc_timestamp -#' -#' Given a timestamp, which corresponds to a certain time of day in UTC, returns another timestamp -#' that corresponds to the same time of day in the given timezone. +#' @details +#' \code{from_utc_timestamp}: Given a timestamp, which corresponds to a certain time of day in UTC, +#' returns another timestamp that corresponds to the same time of day in the given timezone. #' -#' @param y Column to compute on. -#' @param x time zone to use. +#' @rdname column_datetime_diff_functions #' -#' @family date time functions -#' @rdname from_utc_timestamp -#' @name from_utc_timestamp -#' @aliases from_utc_timestamp,Column,character-method +#' @aliases from_utc_timestamp from_utc_timestamp,Column,character-method #' @export -#' @examples \dontrun{from_utc_timestamp(df$t, 'PST')} +#' @examples +#' +#' \dontrun{ +#' tmp <- mutate(df, from_utc = from_utc_timestamp(df$time, 'PST'), +#' to_utc = to_utc_timestamp(df$time, 'PST'), +#' to_unix = unix_timestamp(df$time), +#' to_unix2 = unix_timestamp(df$time, '-MM-dd HH'), +#' from_unix = from_unixtime(unix_timestamp(df$time))) --- End diff -- And ... `from_unixtime(df$t, '/MM/dd HH')` looks missed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18114#discussion_r123417000 --- Diff: R/pkg/R/functions.R --- @@ -34,6 +34,58 @@ NULL #' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))} NULL +#' Date time functions for Column operations +#' +#' Date time functions defined for \code{Column}. +#' +#' @param x Column to compute on. +#' @param format For \code{to_date} and \code{to_timestamp}, it is the string to use to parse +#' x Column to DateType or TimestampType. For \code{trunc}, it is the string used +#' for specifying the truncation method. For example, "year", "", "yy" for +#' truncate by year, or "month", "mon", "mm" for truncate by month. +#' @param ... additional argument(s). +#' @name column_datetime_functions +#' @rdname column_datetime_functions +#' @family data time functions +#' @examples +#' \dontrun{ +#' dts <- c("2005-01-02 18:47:22", +#' "2005-12-24 16:30:58", +#' "2005-10-28 07:30:05", +#' "2005-12-28 07:01:05", +#' "2006-01-24 00:01:10") +#' y <- c(2.0, 2.2, 3.4, 2.5, 1.8) +#' df <- createDataFrame(data.frame(time = as.POSIXct(dts), y = y))} +NULL + +#' Date time arithmetic functions for Column operations +#' +#' Date time arithmetic functions defined for \code{Column}. +#' +#' @param y Column to compute on. +#' @param x For class \code{Column}, it is the column used to perform arithmetic operations +#' with column \code{y}.For class \code{numeric}, it is the number of months or +#' days to be added to or subtracted from \code{y}. For class \code{character}, it is +#' \itemize{ +#' \item \code{date_format}: date format specification. +#' \item \code{from_utc_timestamp, to_utc_timestamp}: time zone to use. --- End diff -- little nit `\code{from_utc_timestamp}, \code{to_utc_timestamp}` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18114: [SPARK-20889][SparkR] Grouped documentation for D...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18114#discussion_r123416202 --- Diff: R/pkg/R/functions.R --- @@ -34,6 +34,58 @@ NULL #' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))} NULL +#' Date time functions for Column operations +#' +#' Date time functions defined for \code{Column}. +#' +#' @param x Column to compute on. +#' @param format For \code{to_date} and \code{to_timestamp}, it is the string to use to parse +#' x Column to DateType or TimestampType. For \code{trunc}, it is the string used +#' for specifying the truncation method. For example, "year", "", "yy" for +#' truncate by year, or "month", "mon", "mm" for truncate by month. +#' @param ... additional argument(s). +#' @name column_datetime_functions +#' @rdname column_datetime_functions +#' @family data time functions +#' @examples +#' \dontrun{ +#' dts <- c("2005-01-02 18:47:22", +#' "2005-12-24 16:30:58", +#' "2005-10-28 07:30:05", +#' "2005-12-28 07:01:05", +#' "2006-01-24 00:01:10") +#' y <- c(2.0, 2.2, 3.4, 2.5, 1.8) +#' df <- createDataFrame(data.frame(time = as.POSIXct(dts), y = y))} +NULL + +#' Date time arithmetic functions for Column operations +#' +#' Date time arithmetic functions defined for \code{Column}. +#' +#' @param y Column to compute on. +#' @param x For class \code{Column}, it is the column used to perform arithmetic operations +#' with column \code{y}.For class \code{numeric}, it is the number of months or --- End diff -- little nit `.F` ->`. F` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18329: [SPARK-19909][SS] Disabling the usage of a tempor...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18329#discussion_r123417606 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala --- @@ -235,6 +237,21 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) { "write files of Hive data source directly.") } +val hadoopConf = df.sparkSession.sessionState.newHadoopConf() +val defaultFS = FileSystem.getDefaultUri(hadoopConf).getScheme +val tmpFS = new URI(System.getProperty("java.io.tmpdir")).getScheme + +val isTempCheckpointLocationAvailable = tmpFS match { + case null | "file" => +if (defaultFS == null || defaultFS.equals("file")) { + true +} else { + false +} + case defaultFS => true --- End diff -- @mgaido91 , if defaultFS is HDFS, in which scenario `tmpFS` ("java.io.tmpdir") be the same as `defaultFS`? From my understanding, unless we change "java.io.tmpdir" property, then `tmpFs` will always be local FS. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18329: [SPARK-19909][SS] Disabling the usage of a tempor...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18329#discussion_r123416837 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala --- @@ -17,8 +17,10 @@ package org.apache.spark.sql.streaming +import java.net.URI import java.util.Locale +import org.apache.hadoop.fs.FileSystem --- End diff -- The import ordering is not correct, third-party packages should be put under Scala packages. You can locally verify the style. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18300: [SPARK-21043][SQL] Add unionByName in Dataset
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/18300#discussion_r123416779 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -1764,6 +1765,70 @@ class Dataset[T] private[sql]( } /** + * Returns a new Dataset containing union of rows in this Dataset and another Dataset. + * + * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set + * union (that does deduplication of elements), use this function followed by a [[distinct]]. + * + * The difference between this function and [[union]] is that this function + * resolves columns by name (not by position): + * + * {{{ + * val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2") + * val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0") + * df1.unionByName(df2).show + * + * // output: + * // ++++ + * // |col0|col1|col2| + * // ++++ + * // | 1| 2| 3| + * // | 6| 4| 5| + * // ++++ + * }}} + * + * @group typedrel + * @since 2.3.0 + */ + def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator { +// Resolves children first to reorder output attributes in `other` by name +val leftPlan = sparkSession.sessionState.executePlan(logicalPlan) +val rightPlan = sparkSession.sessionState.executePlan(other.logicalPlan) +leftPlan.assertAnalyzed() +rightPlan.assertAnalyzed() + +// Check column name duplication +val resolver = sparkSession.sessionState.analyzer.resolver +val leftOutputAttrs = leftPlan.analyzed.output +val rightOutputAttrs = rightPlan.analyzed.output +// SchemaUtils.checkColumnNameDuplication( +// leftOutputAttrs.map(_.name), +// "in the left attributes", +// sparkSession.sessionState.conf.caseSensitiveAnalysis) +// SchemaUtils.checkColumnNameDuplication( +// rightOutputAttrs.map(_.name), +// "in the right attributes", +// sparkSession.sessionState.conf.caseSensitiveAnalysis) --- End diff -- The function to check name duplication is discussed in #17758. I'm planning to use the func to check the duplication and then do union-by. See the discussion: https://github.com/apache/spark/pull/18300#discussion_r122116812 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17713: [SPARK-20417][SQL] Move subquery error handling to check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17713 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78418/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18228: [SPARK-21007][SQL]Add SQL function - RIGHT && LEFT
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18228 **[Test build #78422 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78422/testReport)** for PR 18228 at commit [`fca984d`](https://github.com/apache/spark/commit/fca984d762577f61561c5da5c983f61b22e7757a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17713: [SPARK-20417][SQL] Move subquery error handling to check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17713 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17713: [SPARK-20417][SQL] Move subquery error handling to check...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17713 **[Test build #78418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78418/testReport)** for PR 17713 at commit [`c2a7555`](https://github.com/apache/spark/commit/c2a7555703c254f524645c0136dd10c2827e1d83). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18114: [SPARK-20889][SparkR] Grouped documentation for DATETIME...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18114 @felixcheung, would you give me a moment to double check? I am interested in this and want to help double check. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18382 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78421/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18382 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18382 **[Test build #78421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78421/testReport)** for PR 18382 at commit [`c591e63`](https://github.com/apache/spark/commit/c591e63d5937403bb462bdf76795af8e3f324aca). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18383: [SPARK-21167][SS] Set kafka clientId while fetch message...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18383 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18322#discussion_r123415327 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala --- @@ -543,6 +545,42 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria } } +if (contains("spark.driver.memory")) { + val driverMemory = get("spark.driver.memory") + if (Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) <= 0) { +throw new IllegalArgumentException(s"spark.driver.memory " + + s"(was ${driverMemory}) can only be a positive number") + } +} +if (contains("spark.executor.memory")) { + val executorMemory = get("spark.executor.memory") + if (Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) <= 0) { +throw new IllegalArgumentException(s"spark.executor.memory " + + s"(was ${executorMemory}) can only be a positive number") + } --- End diff -- This above two checks seems unnecessary, let's not change unrelated code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18322#discussion_r123415688 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -258,23 +256,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S if (mainClass == null && SparkSubmit.isUserJar(primaryResource)) { SparkSubmit.printErrorAndExit("No main class set in JAR; please specify one with --class") } -if (driverMemory != null -&& Try(JavaUtils.byteStringAsBytes(driverMemory)).getOrElse(-1L) <= 0) { - SparkSubmit.printErrorAndExit("Driver Memory must be a positive number") -} -if (executorMemory != null -&& Try(JavaUtils.byteStringAsBytes(executorMemory)).getOrElse(-1L) <= 0) { - SparkSubmit.printErrorAndExit("Executor Memory cores must be a positive number") -} -if (executorCores != null && Try(executorCores.toInt).getOrElse(-1) <= 0) { - SparkSubmit.printErrorAndExit("Executor cores must be a positive number") -} -if (totalExecutorCores != null && Try(totalExecutorCores.toInt).getOrElse(-1) <= 0) { - SparkSubmit.printErrorAndExit("Total executor cores must be a positive number") -} -if (numExecutors != null && Try(numExecutors.toInt).getOrElse(-1) <= 0) { - SparkSubmit.printErrorAndExit("Number of executors must be a positive number") -} --- End diff -- The above changes are valid and useful, I'd suggest to not change it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18128 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18128 merged to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18383: [SPARK-21167][SS] Set kafka clientId while fetch ...
GitHub user dijingran opened a pull request: https://github.com/apache/spark/pull/18383 [SPARK-21167][SS] Set kafka clientId while fetch messages ## What changes were proposed in this pull request? Change KafkaRDD to set kafka clientId while fetch messages, as in our case ,we use clientId at kafka server side. ## How was this patch tested? All tests still passed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dijingran/spark branch-2.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18383.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18383 commit 917eda2d98ea584e2a64fcc3b942a23b206c290f Author: dixingxingDate: 2017-06-22T03:29:52Z Set kafka clientId while fatch messages --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18114: [SPARK-20889][SparkR] Grouped documentation for DATETIME...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18114 hmm, waiting for AppVeyor --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18322: [SPARK-21115][Core]If the cores left is less than...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18322#discussion_r123413523 --- Diff: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala --- @@ -704,6 +707,43 @@ class MasterSuite extends SparkFunSuite private def getState(master: Master): RecoveryState.Value = { master.invokePrivate(_state()) } + + test("Total cores is not divisible by cores per executor") { --- End diff -- I see, if there's no better way to verify it, I think it is not useful to add this two UTs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18382: [SPARK-21149][R] Add job description API for R
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18382 **[Test build #78421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78421/testReport)** for PR 18382 at commit [`c591e63`](https://github.com/apache/spark/commit/c591e63d5937403bb462bdf76795af8e3f324aca). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18382: [SPARK-21149][R] Add job description API for R
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18382#discussion_r123412816 --- Diff: R/pkg/R/context.R --- @@ -295,6 +295,23 @@ setCheckpointDirSC <- function(sc, dirName) { invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName } +#' Set a human readable description of the current job. +#' +#' Set a description that is shown as a job description in UI. +#' +#' @rdname setJobDescription +#' @param value The job description of the current job. +#' @export +#' @examples +#'\dontrun{ +#' setJobDescription("This is an example job.") +#'} +#' @note setJobDescription since 2.3.0 --- End diff -- Just for sure, ![2017-06-22 12 06 27](https://user-images.githubusercontent.com/6477701/27415694-54b2889c-5743-11e7-89c5-2d7e82e0babe.png) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18362: [SPARK-20832][Core] Standalone master should explicitly ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18362 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78415/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18327: [SPARK-21047] Add test suites for complicated cases in C...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18327 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78416/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18362: [SPARK-20832][Core] Standalone master should explicitly ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18362 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18327: [SPARK-21047] Add test suites for complicated cases in C...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18327 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18362: [SPARK-20832][Core] Standalone master should explicitly ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18362 **[Test build #78415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78415/testReport)** for PR 18362 at commit [`fb72fb6`](https://github.com/apache/spark/commit/fb72fb6ea8dc02753d8c699f985afc70d1ccd971). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18327: [SPARK-21047] Add test suites for complicated cases in C...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18327 **[Test build #78416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78416/testReport)** for PR 18327 at commit [`78102f3`](https://github.com/apache/spark/commit/78102f37c884b765596a7308563239acf45ed9b0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18300: [SPARK-21043][SQL] Add unionByName in Dataset
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18300#discussion_r123412488 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -1764,6 +1765,70 @@ class Dataset[T] private[sql]( } /** + * Returns a new Dataset containing union of rows in this Dataset and another Dataset. + * + * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set + * union (that does deduplication of elements), use this function followed by a [[distinct]]. + * + * The difference between this function and [[union]] is that this function + * resolves columns by name (not by position): + * + * {{{ + * val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2") + * val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0") + * df1.unionByName(df2).show + * + * // output: + * // ++++ + * // |col0|col1|col2| + * // ++++ + * // | 1| 2| 3| + * // | 6| 4| 5| + * // ++++ + * }}} + * + * @group typedrel + * @since 2.3.0 + */ + def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator { +// Resolves children first to reorder output attributes in `other` by name +val leftPlan = sparkSession.sessionState.executePlan(logicalPlan) +val rightPlan = sparkSession.sessionState.executePlan(other.logicalPlan) +leftPlan.assertAnalyzed() +rightPlan.assertAnalyzed() + +// Check column name duplication +val resolver = sparkSession.sessionState.analyzer.resolver +val leftOutputAttrs = leftPlan.analyzed.output +val rightOutputAttrs = rightPlan.analyzed.output +// SchemaUtils.checkColumnNameDuplication( +// leftOutputAttrs.map(_.name), +// "in the left attributes", +// sparkSession.sessionState.conf.caseSensitiveAnalysis) +// SchemaUtils.checkColumnNameDuplication( +// rightOutputAttrs.map(_.name), +// "in the right attributes", +// sparkSession.sessionState.conf.caseSensitiveAnalysis) --- End diff -- Why above? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123408655 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala --- @@ -1186,3 +1186,51 @@ case class BRound(child: Expression, scale: Expression) with Serializable with ImplicitCastInputTypes { def this(child: Expression) = this(child, Literal(0)) } + +/** + * The bucket number into which --- End diff -- nit: Returns the bucket number --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123409892 --- Diff: sql/core/src/test/resources/sql-tests/inputs/operators.sql --- @@ -92,3 +92,8 @@ select abs(-3.13), abs('-2.19'); -- positive/negative select positive('-1.11'), positive(-1.11), negative('-1.11'), negative(-1.11); + +-- width_bucket --- End diff -- Instead, we need to add a test for sql queries using this function. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123401825 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -631,3 +631,109 @@ abstract class TernaryExpression extends Expression { } } } + +/** + * An expression with four inputs and one output. The output is by default evaluated to null + * if any input is evaluated to null. + */ +abstract class QuaternaryExpression extends Expression { + + override def foldable: Boolean = children.forall(_.foldable) + + override def nullable: Boolean = children.exists(_.nullable) + + /** + * Default behavior of evaluation according to the default nullability of TernaryExpression. + * If subclass of TernaryExpression override nullable, probably should also override this. + */ + override def eval(input: InternalRow): Any = { +val exprs = children +val value1 = exprs(0).eval(input) +if (value1 != null) { + val value2 = exprs(1).eval(input) + if (value2 != null) { +val value3 = exprs(2).eval(input) +if (value3 != null) { + val value4 = exprs(3).eval(input) + if (value4 != null) { +return nullSafeEval(value1, value2, value3, value4) + } +} + } +} +null + } + + /** + * Called by default [[eval]] implementation. If subclass of TernaryExpression keep the default + * nullability, they can override this method to save null-check code. If we need full control + * of evaluation process, we should override [[eval]]. + */ + protected def nullSafeEval(input1: Any, input2: Any, input3: Any, input4: Any): Any = +sys.error(s"TernaryExpressions must override either eval or nullSafeEval") + + /** + * Short hand for generating ternary evaluation code. + * If either of the sub-expressions is null, the result of this computation + * is assumed to be null. + * + * @param f accepts three variable names and returns Java code to compute the output. + */ + protected def defineCodeGen( +ctx: CodegenContext, +ev: ExprCode, +f: (String, String, String, String) => String): ExprCode = { +nullSafeCodeGen(ctx, ev, (eval1, eval2, eval3, eval4) => { + s"${ev.value} = ${f(eval1, eval2, eval3, eval3)};" --- End diff -- the last `eval3` -> `eval4` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123410671 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.sql.AnalysisException + +object MathUtils { + + /** + * Returns the bucket number into which + * the value of this expression would fall after being evaluated. + * + * @param expr id the expression for which the histogram is being created + * @param minValue is an expression that resolves + * to the minimum end point of the acceptable range for expr + * @param maxValue is an expression that resolves + * to the maximum end point of the acceptable range for expr + * @param numBucket is an An expression that resolves to + * a constant indicating the number of buckets + * @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by + * the range [minValue, maxValue] + */ + def widthBucket(expr: Double, minValue: Double, maxValue: Double, numBucket: Long): Long = { + +if (numBucket <= 0) { + throw new AnalysisException(s"The num of bucket must be greater than 0, but got ${numBucket}") +} + +val lower: Double = Math.min(minValue, maxValue) +val upper: Double = Math.max(minValue, maxValue) + +val preResult: Long = if (expr < lower) { + 0 +} else if (expr >= upper) { + Math.addExact(numBucket, 1) +} else { + (numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong +} + +val result = if (minValue > maxValue) (numBucket - preResult) + 1 else preResult +result --- End diff -- nit: `if (minValue > maxValue) (numBucket - preResult) + 1 else preResult` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123411306 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.sql.AnalysisException + +object MathUtils { + + /** + * Returns the bucket number into which + * the value of this expression would fall after being evaluated. + * + * @param expr id the expression for which the histogram is being created + * @param minValue is an expression that resolves + * to the minimum end point of the acceptable range for expr + * @param maxValue is an expression that resolves + * to the maximum end point of the acceptable range for expr + * @param numBucket is an An expression that resolves to + * a constant indicating the number of buckets + * @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by + * the range [minValue, maxValue] + */ + def widthBucket(expr: Double, minValue: Double, maxValue: Double, numBucket: Long): Long = { + +if (numBucket <= 0) { + throw new AnalysisException(s"The num of bucket must be greater than 0, but got ${numBucket}") +} + +val lower: Double = Math.min(minValue, maxValue) +val upper: Double = Math.max(minValue, maxValue) + +val preResult: Long = if (expr < lower) { + 0 +} else if (expr >= upper) { + Math.addExact(numBucket, 1) +} else { + (numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong --- End diff -- what if upper == lower? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123411090 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.sql.AnalysisException + +object MathUtils { + + /** + * Returns the bucket number into which + * the value of this expression would fall after being evaluated. + * + * @param expr id the expression for which the histogram is being created + * @param minValue is an expression that resolves + * to the minimum end point of the acceptable range for expr + * @param maxValue is an expression that resolves + * to the maximum end point of the acceptable range for expr + * @param numBucket is an An expression that resolves to + * a constant indicating the number of buckets + * @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by + * the range [minValue, maxValue] + */ + def widthBucket(expr: Double, minValue: Double, maxValue: Double, numBucket: Long): Long = { + +if (numBucket <= 0) { + throw new AnalysisException(s"The num of bucket must be greater than 0, but got ${numBucket}") +} + +val lower: Double = Math.min(minValue, maxValue) +val upper: Double = Math.max(minValue, maxValue) + +val preResult: Long = if (expr < lower) { + 0 +} else if (expr >= upper) { + Math.addExact(numBucket, 1) --- End diff -- Do we really need to use `addExact`? In Oracle's doc, `numBucket` is an integer, then we can use `numBucket + 1L` here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123410649 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.sql.AnalysisException + +object MathUtils { + + /** + * Returns the bucket number into which + * the value of this expression would fall after being evaluated. + * + * @param expr id the expression for which the histogram is being created + * @param minValue is an expression that resolves + * to the minimum end point of the acceptable range for expr + * @param maxValue is an expression that resolves + * to the maximum end point of the acceptable range for expr + * @param numBucket is an An expression that resolves to + * a constant indicating the number of buckets + * @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by + * the range [minValue, maxValue] --- End diff -- Both endpoints are included? Can you check with other databases and add a comment here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123402665 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -631,3 +631,109 @@ abstract class TernaryExpression extends Expression { } } } + +/** + * An expression with four inputs and one output. The output is by default evaluated to null + * if any input is evaluated to null. + */ +abstract class QuaternaryExpression extends Expression { + + override def foldable: Boolean = children.forall(_.foldable) + + override def nullable: Boolean = children.exists(_.nullable) + + /** + * Default behavior of evaluation according to the default nullability of TernaryExpression. + * If subclass of TernaryExpression override nullable, probably should also override this. + */ + override def eval(input: InternalRow): Any = { +val exprs = children +val value1 = exprs(0).eval(input) +if (value1 != null) { + val value2 = exprs(1).eval(input) + if (value2 != null) { +val value3 = exprs(2).eval(input) +if (value3 != null) { + val value4 = exprs(3).eval(input) + if (value4 != null) { +return nullSafeEval(value1, value2, value3, value4) + } +} + } +} +null + } + + /** + * Called by default [[eval]] implementation. If subclass of TernaryExpression keep the default + * nullability, they can override this method to save null-check code. If we need full control + * of evaluation process, we should override [[eval]]. + */ + protected def nullSafeEval(input1: Any, input2: Any, input3: Any, input4: Any): Any = +sys.error(s"TernaryExpressions must override either eval or nullSafeEval") + + /** + * Short hand for generating ternary evaluation code. + * If either of the sub-expressions is null, the result of this computation + * is assumed to be null. + * + * @param f accepts three variable names and returns Java code to compute the output. + */ + protected def defineCodeGen( +ctx: CodegenContext, +ev: ExprCode, +f: (String, String, String, String) => String): ExprCode = { +nullSafeCodeGen(ctx, ev, (eval1, eval2, eval3, eval4) => { + s"${ev.value} = ${f(eval1, eval2, eval3, eval3)};" +}) + } + + /** + * Short hand for generating ternary evaluation code. + * If either of the sub-expressions is null, the result of this computation + * is assumed to be null. + * + * @param f function that accepts the 3 non-null evaluation result names of children --- End diff -- 3 -> 4 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123408856 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.sql.AnalysisException + +object MathUtils { + + /** + * Returns the bucket number into which + * the value of this expression would fall after being evaluated. + * + * @param expr id the expression for which the histogram is being created --- End diff -- id -> is --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123409716 --- Diff: sql/core/src/test/resources/sql-tests/inputs/operators.sql --- @@ -92,3 +92,8 @@ select abs(-3.13), abs('-2.19'); -- positive/negative select positive('-1.11'), positive(-1.11), negative('-1.11'), negative(-1.11); + +-- width_bucket --- End diff -- Do we need end-to-end tests here? I think we already cover these cases in other test suites. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123410449 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MathUtils.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.sql.AnalysisException + +object MathUtils { + + /** + * Returns the bucket number into which + * the value of this expression would fall after being evaluated. + * + * @param expr id the expression for which the histogram is being created + * @param minValue is an expression that resolves + * to the minimum end point of the acceptable range for expr + * @param maxValue is an expression that resolves + * to the maximum end point of the acceptable range for expr + * @param numBucket is an An expression that resolves to + * a constant indicating the number of buckets + * @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by + * the range [minValue, maxValue] + */ + def widthBucket(expr: Double, minValue: Double, maxValue: Double, numBucket: Long): Long = { + +if (numBucket <= 0) { + throw new AnalysisException(s"The num of bucket must be greater than 0, but got ${numBucket}") +} + +val lower: Double = Math.min(minValue, maxValue) +val upper: Double = Math.max(minValue, maxValue) --- End diff -- Does other databases allow max value to appear first? i.e. `widthBucket(3.14, 4, 0, 3, 1)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18323#discussion_r123412204 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/MathUtilsSuite.scala --- @@ -0,0 +1,37 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.util.MathUtils._ + +class MathUtilsSuite extends SparkFunSuite { + + test("widthBucket") { +assert(widthBucket(5.35, 0.024, 10.06, 5) === 3) +assert(widthBucket(99, 100, 5000, 10) === 0) +assert(widthBucket(100, 100, 5000, 10) === 1) +assert(widthBucket(590, 100, 5000, 10) === 2) +assert(widthBucket(5000, 100, 5000, 10) === 11) +assert(widthBucket(6000, 100, 5000, 10) === 11) --- End diff -- Can you use the same cases from Oracle's doc? just to make sure we are getting the same results here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org