[GitHub] spark pull request #19984: [SPARK-22789] Map-only continuous processing exec...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/19984#discussion_r158116796 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala --- @@ -0,0 +1,195 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming.continuous + +import java.util.concurrent.atomic.AtomicLong + +import scala.collection.mutable + +import org.apache.spark.SparkEnv +import org.apache.spark.internal.Logging +import org.apache.spark.rpc.{RpcCallContext, RpcEndpointRef, RpcEnv, ThreadSafeRpcEndpoint} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper +import org.apache.spark.sql.sources.v2.reader.{ContinuousReader, PartitionOffset} +import org.apache.spark.sql.sources.v2.writer.{ContinuousWriter, WriterCommitMessage} +import org.apache.spark.util.RpcUtils + +private[continuous] sealed trait EpochCoordinatorMessage extends Serializable + +// Driver epoch trigger message +/** + * Atomically increment the current epoch and get the new value. + */ +private[sql] case class IncrementAndGetEpoch() extends EpochCoordinatorMessage + +// Init messages +/** + * Set the reader and writer partition counts. Tasks may not be started until the coordinator + * has acknowledged these messages. + */ +private[sql] case class SetReaderPartitions(numPartitions: Int) extends EpochCoordinatorMessage +case class SetWriterPartitions(numPartitions: Int) extends EpochCoordinatorMessage + +// Partition task messages +/** + * Get the current epoch. + */ +private[sql] case class GetCurrentEpoch() extends EpochCoordinatorMessage --- End diff -- nit: `case class` -> `case object` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19940 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85202/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19940 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19940 **[Test build #85202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85202/testReport)** for PR 19940 at commit [`c650196`](https://github.com/apache/spark/commit/c650196a4c0c4391d3112678803bd601af9aa5fb). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 thanks @hhbyyh, I updated the PR according to your suggestion and previous comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interfa...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r158138431 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala --- @@ -35,237 +34,252 @@ class SummarizerSuite extends SparkFunSuite with MLlibTestSparkContext { import SummaryBuilderImpl._ private case class ExpectedMetrics( - mean: Seq[Double], - variance: Seq[Double], + mean: Vector, + variance: Vector, count: Long, - numNonZeros: Seq[Long], - max: Seq[Double], - min: Seq[Double], - normL2: Seq[Double], - normL1: Seq[Double]) + numNonZeros: Vector, + max: Vector, + min: Vector, + normL2: Vector, + normL1: Vector) /** - * The input is expected to be either a sparse vector, a dense vector or an array of doubles - * (which will be converted to a dense vector) - * The expected is the list of all the known metrics. + * The input is expected to be either a sparse vector, a dense vector. * - * The tests take an list of input vectors and a list of all the summary values that - * are expected for this input. They currently test against some fixed subset of the - * metrics, but should be made fuzzy in the future. + * The tests take an list of input vectors, and compare results with + * `mllib.stat.MultivariateOnlineSummarizer`. They currently test against some fixed subset + * of the metrics, but should be made fuzzy in the future. */ - private def testExample(name: String, input: Seq[Any], exp: ExpectedMetrics): Unit = { + private def testExample(name: String, inputVec: Seq[(Vector, Double)], + exp: ExpectedMetrics, expWithoutWeight: ExpectedMetrics): Unit = { -def inputVec: Seq[Vector] = input.map { - case x: Array[Double @unchecked] => Vectors.dense(x) - case x: Seq[Double @unchecked] => Vectors.dense(x.toArray) - case x: Vector => x - case x => throw new Exception(x.toString) +val summarizer = { + val _summarizer = new MultivariateOnlineSummarizer + inputVec.foreach(v => _summarizer.add(OldVectors.fromML(v._1), v._2)) + _summarizer } -val summarizer = { +val summarizerWithoutWeight = { val _summarizer = new MultivariateOnlineSummarizer - inputVec.foreach(v => _summarizer.add(OldVectors.fromML(v))) + inputVec.foreach(v => _summarizer.add(OldVectors.fromML(v._1))) _summarizer } // Because the Spark context is reset between tests, we cannot hold a reference onto it. def wrappedInit() = { - val df = inputVec.map(Tuple1.apply).toDF("features") - val col = df.col("features") - (df, col) + val df = inputVec.toDF("features", "weight") + val featuresCol = df.col("features") + val weightCol = df.col("weight") + (df, featuresCol, weightCol) } registerTest(s"$name - mean only") { - val (df, c) = wrappedInit() - compare(df.select(metrics("mean").summary(c), mean(c)), Seq(Row(exp.mean), summarizer.mean)) + val (df, c, w) = wrappedInit() + compareRow(df.select(metrics("mean").summary(c, w), mean(c, w)).first(), +Row(Row(summarizer.mean), exp.mean)) } -registerTest(s"$name - mean only (direct)") { - val (df, c) = wrappedInit() - compare(df.select(mean(c)), Seq(exp.mean)) +registerTest(s"$name - mean only w/o weight") { + val (df, c, _) = wrappedInit() + compareRow(df.select(metrics("mean").summary(c), mean(c)).first(), +Row(Row(summarizerWithoutWeight.mean), expWithoutWeight.mean)) } registerTest(s"$name - variance only") { - val (df, c) = wrappedInit() - compare(df.select(metrics("variance").summary(c), variance(c)), -Seq(Row(exp.variance), summarizer.variance)) + val (df, c, w) = wrappedInit() + compareRow(df.select(metrics("variance").summary(c, w), variance(c, w)).first(), +Row(Row(summarizer.variance), exp.variance)) } -registerTest(s"$name - variance only (direct)") { - val (df, c) = wrappedInit() - compare(df.select(variance(c)), Seq(summarizer.variance)) +registerTest(s"$name - variance only w/o weight") { + val (df, c, _) = wrappedInit() + compareRow(df.select(metrics("variance").summary(c), variance(c)).first(), +Row(Row(summarizerWithoutWeight.variance), expWithoutWeight.variance)) } registerTest(s"$name - count only") { - val (df, c) = wrappedInit() - compare(df.select(metrics("count").s
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19993 **[Test build #85209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85209/testReport)** for PR 19993 at commit [`9872bfd`](https://github.com/apache/spark/commit/9872bfdb5a428a74ba387c5d86a271621fef0a04). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/20023 In am generally in favor of following the SQL standard. How about we do this. Let's make the standard behavior the default, and add a flag to revert to the old behavior. This allows us to ease users into the new behavior, and for us it can provide some data points on when we can remove the old behavior. I hope we can remove this for Spark 2.4 or later. At the end of the day it will be a bit more work, as I'd definitely would make an effort to isolate the the two behaviors as much as possible. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/20032 > (And what happens if set to 0/-ve ?) We have a check preventing that in the option itself. The value should be strictly greater than 0 ms. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20032 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85208/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20032 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/20032 Within the allocator's control loop, it's all asynchronous requests being made for executor pods from the k8s API, so, each loop doesn't take very long. If a user were to set a very low value for the delay, it wouldn't necessarily make more requests because the creation is still [rate limited](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L111) by the time it takes for the executors launched in the previous round to become ready (which is around 1s typically, if not higher). So, in the worst case, we may end up polling the state of our internal data structures quickly adding to the driver's CPU load, but the data structures are updated async by watches, so, it doesn't increase the load on the K8s API server or have to fetch any state over the network. I can add a caveat to the documentation that setting it to a very low number of ms may increase load on th e driver. If the intent with reducing batch internal is for people looking to spin up N executors as quickly as possible, if the resources are present, I'd recommend that they use the `batch.size` parameter instead and set that equal to num_executors. That would realize the same effect, and schedule all executors in a single loop. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20032 **[Test build #85208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85208/testReport)** for PR 20032 at commit [`4adb04b`](https://github.com/apache/spark/commit/4adb04bc9d1e3a27e64f7b92b83e6885ba3a04dc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158134032 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save DataFrame to Hive External table as compatible parquet format. + * 1. Create Hive External table with storage format as parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; + * Since we are not explicitly providing hive database location, it automatically takes default warehouse location + * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport(). + * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories + * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'. + */ + +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records" + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) +/* +If Data volume is very huge, then every partitions would have many small-small files which may harm +downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O. +To improve performance you can create single parquet file under each partition directory using 'repartition' +on partitioned key for Hive table. When you add partition to table, there will be change in table DDL. +Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET; + */ +hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite) + .partitionBy("key").parquet(hiveExternalTableLocation) + +/* + You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal + data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without --- End diff -- Sentences need some cleanup here. What do you mean by 'Int' argument? maybe it's best to point people to the API docs rather than incompletely repeat it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158133877 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save DataFrame to Hive External table as compatible parquet format. + * 1. Create Hive External table with storage format as parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; + * Since we are not explicitly providing hive database location, it automatically takes default warehouse location + * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport(). + * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories + * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'. + */ + +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records" + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) +/* +If Data volume is very huge, then every partitions would have many small-small files which may harm --- End diff -- This is more stuff that should go in docs, not comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? that's the only goal right here. What are small-small files? You have some inconsistent capitalization; Parquet should be capitalized but not file, bandwidth, etc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158133606 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* --- End diff -- Oh just noticed this. You're using javadoc style comments here, but they won't have effect. just use the `//` block style for comments that you see above, for consistency. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20016 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20016 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20032 **[Test build #85208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85208/testReport)** for PR 20032 at commit [`4adb04b`](https://github.com/apache/spark/commit/4adb04bc9d1e3a27e64f7b92b83e6885ba3a04dc). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20002 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85200/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20002 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20002 **[Test build #85200 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85200/testReport)** for PR 20002 at commit [`ca6aa08`](https://github.com/apache/spark/commit/ca6aa08e3d2f6a053992fb31faed35baa46fb5a6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r158131498 --- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala --- @@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends Assertions { | "Port": 300 |}, |"Block ID": "rdd_0_0", - |"Storage Level": { --- End diff -- I guess I mean, doesn't this no longer test whether it can read the verbose, old style format? like this test does here and the ones above, that are being removed? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20032#discussion_r158131312 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala --- @@ -217,7 +217,7 @@ private[spark] class KubernetesClusterSchedulerBackend( .watch(new ExecutorPodsWatcher())) allocatorExecutor.scheduleWithFixedDelay( - allocatorRunnable, 0L, podAllocationInterval, TimeUnit.SECONDS) + allocatorRunnable, 0L, podAllocationInterval.toLong, TimeUnit.MILLISECONDS) --- End diff -- Done. That's much better, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20029 This indeed is the primary change as it's open vs master. https://github.com/apache/spark/pull/19989 had some concerns about whether this affects correctness though? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19876 **[Test build #85207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85207/testReport)** for PR 19876 at commit [`b8844c7`](https://github.com/apache/spark/commit/b8844c75b7b0278f19cd340c3c036935c43feef4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/20032#discussion_r158129065 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala --- @@ -217,7 +217,7 @@ private[spark] class KubernetesClusterSchedulerBackend( .watch(new ExecutorPodsWatcher())) allocatorExecutor.scheduleWithFixedDelay( - allocatorRunnable, 0L, podAllocationInterval, TimeUnit.SECONDS) + allocatorRunnable, 0L, podAllocationInterval.toLong, TimeUnit.MILLISECONDS) --- End diff -- Why not use `conf.getTimeAsMs` for `podAllocationInterval` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20039: [SPARK-22850][core] Ensure queued events are delivered t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20039 **[Test build #85206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85206/testReport)** for PR 20039 at commit [`80b900a`](https://github.com/apache/spark/commit/80b900a3e4d26cc982df914ece79e5435b7ff5df). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/20039 [SPARK-22850][core] Ensure queued events are delivered to all event queues. The code in LiveListenerBus was queueing events before start in the queues themselves; so in situations like the following: bus.post(someEvent) bus.addToEventLogQueue(listener) bus.start() "someEvent" would not be delivered to "listener" if that was the first listener in the queue, because the queue wouldn't exist when the event was posted. This change buffers the events before starting the bus in the bus itself, so that they can be delivered to all registered queues when the bus is started. Also tweaked the unit tests to cover the behavior above. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vanzin/spark SPARK-22850 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20039.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20039 commit 80b900a3e4d26cc982df914ece79e5435b7ff5df Author: Marcelo Vanzin Date: 2017-12-20T20:17:43Z [SPARK-22850][core] Ensure queued events are delivered to all event queues. The code in LiveListenerBus was queueing events before start in the queues themselves; so in situations like the following: bus.post(someEvent) bus.addToEventLogQueue(listener) bus.start() "someEvent" would not be delivered to "listener" if that was the first listener in the queue. This change buffers the events before starting the bus in the bus itself, so that they can be delivered to all registered queues when the bus is started. Also tweaked the unit tests to cover the behavior above. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML expo...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r158126871 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging { this } + // override for Java compatibility + override def session(sparkSession: SparkSession): this.type = super.session(sparkSession) + + // override for Java compatibility + override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession) +} + +/** + * A ML Writer which delegates based on the requested format. + */ +class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging { + private var source: String = "internal" + /** - * Overwrites if the output path already exists. + * Specifies the format of ML export (e.g. PMML, internal, or + * the fully qualified class name for export). */ - @Since("1.6.0") - def overwrite(): this.type = { -shouldOverwrite = true + @Since("2.3.0") + def format(source: String): this.type = { +this.source = source this } + /** + * Dispatches the save to the correct MLFormat. + */ + @Since("2.3.0") + @throws[IOException]("If the input path already exists but overwrite is not enabled.") + @throws[SparkException]("If multiple sources for a given short name format are found.") + override protected def saveImpl(path: String) = { +val loader = Utils.getContextOrSparkClassLoader +val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader) +val stageName = stage.getClass.getName +val targetName = s"${source}+${stageName}" +val formats = serviceLoader.asScala.toList +val shortNames = formats.map(_.shortName()) +val writerCls = formats.filter(_.shortName().equalsIgnoreCase(targetName)) match { + // requested name did not match any given registered alias + case Nil => +Try(loader.loadClass(source)) match { + case Success(writer) => +// Found the ML writer using the fully qualified path +writer + case Failure(error) => +throw new SparkException( + s"Could not load requested format $source for $stageName ($targetName) had $formats" + + s"supporting $shortNames", error) +} + case head :: Nil => +head.getClass + case _ => +// Multiple sources +throw new SparkException( + s"Multiple writers found for $source+$stageName, try using the class name of the writer") +} +if (classOf[MLWriterFormat].isAssignableFrom(writerCls)) { + val writer = writerCls.newInstance().asInstanceOf[MLWriterFormat] + writer.write(path, sparkSession, optionMap, stage) +} else { + throw new SparkException("ML source $source is not a valid MLWriterFormat") --- End diff -- Good catch, I've added a test for this error message. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r158124202 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java --- @@ -203,22 +203,23 @@ private ShuffleMetrics() { this.appId = appId; this.execId = execId; String[] blockId0Parts = blockIds[0].split("_"); - if (blockId0Parts.length != 4 || !blockId0Parts[0].equals("shuffle")) { + if (blockId0Parts.length != 5 || !blockId0Parts[0].equals("shuffle")) { --- End diff -- This format change can cause incompatibility between shuffle service and spark application - causing a restart of the cluster and update of all spark applications I wish we had a better way to encode this information which was not so brittle. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r158121392 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java --- @@ -165,13 +165,23 @@ public ManagedBuffer getBlockData( String execId, int shuffleId, int mapId, - int reduceId) { + int reduceId, + int length) { --- End diff -- Please rename the variable - `length` is incorrect (here and other places), please rename to make it clear : `numBlocks `perhaps ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r158120856 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java --- @@ -203,22 +203,23 @@ private ShuffleMetrics() { this.appId = appId; this.execId = execId; String[] blockId0Parts = blockIds[0].split("_"); - if (blockId0Parts.length != 4 || !blockId0Parts[0].equals("shuffle")) { + if (blockId0Parts.length != 5 || !blockId0Parts[0].equals("shuffle")) { throw new IllegalArgumentException("Unexpected shuffle block id format: " + blockIds[0]); } this.shuffleId = Integer.parseInt(blockId0Parts[1]); - mapIdAndReduceIds = new int[2 * blockIds.length]; + mapIdAndReduceIds = new int[3 * blockIds.length]; --- End diff -- Please update description of the variable as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r158123441 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -812,10 +812,10 @@ private[spark] object MapOutputTracker extends Logging { logError(errorMessage) throw new MetadataFetchFailedException(shuffleId, startPartition, errorMessage) } else { -for (part <- startPartition until endPartition) { - splitsByAddress.getOrElseUpdate(status.location, ArrayBuffer()) += -((ShuffleBlockId(shuffleId, mapId, part), status.getSizeForBlock(part))) -} +val totalSize: Long = (startPartition until endPartition).map(status.getSizeForBlock).sum +splitsByAddress.getOrElseUpdate(status.location, ArrayBuffer()) += + ((ShuffleBlockId(shuffleId, mapId, startPartition, endPartition - startPartition), +totalSize)) --- End diff -- This is going to create some very heavy shuffle fetches - and looks incorrect. This merge should not be happening here, but in `ShuffleBlockFetcherIterator` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r158124569 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java --- @@ -165,13 +165,23 @@ public ManagedBuffer getBlockData( String execId, int shuffleId, int mapId, - int reduceId) { + int reduceId, + int length) { ExecutorShuffleInfo executor = executors.get(new AppExecId(appId, execId)); if (executor == null) { throw new RuntimeException( String.format("Executor is not registered (appId=%s, execId=%s)", appId, execId)); } -return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId); +return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId, length); + } + + public ManagedBuffer getBlockData( + String appId, + String execId, + int shuffleId, + int mapId, + int reduceId) { --- End diff -- Remove this method ? We dont need it anymore --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r158124746 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexInformation.java --- @@ -59,9 +59,9 @@ public int getSize() { /** * Get index offset for a particular reducer. */ - public ShuffleIndexRecord getIndex(int reduceId) { + public ShuffleIndexRecord getIndex(int reduceId, int length) { --- End diff -- perhaps `require` that length (number of Blocks) is >= 1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20037 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85197/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20037 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20037 **[Test build #85197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85197/testReport)** for PR 20037 at commit [`331ba33`](https://github.com/apache/spark/commit/331ba338ce020a927fcfd88b3bd7e536fc8d3b66). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20002: [SPARK-22465][Core][WIP] Add a safety-check to RD...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/20002#discussion_r158108582 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -57,7 +60,8 @@ object Partitioner { def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val rdds = (Seq(rdd) ++ others) val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0)) -if (hasPartitioner.nonEmpty) { +if (hasPartitioner.nonEmpty + && isEligiblePartitioner(hasPartitioner.maxBy(_.partitions.length), rdds)) { --- End diff -- `hasPartitioner.maxBy(_.partitions.length)` is used repeatedly, pull that into a variable ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20002: [SPARK-22465][Core][WIP] Add a safety-check to RD...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/20002#discussion_r158119432 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -67,6 +71,16 @@ object Partitioner { } } } + + /** + * Returns true if the number of partitions of the RDD is either greater than or is + * less than and within a single order of magnitude of the max number of upstream partitions; + * otherwise, returns false + */ + private def isEligiblePartitioner(hasMaxPartitioner: RDD[_], rdds: Seq[RDD[_]]): Boolean = { +val maxPartitions = rdds.map(_.partitions.length).max +log10(maxPartitions).floor - log10(hasMaxPartitioner.getNumPartitions).floor < 1 --- End diff -- Why `.floor` ? It causes unnecessary discontinuity imo, for example: (9, 11) will not satisfy - but it should. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20035 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85198/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20035 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20035 **[Test build #85198 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85198/testReport)** for PR 20035 at commit [`f0163e7`](https://github.com/apache/spark/commit/f0163e7b68aa09fef5c1dc7f25e00170354a1ab2). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884 **[Test build #85205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85205/testReport)** for PR 19884 at commit [`faa9f09`](https://github.com/apache/spark/commit/faa9f09faabdc8047b58283f9101142eecf1c754). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML expo...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r158114844 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging { this } + // override for Java compatibility + override def session(sparkSession: SparkSession): this.type = super.session(sparkSession) + + // override for Java compatibility + override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession) +} + +/** + * A ML Writer which delegates based on the requested format. + */ +class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging { --- End diff -- So I don't think that belongs in the base GeneralMLWriter, but we could make a trait for writers which support PMML to mix in? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19876 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19876 **[Test build #85204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85204/testReport)** for PR 19876 at commit [`6e9cdc3`](https://github.com/apache/spark/commit/6e9cdc37d63af8631e2595a6e20b67f495758d39). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19876 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85204/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19993 To make it available for other classes, we need to support checking for both `fit` and `transform`, that means we also need a sample input Dataset, so we may have to add the explicit test in each of the test suite. But we can still create some infrastructure function for the explicit test to invoke. E.g. we can create some function in object ParamsSuite or other places `checkMultiColumnParams(obj: Params, sampleData: Dataset[_])` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19876 **[Test build #85204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85204/testReport)** for PR 19876 at commit [`6e9cdc3`](https://github.com/apache/spark/commit/6e9cdc37d63af8631e2595a6e20b67f495758d39). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @srowen I think, we can merge this now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158113948 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -104,6 +103,60 @@ object SparkHiveExample { // ... // $example off:spark_hive$ --- End diff -- @srowen I mis-understood your first comment. I have reverted as suggested. Please check now --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20037 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML expo...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r158113236 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -994,6 +998,38 @@ class LinearRegressionSuite LinearRegressionSuite.allParamSettings, checkModelData) } + test("pmml export") { +val lr = new LinearRegression() +val model = lr.fit(datasetWithWeight) +def checkModel(pmml: PMML): Unit = { + val dd = pmml.getDataDictionary + assert(dd.getNumberOfFields === 3) + val fields = dd.getDataFields.asScala + assert(fields(0).getName().toString === "field_0") + assert(fields(0).getOpType() == OpType.CONTINUOUS) + val pmmlRegressionModel = pmml.getModels().get(0).asInstanceOf[PMMLRegressionModel] + val pmmlPredictors = pmmlRegressionModel.getRegressionTables.get(0).getNumericPredictors + val pmmlWeights = pmmlPredictors.asScala.map(_.getCoefficient()).toList + assert(pmmlWeights(0) ~== model.coefficients(0) relTol 1E-3) + assert(pmmlWeights(1) ~== model.coefficients(1) relTol 1E-3) +} +testPMMLWrite(sc, model, checkModel) + } + + test("unsupported export format") { --- End diff -- Sure, I'll put a dummy writer in test so it doesn't clog up our class space. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20037: [SPARK-22849] ivy.retrieve pattern should also co...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20037 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19884 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r158112470 --- Diff: python/pyspark/sql/tests.py --- @@ -3356,6 +3356,7 @@ def test_schema_conversion_roundtrip(self): self.assertEquals(self.schema, schema_rt) +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed") --- End diff -- @ueshin @HyukjinKwon just confirming that this test should be conditional on pandas/pyarrow being installed as we will check for a minimum pyarrow version when using `pandas_udf `? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884 **[Test build #85203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85203/testReport)** for PR 19884 at commit [`715f83d`](https://github.com/apache/spark/commit/715f83dfb96823fc79bca0fcd904c1ddeddaf6d6). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19884 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85203/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20037: [SPARK-22849] ivy.retrieve pattern should also co...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20037#discussion_r158112362 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -1271,7 +1271,7 @@ private[spark] object SparkSubmitUtils { // retrieve all resolved dependencies ivy.retrieve(rr.getModuleDescriptor.getModuleRevisionId, packagesDirectory.getAbsolutePath + File.separator + -"[organization]_[artifact]-[revision].[ext]", +"[organization]_[artifact]-[revision](-[classifier]).[ext]", --- End diff -- Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884 **[Test build #85203 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85203/testReport)** for PR 19884 at commit [`715f83d`](https://github.com/apache/spark/commit/715f83dfb96823fc79bca0fcd904c1ddeddaf6d6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19940 **[Test build #85202 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85202/testReport)** for PR 19940 at commit [`c650196`](https://github.com/apache/spark/commit/c650196a4c0c4391d3112678803bd601af9aa5fb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20037 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/19876 @sethah So I'm hesitant to push an API without an implementation to make sure its actually usable for our goal. But I'm fine splitting it out into a separate PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19884 Ok I did some local testing with these changes and pyarrow 0.8.0 with different combinations of Python and Pandas: **python 3.6.3, pandas 0.19.2** ERROR: test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ``` pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us, tz=UTC] would lose data: 288001 ``` **python 3.6.3, pandas 0.21.0** All tests pass **python 2.7.14, pandas 0.21.1** All tests pass **python 2.7.14, pandas 0.19.2** ERROR: test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ``` pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us, tz=UTC] would lose data: 288001 ``` It seems like pandas 0.19.2 has a timestamp issue, but let's see if it is reproduced in the Jenkins env here cc @wesm --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19884 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85195/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19884 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19884 **[Test build #85195 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85195/testReport)** for PR 19884 at commit [`d92ae90`](https://github.com/apache/spark/commit/d92ae90e05f55955eaad8e7f55e6324bf333a6bc). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @srowen Thank you for re-run, now it passes all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20033: [SPARK-22847] [CORE] Remove redundant code in AppStatusL...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20033 **[Test build #4017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4017/testReport)** for PR 20033 at commit [`ace04a5`](https://github.com/apache/spark/commit/ace04a5c75a0dc46e0575677be6be77ab6b58895). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19035: [SPARK-21822][SQL]When insert Hive Table is finished, it...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19035 @figo77 Could you close it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19035: [SPARK-21822][SQL]When insert Hive Table is finished, it...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19035 I think the problem is gone, right? Could you see the latest file version: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20016 **[Test build #4019 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4019/testReport)** for PR 20016 at commit [`319e282`](https://github.com/apache/spark/commit/319e282223d85018d4efd48065b5dc54be1149e9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20036 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20036 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85196/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19725: [DO NOT REVIEW][SPARK-22042] [SQL] Insert shuffle...
Github user tejasapatil closed the pull request at: https://github.com/apache/spark/pull/19725 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20036 **[Test build #85196 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85196/testReport)** for PR 20036 at commit [`53661eb`](https://github.com/apache/spark/commit/53661eb72bba55376bc6112b51c25489522d309c). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20017: [SPARK-22832][ML] BisectingKMeans unpersist unuse...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20017 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20017: [SPARK-22832][ML] BisectingKMeans unpersist unused datas...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20017 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19978: [SPARK-22784][CORE][WIP] Configure reading buffer size i...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19978 @MikhailErofeev was the conclusion here that you just want to set spark.taskMetrics.trackUpdatedBlockStatuses in your case? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20012: [SPARK-22824] Restore old offset for binary compa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20012 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19813 I found a few problems that this PR can ideally solve. If this is not available soon, I will use workaround in upcoming PRs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20016 **[Test build #4019 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4019/testReport)** for PR 20016 at commit [`319e282`](https://github.com/apache/spark/commit/319e282223d85018d4efd48065b5dc54be1149e9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19118 The change looks good to me, pending test case; thanks for looking into this @awarrior --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19813 I agree that string replacement is too dangerous (e.g. `a + 1 = a + 10` with `a + 1`). How about a contract with adding assertions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20012: [SPARK-22824] Restore old offset for binary compatibilit...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20012 LGTM. Thanks! Merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20018 Adding other contributor of the same file for review. cc\ @cloud-fan @aokolnychyi @liancheng @HyukjinKwon --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20038: [SPARK-22836][ui] Show driver logs in UI when available.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20038 **[Test build #85201 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85201/testReport)** for PR 20038 at commit [`52641db`](https://github.com/apache/spark/commit/52641db34bdb9fac5b5808fa2c6334837eeedbc3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157941778 --- Diff: examples/src/main/scala/org/apache/spark/examples/HdfsTest.scala --- @@ -39,7 +39,7 @@ object HdfsTest { val start = System.currentTimeMillis() for (x <- mapped) { x + 2 } val end = System.currentTimeMillis() - println("Iteration " + iter + " took " + (end-start) + " ms") + println(s"Iteration ${iter} took ${(end-start)} ms") --- End diff -- @HyukjinKwon $end-start won't work, both are different variables see. I made changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157941813 --- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala --- @@ -100,7 +100,7 @@ object SparkALS { ITERATIONS = iters.getOrElse("5").toInt slices = slices_.getOrElse("2").toInt case _ => -System.err.println("Usage: SparkALS [M] [U] [F] [iters] [partitions]") +System.err.println(s"Usage: SparkALS [M] [U] [F] [iters] [partitions]") --- End diff -- @HyukjinKwon Addressed ! Kindly do review --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157816044 --- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala --- @@ -95,7 +95,7 @@ object LocalALS { def showWarning() { System.err.println( - """WARN: This is a naive implementation of ALS and is given as an example! + s"""WARN: This is a naive implementation of ALS and is given as an example! --- End diff -- @mgaido91 Thank you for feedback, changed addressed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20038: [SPARK-22836][ui] Show driver logs in UI when ava...
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/20038 [SPARK-22836][ui] Show driver logs in UI when available. Port code from the old executors listener to the new one, so that the driver logs present in the application start event are kept. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vanzin/spark SPARK-22836 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20038.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20038 commit 52641db34bdb9fac5b5808fa2c6334837eeedbc3 Author: Marcelo Vanzin Date: 2017-12-20T18:29:51Z [SPARK-22836][ui] Show driver logs in UI when available. Port code from the old executors listener to the new one, so that the driver logs present in the application start event are kept. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157794109 --- Diff: examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala --- @@ -127,11 +125,11 @@ object DFSReadWriteTest { spark.stop() if (localWordCount == dfsWordCount) { - println(s"Success! Local Word Count ($localWordCount) " + -s"and DFS Word Count ($dfsWordCount) agree.") + println(s"Success! Local Word Count ($localWordCount) +and DFS Word Count ($dfsWordCount) agree.") --- End diff -- @srowen Thanks for review, I did addressed changes. Please review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157816026 --- Diff: examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala --- @@ -49,12 +49,10 @@ object DFSReadWriteTest { } private def printUsage(): Unit = { -val usage: String = "DFS Read-Write Test\n" + -"\n" + -"Usage: localFile dfsDir\n" + -"\n" + -"localFile - (string) local file to use in test\n" + -"dfsDir - (string) DFS directory for read/write tests\n" +val usage = s"""DFS Read-Write Test --- End diff -- @mgaido91 Thank you for feedback, changed addressed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157794196 --- Diff: examples/src/main/scala/org/apache/spark/examples/LocalFileLR.scala --- @@ -58,10 +58,10 @@ object LocalFileLR { // Initialize w to a random value val w = DenseVector.fill(D) {2 * rand.nextDouble - 1} -println("Initial w: " + w) +println(s"Initial w: ${w}") --- End diff -- @srowen Thanks for review, I did addressed changes. Please review --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157941808 --- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala --- @@ -80,7 +80,7 @@ object SparkALS { def showWarning() { System.err.println( - """WARN: This is a naive implementation of ALS and is given as an example! + s"""WARN: This is a naive implementation of ALS and is given as an example! --- End diff -- @HyukjinKwon Addressed ! Kindly do review --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @srowen Thanks for response, correct - i went through error of jenkins and found error Online[https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4018/console] which fixed accordingly to me and committed so please take a look, if not correct please suggest the same. Thank you --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158100765 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -104,6 +103,60 @@ object SparkHiveExample { // ... // $example off:spark_hive$ --- End diff -- Why do you turn the example listing off then on again? just remove those two lines --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20018 @srowen Can you please review and if everything seems correct then run test build --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org