[GitHub] spark pull request #19984: [SPARK-22789] Map-only continuous processing exec...

2017-12-20 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/19984#discussion_r158116796
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala ---
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.continuous
+
+import java.util.concurrent.atomic.AtomicLong
+
+import scala.collection.mutable
+
+import org.apache.spark.SparkEnv
+import org.apache.spark.internal.Logging
+import org.apache.spark.rpc.{RpcCallContext, RpcEndpointRef, RpcEnv, ThreadSafeRpcEndpoint}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
+import org.apache.spark.sql.sources.v2.reader.{ContinuousReader, PartitionOffset}
+import org.apache.spark.sql.sources.v2.writer.{ContinuousWriter, WriterCommitMessage}
+import org.apache.spark.util.RpcUtils
+
+private[continuous] sealed trait EpochCoordinatorMessage extends Serializable
+
+// Driver epoch trigger message
+/**
+ * Atomically increment the current epoch and get the new value.
+ */
+private[sql] case class IncrementAndGetEpoch() extends EpochCoordinatorMessage
+
+// Init messages
+/**
+ * Set the reader and writer partition counts. Tasks may not be started until the coordinator
+ * has acknowledged these messages.
+ */
+private[sql] case class SetReaderPartitions(numPartitions: Int) extends EpochCoordinatorMessage
+case class SetWriterPartitions(numPartitions: Int) extends EpochCoordinatorMessage
+
+// Partition task messages
+/**
+ * Get the current epoch.
+ */
+private[sql] case class GetCurrentEpoch() extends EpochCoordinatorMessage
--- End diff --

nit: `case class` -> `case object`
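
For illustration, the suggested change would look like this (a sketch; a parameterless message carries no state, so a singleton suffices):

```scala
// Before: every request allocates a new (empty) message instance.
// private[sql] case class GetCurrentEpoch() extends EpochCoordinatorMessage
// After (suggested): a singleton, since the message has no fields.
private[sql] case object GetCurrentEpoch extends EpochCoordinatorMessage
```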


---




[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19940
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85202/
Test FAILed.


---




[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19940
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19940
  
**[Test build #85202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85202/testReport)** for PR 19940 at commit [`c650196`](https://github.com/apache/spark/commit/c650196a4c0c4391d3112678803bd601af9aa5fb).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...

2017-12-20 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19993
  
thanks @hhbyyh, I updated the PR according to your suggestion and previous 
comments.


---




[GitHub] spark pull request #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interfa...

2017-12-20 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19156#discussion_r158138431
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala ---
@@ -35,237 +34,252 @@ class SummarizerSuite extends SparkFunSuite with MLlibTestSparkContext {
   import SummaryBuilderImpl._
 
   private case class ExpectedMetrics(
-  mean: Seq[Double],
-  variance: Seq[Double],
+  mean: Vector,
+  variance: Vector,
   count: Long,
-  numNonZeros: Seq[Long],
-  max: Seq[Double],
-  min: Seq[Double],
-  normL2: Seq[Double],
-  normL1: Seq[Double])
+  numNonZeros: Vector,
+  max: Vector,
+  min: Vector,
+  normL2: Vector,
+  normL1: Vector)
 
   /**
-   * The input is expected to be either a sparse vector, a dense vector or an array of doubles
-   * (which will be converted to a dense vector)
-   * The expected is the list of all the known metrics.
+   * The input is expected to be either a sparse vector or a dense vector.
    *
-   * The tests take an list of input vectors and a list of all the summary values that
-   * are expected for this input. They currently test against some fixed subset of the
-   * metrics, but should be made fuzzy in the future.
+   * The tests take a list of input vectors, and compare results with
+   * `mllib.stat.MultivariateOnlineSummarizer`. They currently test against some fixed subset
+   * of the metrics, but should be made fuzzy in the future.
    */
-  private def testExample(name: String, input: Seq[Any], exp: ExpectedMetrics): Unit = {
+  private def testExample(name: String, inputVec: Seq[(Vector, Double)],
+  exp: ExpectedMetrics, expWithoutWeight: ExpectedMetrics): Unit = {
 
-def inputVec: Seq[Vector] = input.map {
-  case x: Array[Double @unchecked] => Vectors.dense(x)
-  case x: Seq[Double @unchecked] => Vectors.dense(x.toArray)
-  case x: Vector => x
-  case x => throw new Exception(x.toString)
+val summarizer = {
+  val _summarizer = new MultivariateOnlineSummarizer
+  inputVec.foreach(v => _summarizer.add(OldVectors.fromML(v._1), v._2))
+  _summarizer
 }
 
-val summarizer = {
+val summarizerWithoutWeight = {
   val _summarizer = new MultivariateOnlineSummarizer
-  inputVec.foreach(v => _summarizer.add(OldVectors.fromML(v)))
+  inputVec.foreach(v => _summarizer.add(OldVectors.fromML(v._1)))
   _summarizer
 }
 
 // Because the Spark context is reset between tests, we cannot hold a reference onto it.
 def wrappedInit() = {
-  val df = inputVec.map(Tuple1.apply).toDF("features")
-  val col = df.col("features")
-  (df, col)
+  val df = inputVec.toDF("features", "weight")
+  val featuresCol = df.col("features")
+  val weightCol = df.col("weight")
+  (df, featuresCol, weightCol)
 }
 
 registerTest(s"$name - mean only") {
-  val (df, c) = wrappedInit()
-  compare(df.select(metrics("mean").summary(c), mean(c)), Seq(Row(exp.mean), summarizer.mean))
+  val (df, c, w) = wrappedInit()
+  compareRow(df.select(metrics("mean").summary(c, w), mean(c, w)).first(),
+Row(Row(summarizer.mean), exp.mean))
 }
 
-registerTest(s"$name - mean only (direct)") {
-  val (df, c) = wrappedInit()
-  compare(df.select(mean(c)), Seq(exp.mean))
+registerTest(s"$name - mean only w/o weight") {
+  val (df, c, _) = wrappedInit()
+  compareRow(df.select(metrics("mean").summary(c), mean(c)).first(),
+Row(Row(summarizerWithoutWeight.mean), expWithoutWeight.mean))
 }
 
 registerTest(s"$name - variance only") {
-  val (df, c) = wrappedInit()
-  compare(df.select(metrics("variance").summary(c), variance(c)),
-Seq(Row(exp.variance), summarizer.variance))
+  val (df, c, w) = wrappedInit()
+  compareRow(df.select(metrics("variance").summary(c, w), variance(c, w)).first(),
+Row(Row(summarizer.variance), exp.variance))
 }
 
-registerTest(s"$name - variance only (direct)") {
-  val (df, c) = wrappedInit()
-  compare(df.select(variance(c)), Seq(summarizer.variance))
+registerTest(s"$name - variance only w/o weight") {
+  val (df, c, _) = wrappedInit()
+  compareRow(df.select(metrics("variance").summary(c), variance(c)).first(),
+Row(Row(summarizerWithoutWeight.variance), expWithoutWeight.variance))
 }
 
 registerTest(s"$name - count only") {
-  val (df, c) = wrappedInit()
-  compare(df.select(metrics("count").s

[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19993
  
**[Test build #85209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85209/testReport)** for PR 19993 at commit [`9872bfd`](https://github.com/apache/spark/commit/9872bfdb5a428a74ba387c5d86a271621fef0a04).


---




[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2017-12-20 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/20023
  
I am generally in favor of following the SQL standard. How about we do this: let's make the standard behavior the default, and add a flag to revert to the old behavior. This eases users into the new behavior, and it can give us some data points on when we can remove the old one. I hope we can remove it in Spark 2.4 or later.

At the end of the day it will be a bit more work, as I'd definitely make an effort to isolate the two behaviors as much as possible.
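
For concreteness, such a flag could look roughly like this, assuming SQLConf's internal `buildConf` builder; the name and wording are hypothetical, not a committed API:

```scala
// Hypothetical legacy-behavior flag (sketch only; the name is illustrative).
val DECIMAL_OPERATIONS_LEGACY_MODE =
  buildConf("spark.sql.decimalOperations.legacyMode")
    .doc("When true, keep the pre-existing precision/scale rules for decimal " +
      "arithmetic instead of the SQL-standard behavior.")
    .booleanConf
    .createWithDefault(false)
```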


---




[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...

2017-12-20 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/20032
  
> (And what happens if set to 0/-ve ?)

We have a check preventing that in the option itself. The value should be strictly greater than 0 ms.
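
Spark's internal config builder supports this kind of validation; a minimal sketch of such a check (option name from this PR, wording illustrative):

```scala
// Sketch: reject non-positive values at configuration time.
val KUBERNETES_ALLOCATION_BATCH_DELAY =
  ConfigBuilder("spark.kubernetes.allocation.batch.delay")
    .doc("Time to wait between each round of executor pod allocation.")
    .timeConf(TimeUnit.MILLISECONDS)
    .checkValue(delay => delay > 0, "Allocation batch delay must be a positive time value.")
    .createWithDefault(1000L)
```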


---




[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20032
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85208/
Test PASSed.


---




[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20032
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...

2017-12-20 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/20032
  
Within the allocator's control loop, it's all asynchronous requests being made for executor pods from the k8s API, so each loop doesn't take very long. If a user were to set a very low value for the delay, it wouldn't necessarily make more requests, because creation is still [rate limited](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L111) by the time it takes for the executors launched in the previous round to become ready (typically around 1s, if not higher). So, in the worst case, we may end up polling the state of our internal data structures quickly, adding to the driver's CPU load; but those data structures are updated asynchronously by watches, so it doesn't increase load on the K8s API server or fetch any state over the network. I can add a caveat to the documentation that setting it to a very low number of ms may increase load on the driver.

If the intent with reducing the batch interval is to let people spin up N executors as quickly as possible when the resources are present, I'd recommend that they use the `batch.size` parameter instead and set it equal to num_executors. That would realize the same effect, and schedule all executors in a single loop; see the sketch below.
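
For example, a driver-side configuration along those lines (values illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: request all executors in one allocation round rather than
// shrinking the delay between rounds.
val conf = new SparkConf()
  .set("spark.executor.instances", "50")
  .set("spark.kubernetes.allocation.batch.size", "50") // = desired executor count
```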


---




[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20032
  
**[Test build #85208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85208/testReport)** for PR 20032 at commit [`4adb04b`](https://github.com/apache/spark/commit/4adb04bc9d1e3a27e64f7b92b83e6885ba3a04dc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158134032
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+/*
+ You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
+ data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
--- End diff --

Sentences need some cleanup here. What do you mean by 'Int' argument? Maybe it's best to point people to the API docs rather than incompletely repeating them.
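
For instance, the example could just show the call and defer the details to the API docs; `coalesce(n)` takes the target number of partitions, and hence output files per directory (a sketch built from the names in this example):

```scala
// coalesce(10) caps the write at 10 partitions (so at most 10 files per
// partition directory) without the full shuffle that repartition performs.
hiveTableDF.coalesce(10)
  .write.mode(SaveMode.Overwrite)
  .partitionBy("key")
  .parquet(hiveExternalTableLocation)
```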


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158133877
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
--- End diff --

This is more stuff that should go in docs, not comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? That's the only goal here.

What are small-small files? You have some inconsistent capitalization; Parquet should be capitalized, but not file, bandwidth, etc.



---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158133606
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
--- End diff --

Oh, just noticed this. You're using javadoc-style comments here, but they won't have any effect. Just use the `//` block style for comments that you see above, for consistency.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20016


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20016
  
Merged to master


---




[GitHub] spark issue #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes.alloca...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20032
  
**[Test build #85208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85208/testReport)** for PR 20032 at commit [`4adb04b`](https://github.com/apache/spark/commit/4adb04bc9d1e3a27e64f7b92b83e6885ba3a04dc).


---




[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20002
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85200/
Test FAILed.


---




[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20002
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20002: [SPARK-22465][Core][WIP] Add a safety-check to RDD defau...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20002
  
**[Test build #85200 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85200/testReport)** for PR 20002 at commit [`ca6aa08`](https://github.com/apache/spark/commit/ca6aa08e3d2f6a053992fb31faed35baa46fb5a6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19992#discussion_r158131498
  
--- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala ---
@@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends Assertions {
   |  "Port": 300
   |},
   |"Block ID": "rdd_0_0",
-  |"Storage Level": {
--- End diff --

I guess I mean: doesn't this no longer test whether it can read the verbose, old-style format? Like this test does here, and the ones above that are being removed?


---




[GitHub] spark pull request #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes...

2017-12-20 Thread foxish
Github user foxish commented on a diff in the pull request:

https://github.com/apache/spark/pull/20032#discussion_r158131312
  
--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala ---
@@ -217,7 +217,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
 .watch(new ExecutorPodsWatcher()))
 
 allocatorExecutor.scheduleWithFixedDelay(
-  allocatorRunnable, 0L, podAllocationInterval, TimeUnit.SECONDS)
+  allocatorRunnable, 0L, podAllocationInterval.toLong, TimeUnit.MILLISECONDS)
--- End diff --

Done. That's much better, thanks! 


---




[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2017-12-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20029
  
This is indeed the primary change, as the diff against master shows. https://github.com/apache/spark/pull/19989 had some concerns about whether this affects correctness, though?


---




[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19876
  
**[Test build #85207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85207/testReport)** for PR 19876 at commit [`b8844c7`](https://github.com/apache/spark/commit/b8844c75b7b0278f19cd340c3c036935c43feef4).


---




[GitHub] spark pull request #20032: [SPARK-22845] [Scheduler] Modify spark.kubernetes...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/20032#discussion_r158129065
  
--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala ---
@@ -217,7 +217,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
 .watch(new ExecutorPodsWatcher()))
 
 allocatorExecutor.scheduleWithFixedDelay(
-  allocatorRunnable, 0L, podAllocationInterval, TimeUnit.SECONDS)
+  allocatorRunnable, 0L, podAllocationInterval.toLong, TimeUnit.MILLISECONDS)
--- End diff --

Why not use `conf.getTimeAsMs` for `podAllocationInterval`?
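
A minimal sketch of what that would look like (assuming `conf` is the `SparkConf` in scope; the default shown is illustrative):

```scala
// Sketch: let SparkConf parse the interval straight into milliseconds.
private val podAllocationInterval: Long =
  conf.getTimeAsMs("spark.kubernetes.allocation.batch.delay", "1s")
```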


---




[GitHub] spark issue #20039: [SPARK-22850][core] Ensure queued events are delivered t...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20039
  
**[Test build #85206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85206/testReport)** for PR 20039 at commit [`80b900a`](https://github.com/apache/spark/commit/80b900a3e4d26cc982df914ece79e5435b7ff5df).


---




[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...

2017-12-20 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/20039

[SPARK-22850][core] Ensure queued events are delivered to all event queues.

The code in LiveListenerBus was queueing events before start in the
queues themselves; so in situations like the following:

   bus.post(someEvent)
   bus.addToEventLogQueue(listener)
   bus.start()

"someEvent" would not be delivered to "listener" if that was the first
listener in the queue, because the queue wouldn't exist when the
event was posted.

This change buffers the events before starting the bus in the bus itself,
so that they can be delivered to all registered queues when the bus is
started.

Also tweaked the unit tests to cover the behavior above.
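
In outline, the buffering idea looks like this (a simplified, self-contained model of the approach, not the actual patch):

```scala
import java.util.concurrent.CopyOnWriteArrayList
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.JavaConverters._
import scala.collection.mutable

// Simplified model: events posted before start() are buffered in the bus
// itself and replayed into every registered queue at start().
class BufferingBus[E] {
  private val started = new AtomicBoolean(false)
  private val queues = new CopyOnWriteArrayList[E => Unit]()
  private val queuedEvents = new mutable.ListBuffer[E]()

  def addQueue(deliver: E => Unit): Unit = queues.add(deliver)

  def post(event: E): Unit = synchronized {
    if (!started.get()) queuedEvents += event       // before start: buffer
    else queues.asScala.foreach(_.apply(event))     // after start: deliver
  }

  def start(): Unit = synchronized {
    started.set(true)
    queuedEvents.foreach(e => queues.asScala.foreach(_.apply(e))) // replay
    queuedEvents.clear()
  }
}
```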

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-22850

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20039.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20039


commit 80b900a3e4d26cc982df914ece79e5435b7ff5df
Author: Marcelo Vanzin 
Date:   2017-12-20T20:17:43Z

[SPARK-22850][core] Ensure queued events are delivered to all event queues.

The code in LiveListenerBus was queueing events before start in the
queues themselves; so in situations like the following:

   bus.post(someEvent)
   bus.addToEventLogQueue(listener)
   bus.start()

"someEvent" would not be delivered to "listener" if that was the first
listener in the queue.

This change buffers the events before starting the bus in the bus itself,
so that they can be delivered to all registered queues when the bus is
started.

Also tweaked the unit tests to cover the behavior above.




---




[GitHub] spark pull request #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML expo...

2017-12-20 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r158126871
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
 this
   }
 
+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
+  private var source: String = "internal"
+
   /**
-   * Overwrites if the output path already exists.
+   * Specifies the format of ML export (e.g. PMML, internal, or
+   * the fully qualified class name for export).
    */
-  @Since("1.6.0")
-  def overwrite(): this.type = {
-shouldOverwrite = true
+  @Since("2.3.0")
+  def format(source: String): this.type = {
+this.source = source
 this
   }
 
+  /**
+   * Dispatches the save to the correct MLFormat.
+   */
+  @Since("2.3.0")
+  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
+  @throws[SparkException]("If multiple sources for a given short name format are found.")
+  override protected def saveImpl(path: String) = {
+val loader = Utils.getContextOrSparkClassLoader
+val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader)
+val stageName = stage.getClass.getName
+val targetName = s"${source}+${stageName}"
+val formats = serviceLoader.asScala.toList
+val shortNames = formats.map(_.shortName())
+val writerCls = formats.filter(_.shortName().equalsIgnoreCase(targetName)) match {
+  // requested name did not match any given registered alias
+  case Nil =>
+Try(loader.loadClass(source)) match {
+  case Success(writer) =>
+// Found the ML writer using the fully qualified path
+writer
+  case Failure(error) =>
+throw new SparkException(
+  s"Could not load requested format $source for $stageName ($targetName) had $formats" +
+  s"supporting $shortNames", error)
+}
+  case head :: Nil =>
+head.getClass
+  case _ =>
+// Multiple sources
+throw new SparkException(
+  s"Multiple writers found for $source+$stageName, try using the class name of the writer")
+}
+if (classOf[MLWriterFormat].isAssignableFrom(writerCls)) {
+  val writer = writerCls.newInstance().asInstanceOf[MLWriterFormat]
+  writer.write(path, sparkSession, optionMap, stage)
+} else {
+  throw new SparkException("ML source $source is not a valid MLWriterFormat")
--- End diff --

Good catch, I've added a test for this error message.


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r158124202
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java ---
@@ -203,22 +203,23 @@ private ShuffleMetrics() {
   this.appId = appId;
   this.execId = execId;
   String[] blockId0Parts = blockIds[0].split("_");
-  if (blockId0Parts.length != 4 || !blockId0Parts[0].equals("shuffle")) {
+  if (blockId0Parts.length != 5 || !blockId0Parts[0].equals("shuffle")) {
--- End diff --

This format change can cause incompatibility between the shuffle service and Spark applications - causing a restart of the cluster and an update of all Spark applications. I wish we had a better way to encode this information that was not so brittle.
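
For context, the id being parsed gains a fifth component; roughly (a sketch inferred from the parsing code above, with the last field being the block count):

```scala
// Old layout: shuffle_<shuffleId>_<mapId>_<reduceId>          (4 parts)
// New layout: shuffle_<shuffleId>_<mapId>_<reduceId>_<count>  (5 parts)
val parts = "shuffle_3_7_12_4".split("_")
assert(parts.length == 5 && parts(0) == "shuffle")
```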


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r158121392
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
@@ -165,13 +165,23 @@ public ManagedBuffer getBlockData(
   String execId,
   int shuffleId,
   int mapId,
-  int reduceId) {
+  int reduceId,
+  int length) {
--- End diff --

Please rename the variable - `length` is incorrect (here and in other places); rename to make it clear: `numBlocks` perhaps?


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r158120856
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java ---
@@ -203,22 +203,23 @@ private ShuffleMetrics() {
   this.appId = appId;
   this.execId = execId;
   String[] blockId0Parts = blockIds[0].split("_");
-  if (blockId0Parts.length != 4 || !blockId0Parts[0].equals("shuffle")) {
+  if (blockId0Parts.length != 5 || !blockId0Parts[0].equals("shuffle")) {
 throw new IllegalArgumentException("Unexpected shuffle block id format: " + blockIds[0]);
   }
   this.shuffleId = Integer.parseInt(blockId0Parts[1]);
-  mapIdAndReduceIds = new int[2 * blockIds.length];
+  mapIdAndReduceIds = new int[3 * blockIds.length];
--- End diff --

Please update description of the variable as well.


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r158123441
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -812,10 +812,10 @@ private[spark] object MapOutputTracker extends Logging {
 logError(errorMessage)
 throw new MetadataFetchFailedException(shuffleId, startPartition, errorMessage)
   } else {
-for (part <- startPartition until endPartition) {
-  splitsByAddress.getOrElseUpdate(status.location, ArrayBuffer()) +=
-((ShuffleBlockId(shuffleId, mapId, part), status.getSizeForBlock(part)))
-}
+val totalSize: Long = (startPartition until endPartition).map(status.getSizeForBlock).sum
+splitsByAddress.getOrElseUpdate(status.location, ArrayBuffer()) +=
+  ((ShuffleBlockId(shuffleId, mapId, startPartition, endPartition - startPartition),
+totalSize))
--- End diff --

This is going to create some very heavy shuffle fetches - and looks incorrect. This merge should not be happening here, but in `ShuffleBlockFetcherIterator`.


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r158124569
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
@@ -165,13 +165,23 @@ public ManagedBuffer getBlockData(
   String execId,
   int shuffleId,
   int mapId,
-  int reduceId) {
+  int reduceId,
+  int length) {
 ExecutorShuffleInfo executor = executors.get(new AppExecId(appId, execId));
 if (executor == null) {
   throw new RuntimeException(
 String.format("Executor is not registered (appId=%s, execId=%s)", appId, execId));
 }
-return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
+return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId, length);
+  }
+
+  public ManagedBuffer getBlockData(
+  String appId,
+  String execId,
+  int shuffleId,
+  int mapId,
+  int reduceId) {
--- End diff --

Remove this method? We don't need it anymore.


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r158124746
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ShuffleIndexInformation.java ---
@@ -59,9 +59,9 @@ public int getSize() {
   /**
* Get index offset for a particular reducer.
*/
-  public ShuffleIndexRecord getIndex(int reduceId) {
+  public ShuffleIndexRecord getIndex(int reduceId, int length) {
--- End diff --

Perhaps `require` that length (the number of blocks) is >= 1?
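
A sketch of the suggested guard (shown Scala-style; the file itself is Java, where Guava's `Preconditions.checkArgument` plays the same role as `require`):

```scala
require(length >= 1, s"Expected at least one block, got $length")
```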


---




[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20037
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85197/
Test FAILed.


---




[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20037
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20037
  
**[Test build #85197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85197/testReport)** for PR 20037 at commit [`331ba33`](https://github.com/apache/spark/commit/331ba338ce020a927fcfd88b3bd7e536fc8d3b66).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20002: [SPARK-22465][Core][WIP] Add a safety-check to RD...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/20002#discussion_r158108582
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -57,7 +60,8 @@ object Partitioner {
   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
 val rdds = (Seq(rdd) ++ others)
 val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
-if (hasPartitioner.nonEmpty) {
+if (hasPartitioner.nonEmpty
+  && isEligiblePartitioner(hasPartitioner.maxBy(_.partitions.length), rdds)) {
--- End diff --

`hasPartitioner.maxBy(_.partitions.length)` is used repeatedly; pull that into a variable?
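
A sketch of the suggested refactor, keeping the evaluation behind the `nonEmpty` guard since `maxBy` throws on an empty sequence (illustrative, not the committed fix):

```scala
// Evaluate maxBy once and reuse the result.
val hasMaxPartitioner =
  if (hasPartitioner.nonEmpty) Some(hasPartitioner.maxBy(_.partitions.length)) else None
if (hasMaxPartitioner.exists(isEligiblePartitioner(_, rdds))) {
  hasMaxPartitioner.get.partitioner.get
}
```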


---




[GitHub] spark pull request #20002: [SPARK-22465][Core][WIP] Add a safety-check to RD...

2017-12-20 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/20002#discussion_r158119432
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -67,6 +71,16 @@ object Partitioner {
   }
 }
   }
+
+  /**
+   * Returns true if the number of partitions of the RDD is either greater than or is
+   * less than and within a single order of magnitude of the max number of upstream partitions;
+   * otherwise, returns false.
+   */
+  private def isEligiblePartitioner(hasMaxPartitioner: RDD[_], rdds: Seq[RDD[_]]): Boolean = {
+val maxPartitions = rdds.map(_.partitions.length).max
+log10(maxPartitions).floor - log10(hasMaxPartitioner.getNumPartitions).floor < 1
--- End diff --

Why `.floor`? It causes unnecessary discontinuity imo; for example, (9, 11) will not satisfy it - but it should. See the quick check below.
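
Concretely, with 11 upstream partitions and a 9-partition partitioner:

```scala
math.log10(11).floor - math.log10(9).floor // 1.0 - 0.0 = 1.0, not < 1: rejected
math.log10(11) - math.log10(9)             // ~0.087, < 1: accepted without floor
```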


---




[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20035
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85198/
Test FAILed.


---




[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20035
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20035: [SPARK-22848][SQL] Eliminate mutable state from Stack

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20035
  
**[Test build #85198 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85198/testReport)** for PR 20035 at commit [`f0163e7`](https://github.com/apache/spark/commit/f0163e7b68aa09fef5c1dc7f25e00170354a1ab2).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19884
  
**[Test build #85205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85205/testReport)** for PR 19884 at commit [`faa9f09`](https://github.com/apache/spark/commit/faa9f09faabdc8047b58283f9101142eecf1c754).


---




[GitHub] spark pull request #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML expo...

2017-12-20 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r158114844
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
 this
   }
 
+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
--- End diff --

So I don't think that belongs in the base GeneralMLWriter, but we could make a trait for writers that support PMML to mix in?
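
One possible shape for such a mix-in (hypothetical; neither the name nor the method is a committed API):

```scala
// Hypothetical marker trait for model writers that can emit PMML.
trait PMMLWritable {
  /** Export this model as PMML to the given path. */
  def toPMML(path: String): Unit
}
```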


---




[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19876
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19876
  
**[Test build #85204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85204/testReport)** for PR 19876 at commit [`6e9cdc3`](https://github.com/apache/spark/commit/6e9cdc37d63af8631e2595a6e20b67f495758d39).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19876
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85204/
Test FAILed.


---




[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...

2017-12-20 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/19993
  
To make it available for other classes, we need to support checking for both `fit` and `transform`; that means we also need a sample input Dataset, so we may have to add the explicit test in each of the test suites. But we can still create some infrastructure function for the explicit tests to invoke, e.g. a function in object ParamsSuite or some other place: `checkMultiColumnParams(obj: Params, sampleData: Dataset[_])`. A rough sketch follows.
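
A rough sketch of such a helper, assuming ScalaTest's `intercept`/`fail` are in scope and that an invalid multi-column setting should fail fast (all details illustrative):

```scala
import org.apache.spark.ml.{Estimator, Transformer}
import org.apache.spark.ml.param.Params
import org.apache.spark.sql.Dataset
import org.scalatest.Assertions._

// Hypothetical shared check: with conflicting single-/multi-column params
// already set on `obj`, running the stage should raise an exception.
def checkMultiColumnParams(obj: Params, sampleData: Dataset[_]): Unit = {
  obj match {
    case e: Estimator[_] =>
      intercept[IllegalArgumentException] { e.fit(sampleData) }
    case t: Transformer =>
      intercept[IllegalArgumentException] { t.transform(sampleData) }
    case other =>
      fail(s"Expected an Estimator or Transformer, got ${other.getClass}")
  }
}
```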


---




[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19876
  
**[Test build #85204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85204/testReport)** for PR 19876 at commit [`6e9cdc3`](https://github.com/apache/spark/commit/6e9cdc37d63af8631e2595a6e20b67f495758d39).


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@srowen I think we can merge this now.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158113948
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen I misunderstood your first comment. I have reverted as suggested. Please check now.


---




[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...

2017-12-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20037
  
Thanks! Merged to master.


---




[GitHub] spark pull request #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML expo...

2017-12-20 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r158113236
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
@@ -994,6 +998,38 @@ class LinearRegressionSuite
   LinearRegressionSuite.allParamSettings, checkModelData)
   }
 
+  test("pmml export") {
+val lr = new LinearRegression()
+val model = lr.fit(datasetWithWeight)
+def checkModel(pmml: PMML): Unit = {
+  val dd = pmml.getDataDictionary
+  assert(dd.getNumberOfFields === 3)
+  val fields = dd.getDataFields.asScala
+  assert(fields(0).getName().toString === "field_0")
+  assert(fields(0).getOpType() == OpType.CONTINUOUS)
+  val pmmlRegressionModel = pmml.getModels().get(0).asInstanceOf[PMMLRegressionModel]
+  val pmmlPredictors = pmmlRegressionModel.getRegressionTables.get(0).getNumericPredictors
+  val pmmlWeights = pmmlPredictors.asScala.map(_.getCoefficient()).toList
+  assert(pmmlWeights(0) ~== model.coefficients(0) relTol 1E-3)
+  assert(pmmlWeights(1) ~== model.coefficients(1) relTol 1E-3)
+}
+testPMMLWrite(sc, model, checkModel)
+  }
+
+  test("unsupported export format") {
--- End diff --

Sure, I'll put a dummy writer in test so it doesn't clog up our class space.


---




[GitHub] spark pull request #20037: [SPARK-22849] ivy.retrieve pattern should also co...

2017-12-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20037


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19884
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/19884#discussion_r158112470
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3356,6 +3356,7 @@ def test_schema_conversion_roundtrip(self):
 self.assertEquals(self.schema, schema_rt)
 
 
+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
--- End diff --

@ueshin @HyukjinKwon just confirming that this test should be conditional on pandas/pyarrow being installed, as we will check for a minimum pyarrow version when using `pandas_udf`?


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19884
  
**[Test build #85203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85203/testReport)** for PR 19884 at commit [`715f83d`](https://github.com/apache/spark/commit/715f83dfb96823fc79bca0fcd904c1ddeddaf6d6).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19884
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85203/
Test FAILed.


---




[GitHub] spark pull request #20037: [SPARK-22849] ivy.retrieve pattern should also co...

2017-12-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20037#discussion_r158112362
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -1271,7 +1271,7 @@ private[spark] object SparkSubmitUtils {
 // retrieve all resolved dependencies
 ivy.retrieve(rr.getModuleDescriptor.getModuleRevisionId,
   packagesDirectory.getAbsolutePath + File.separator +
-"[organization]_[artifact]-[revision].[ext]",
+"[organization]_[artifact]-[revision](-[classifier]).[ext]",
--- End diff --

Thanks! 
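
(For context: in an Ivy retrieve pattern, a parenthesized token like 
`(-[classifier])` is optional and expands only when the artifact carries a 
classifier, so hypothetical coordinates retrieve to distinct files:)

```
com.example_lib-1.0.jar        # no classifier
com.example_lib-1.0-tests.jar  # classifier "tests"
```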


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19884
  
**[Test build #85203 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85203/testReport)**
 for PR 19884 at commit 
[`715f83d`](https://github.com/apache/spark/commit/715f83dfb96823fc79bca0fcd904c1ddeddaf6d6).


---




[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19940
  
**[Test build #85202 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85202/testReport)**
 for PR 19940 at commit 
[`c650196`](https://github.com/apache/spark/commit/c650196a4c0c4391d3112678803bd601af9aa5fb).


---




[GitHub] spark issue #20037: [SPARK-22849] ivy.retrieve pattern should also consider ...

2017-12-20 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/20037
  
LGTM


---




[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...

2017-12-20 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/19876
  
@sethah So I'm hesitant to push an API without an implementation, since we 
want to make sure it's actually usable for our goal. But I'm fine with 
splitting it out into a separate PR.


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/19884
  
OK, I did some local testing with these changes and pyarrow 0.8.0, using 
different combinations of Python and pandas:

**python 3.6.3, pandas 0.19.2**

ERROR: test_createDataFrame_does_not_modify_input 
(pyspark.sql.tests.ArrowTests)
```
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us, 
tz=UTC] would lose data: 288001
```

**python 3.6.3, pandas 0.21.0**

All tests pass

**python 2.7.14, pandas 0.21.1**

All tests pass

**python 2.7.14, pandas 0.19.2**

ERROR: test_createDataFrame_does_not_modify_input 
(pyspark.sql.tests.ArrowTests)
```
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us, 
tz=UTC] would lose data: 288001
```
It seems like pandas 0.19.2 has a timestamp-precision issue; let's see 
whether it is reproduced in the Jenkins env here.

cc @wesm 
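
If it helps, the lossy cast itself can be exercised outside Spark (a minimal 
sketch, assuming pyarrow's default safe casting; untested against these exact 
versions):

```
import pandas as pd
import pyarrow as pa

# Nanosecond-precision timestamp; the sub-microsecond digits are exactly
# what a safe cast to microsecond precision would have to drop.
s = pd.Series(pd.to_datetime([1513393355123456789], unit='ns'))
arr = pa.Array.from_pandas(s)           # type: timestamp[ns]
arr.cast(pa.timestamp('us', tz='UTC'))  # raises pyarrow.lib.ArrowInvalid
```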


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19884
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85195/
Test FAILed.


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19884
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19884: [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19884
  
**[Test build #85195 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85195/testReport)**
 for PR 19884 at commit 
[`d92ae90`](https://github.com/apache/spark/commit/d92ae90e05f55955eaad8e7f55e6324bf333a6bc).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@srowen Thank you for the re-run; it now passes all tests.


---




[GitHub] spark issue #20033: [SPARK-22847] [CORE] Remove redundant code in AppStatusL...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20033
  
**[Test build #4017 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4017/testReport)**
 for PR 20033 at commit 
[`ace04a5`](https://github.com/apache/spark/commit/ace04a5c75a0dc46e0575677be6be77ab6b58895).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19035: [SPARK-21822][SQL]When insert Hive Table is finished, it...

2017-12-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19035
  
@figo77 Could you close it?


---




[GitHub] spark issue #19035: [SPARK-21822][SQL]When insert Hive Table is finished, it...

2017-12-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19035
  
I think the problem is gone, right? Could you check the latest version of the file:

 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20016
  
**[Test build #4019 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4019/testReport)**
 for PR 20016 at commit 
[`319e282`](https://github.com/apache/spark/commit/319e282223d85018d4efd48065b5dc54be1149e9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20036
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20036
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85196/
Test FAILed.


---




[GitHub] spark pull request #19725: [DO NOT REVIEW][SPARK-22042] [SQL] Insert shuffle...

2017-12-20 Thread tejasapatil
Github user tejasapatil closed the pull request at:

https://github.com/apache/spark/pull/19725


---




[GitHub] spark issue #20036: [SPARK-18016][SQL][FOLLOW-UP] Code Generation: Constant ...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20036
  
**[Test build #85196 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85196/testReport)**
 for PR 20036 at commit 
[`53661eb`](https://github.com/apache/spark/commit/53661eb72bba55376bc6112b51c25489522d309c).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20017: [SPARK-22832][ML] BisectingKMeans unpersist unuse...

2017-12-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20017


---




[GitHub] spark issue #20017: [SPARK-22832][ML] BisectingKMeans unpersist unused datas...

2017-12-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20017
  
Merged to master


---




[GitHub] spark issue #19978: [SPARK-22784][CORE][WIP] Configure reading buffer size i...

2017-12-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19978
  
@MikhailErofeev was the conclusion here that you just want to set 
`spark.taskMetrics.trackUpdatedBlockStatuses` in your case?
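
(For reference, a sketch of flipping that flag at session creation, assuming 
a standard `SparkSession` entry point:)

```
import org.apache.spark.sql.SparkSession

// Hypothetical setup: enable tracking of updated block statuses per task.
val spark = SparkSession.builder()
  .appName("block-status-tracking")
  .config("spark.taskMetrics.trackUpdatedBlockStatuses", "true")
  .getOrCreate()
```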


---




[GitHub] spark pull request #20012: [SPARK-22824] Restore old offset for binary compa...

2017-12-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20012


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19813
  
I found a few problems that this PR could ideally solve. If it is not 
available soon, I will use a workaround in upcoming PRs.


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20016
  
**[Test build #4019 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4019/testReport)**
 for PR 20016 at commit 
[`319e282`](https://github.com/apache/spark/commit/319e282223d85018d4efd48065b5dc54be1149e9).


---




[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...

2017-12-20 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19118
  
The change looks good to me, pending a test case; thanks for looking into 
this, @awarrior.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-12-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19813
  
I agree that plain string replacement is too dangerous, since a pattern like 
`a + 1` would also match inside `a + 10`.  
How about a contract enforced by adding assertions?
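
A tiny sketch of the hazard (hypothetical generated code):

```
// Plain textual substitution also matches prefixes of longer expressions:
val generated = "boolean r = a + 1 == a + 10;"
val rewritten = generated.replace("a + 1", "tmp")
// rewritten == "boolean r = tmp == tmp0;" -- the `a + 10` term is corrupted
```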


---




[GitHub] spark issue #20012: [SPARK-22824] Restore old offset for binary compatibilit...

2017-12-20 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/20012
  
LGTM. Thanks! Merging to master!


---




[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20018
  
Adding other contributors of the same file for review. cc:
@cloud-fan 
@aokolnychyi
@liancheng 
@HyukjinKwon


---




[GitHub] spark issue #20038: [SPARK-22836][ui] Show driver logs in UI when available.

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20038
  
**[Test build #85201 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85201/testReport)**
 for PR 20038 at commit 
[`52641db`](https://github.com/apache/spark/commit/52641db34bdb9fac5b5808fa2c6334837eeedbc3).


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157941778
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/HdfsTest.scala 
---
@@ -39,7 +39,7 @@ object HdfsTest {
   val start = System.currentTimeMillis()
   for (x <- mapped) { x + 2 }
   val end = System.currentTimeMillis()
-  println("Iteration " + iter + " took " + (end-start) + " ms")
+  println(s"Iteration ${iter} took ${(end-start)} ms")
--- End diff --

@HyukjinKwon `$end-start` won't work here: `end` and `start` are two separate 
variables, so only `end` would be interpolated and the subtraction needs the 
braces. I made the changes.
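
A quick REPL-style illustration with hypothetical values:

```
val start = 10L
val end = 25L
s"$end-start"      // "25-start": only `end` is interpolated
s"${end - start}"  // "15": the braces make the subtraction part of the interpolation
```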


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157941813
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala 
---
@@ -100,7 +100,7 @@ object SparkALS {
 ITERATIONS = iters.getOrElse("5").toInt
 slices = slices_.getOrElse("2").toInt
   case _ =>
-System.err.println("Usage: SparkALS [M] [U] [F] [iters] 
[partitions]")
+System.err.println(s"Usage: SparkALS [M] [U] [F] [iters] 
[partitions]")
--- End diff --

@HyukjinKwon Addressed! Kindly review.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157816044
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala 
---
@@ -95,7 +95,7 @@ object LocalALS {
 
   def showWarning() {
 System.err.println(
-  """WARN: This is a naive implementation of ALS and is given as an 
example!
+  s"""WARN: This is a naive implementation of ALS and is given as an 
example!
--- End diff --

@mgaido91 Thank you for the feedback; the changes are addressed.


---




[GitHub] spark pull request #20038: [SPARK-22836][ui] Show driver logs in UI when ava...

2017-12-20 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/20038

[SPARK-22836][ui] Show driver logs in UI when available.

Port code from the old executors listener to the new one, so that
the driver logs present in the application start event are kept.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-22836

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20038.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20038


commit 52641db34bdb9fac5b5808fa2c6334837eeedbc3
Author: Marcelo Vanzin 
Date:   2017-12-20T18:29:51Z

[SPARK-22836][ui] Show driver logs in UI when available.

Port code from the old executors listener to the new one, so that
the driver logs present in the application start event are kept.




---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157794109
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala ---
@@ -127,11 +125,11 @@ object DFSReadWriteTest {
 spark.stop()
 
 if (localWordCount == dfsWordCount) {
-  println(s"Success! Local Word Count ($localWordCount) " +
-s"and DFS Word Count ($dfsWordCount) agree.")
+  println(s"Success! Local Word Count ($localWordCount) 
+and DFS Word Count ($dfsWordCount) agree.")
--- End diff --

@srowen Thanks for the review; I've addressed the changes. Please take another look.
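
(Side note: a plain Scala string literal can't span source lines, so a 
multi-line interpolation needs either `+` concatenation, as before, or a 
triple-quoted string; a sketch reusing the diff's names:)

```
println(s"""Success! Local Word Count ($localWordCount)
           |and DFS Word Count ($dfsWordCount) agree.""".stripMargin)
```

Note that `stripMargin` keeps the line break in the output, unlike the 
concatenated single-line original.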


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157816026
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala ---
@@ -49,12 +49,10 @@ object DFSReadWriteTest {
   }
 
   private def printUsage(): Unit = {
-val usage: String = "DFS Read-Write Test\n" +
-"\n" +
-"Usage: localFile dfsDir\n" +
-"\n" +
-"localFile - (string) local file to use in test\n" +
-"dfsDir - (string) DFS directory for read/write tests\n"
+val usage = s"""DFS Read-Write Test 
--- End diff --

@mgaido91 Thank you for the feedback; the changes are addressed.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157794196
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/LocalFileLR.scala ---
@@ -58,10 +58,10 @@ object LocalFileLR {
 
 // Initialize w to a random value
 val w = DenseVector.fill(D) {2 * rand.nextDouble - 1}
-println("Initial w: " + w)
+println(s"Initial w: ${w}")
--- End diff --

@srowen Thanks for the review; I've addressed the changes. Please take another look.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157941808
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala 
---
@@ -80,7 +80,7 @@ object SparkALS {
 
   def showWarning() {
 System.err.println(
-  """WARN: This is a naive implementation of ALS and is given as an 
example!
+  s"""WARN: This is a naive implementation of ALS and is given as an 
example!
--- End diff --

@HyukjinKwon Addressed! Kindly review.


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@srowen Thanks for the response. Correct - I went through the Jenkins output 
(https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4018/console), 
found the error, fixed it accordingly, and committed. Please take a look; if 
it's still not correct, please suggest changes.
Thank you


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158100765
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

Why do you turn the example listing off and then on again? Just remove those 
two lines.


---




[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20018
  
@srowen Can you please review, and if everything looks correct, trigger a 
test build?


---



