[GitHub] spark issue #20571: [SPARK-23383][Build][Minor]Make a distribution should ex...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20571
  
**[Test build #87539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87539/testReport)** for PR 20571 at commit [`b99c43c`](https://github.com/apache/spark/commit/b99c43c9d89b10ba34ae2bd3e31831abd6cd2513).


---




[GitHub] spark issue #20571: [SPARK-23383][Build][Minor]Make a distribution should ex...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20571
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20571: [SPARK-23383][Build][Minor]Make a distribution should ex...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20571
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/957/
Test PASSed.


---




[GitHub] spark issue #20571: [SPARK-23383][Build][Minor]Make a distribution should ex...

2018-02-18 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/20571
  
OK, I will push a commit very soon.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-18 Thread sujith71955
Github user sujith71955 commented on the issue:

https://github.com/apache/spark/pull/20611
  


![image](https://user-images.githubusercontent.com/12999161/36362491-5a8437e4-155b-11e8-80fd-885a1ebf045c.png)
This seems to be an invalid (unrelated) test failure.


---




[GitHub] spark issue #20635: [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization ...

2018-02-18 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/20635
  
Jenkins, Ok to test


---




[GitHub] spark issue #20635: [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization ...

2018-02-18 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/20635
  
lgtm assuming tests pass


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
Kubernetes integration test status success
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/950/



---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87538/
Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
**[Test build #87538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87538/testReport)** for PR 20553 at commit [`8b711ce`](https://github.com/apache/spark/commit/8b711ce9d5a16992fbf1d4b3cb096f2d35f72559).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
Kubernetes integration test starting
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/950/



---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/956/
Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20553
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20553: [SPARK-23285][K8S] Add a config property for specifying ...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20553
  
**[Test build #87538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87538/testReport)** for PR 20553 at commit [`8b711ce`](https://github.com/apache/spark/commit/8b711ce9d5a16992fbf1d4b3cb096f2d35f72559).


---




[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...

2018-02-18 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20553#discussion_r168972337
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
 ---
@@ -85,6 +85,12 @@ private[spark] object Config extends Logging {
   .stringConf
   .createOptional
 
+  val KUBERNETES_EXECUTOR_CORES =
+ConfigBuilder("spark.kubernetes.executor.cores")
+  .doc("Specify the CPU core request for each executor pod")
--- End diff --

Done.


---




[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...

2018-02-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20057#discussion_r168971182
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/jdbc/DB2Dialect.scala ---
@@ -49,4 +49,17 @@ private object DB2Dialect extends JdbcDialect {
   }
 
   override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
+
+  /**
+   * The SQL query used to truncate a table.
+   * @param table The table to truncate.
+   * @param cascade Whether or not to cascade the truncation. Default 
value is the
+   *value of isCascadingTruncateTable(). Ignored for DB2 
as it is unsupported.
+   * @return The SQL query to use for truncating a table
+   */
+  override def getTruncateQuery(
+  table: String,
+  cascade: Option[Boolean] = isCascadingTruncateTable): String = {
+s"TRUNCATE TABLE $table"
+  }
--- End diff --

Since this is the same code as in JdbcDialect, it seems we don't need to add it here. This applies not only to this DB2Dialect, but also to DerbyDialect/MsSQLDialect/MySQLDialect.
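For illustration, a minimal sketch (hypothetical names, not the actual org.apache.spark.sql.jdbc classes) of the point above: a dialect without special TRUNCATE syntax can rely on the inherited default and only override the cascading flag.

```scala
// Simplified, hypothetical shapes to illustrate the review comment above.
abstract class BaseDialect {
  def isCascadingTruncateTable(): Option[Boolean] = None

  // Default truncate statement; dialects only override this when their syntax differs.
  def getTruncateQuery(
      table: String,
      cascade: Option[Boolean] = isCascadingTruncateTable()): String =
    s"TRUNCATE TABLE $table"
}

// A DB2-like dialect only needs to state that cascading is unsupported;
// the inherited getTruncateQuery already produces s"TRUNCATE TABLE $table".
object DB2LikeDialect extends BaseDialect {
  override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
}
```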


---




[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...

2018-02-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20057#discussion_r168971074
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/jdbc/TeradataDialect.scala ---
@@ -31,4 +31,19 @@ private case object TeradataDialect extends JdbcDialect {
 case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR))
 case _ => None
   }
+
+  override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
+
+  /**
+   * The SQL query used to truncate a table.
+   * @param table The table to truncate.
+   * @param cascade Whether or not to cascade the truncation. Default 
value is the
+   *value of isCascadingTruncateTable(). Ignored for 
Teradata as it is unsupported
+   * @return The SQL query to use for truncating a table
+   */
+  override def getTruncateQuery(
+  table: String,
+  cascade: Option[Boolean] = isCascadingTruncateTable): String = {
+s"TRUNCATE TABLE $table"
--- End diff --

Currently, `TeradataDialect` seems to generate unsupported syntax because it inherits it from `JdbcDialects`. Before doing this PR, we had better check this.

Pinging @klinvill and @gatorsmile, the original author and committer of this code. Could you give us some comments on this?


---




[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...

2018-02-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20057#discussion_r168970903
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/jdbc/TeradataDialect.scala ---
@@ -31,4 +31,19 @@ private case object TeradataDialect extends JdbcDialect {
 case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR))
 case _ => None
   }
+
+  override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
+
+  /**
+   * The SQL query used to truncate a table.
+   * @param table The table to truncate.
+   * @param cascade Whether or not to cascade the truncation. Default 
value is the
+   *value of isCascadingTruncateTable(). Ignored for 
Teradata as it is unsupported
+   * @return The SQL query to use for truncating a table
+   */
+  override def getTruncateQuery(
+  table: String,
+  cascade: Option[Boolean] = isCascadingTruncateTable): String = {
+s"TRUNCATE TABLE $table"
--- End diff --

At the beginning, I saw that `teradata` was missing from your test case, so I asked you to add it.

But today I double-checked this in order to review this line. Unfortunately, the `TRUNCATE` syntax is not documented in the Teradata documentation, and it is not supported according to some old community articles.

- Teradata SQL Reference
  - 
https://info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/SQL_Reference%2FB035-1146-160K%2Fpds1472240516459.html%23
- Community (DELETE/TRUNCATE COMMAND)
  - 
https://community.teradata.com/t5/Database/Delete-truncate-command/td-p/434


---




[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...

2018-02-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20057#discussion_r168969792
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 ---
@@ -102,7 +102,12 @@ object JdbcUtils extends Logging {
 val dialect = JdbcDialects.get(options.url)
 val statement = conn.createStatement
 try {
-  statement.executeUpdate(dialect.getTruncateQuery(options.table))
+  if (options.isCascadeTruncate.isDefined) {
+statement.executeUpdate(dialect.getTruncateQuery(options.table,
+  options.isCascadeTruncate))
+  } else {
+statement.executeUpdate(dialect.getTruncateQuery(options.table))
+  }
--- End diff --

What about the following?
```scala
val truncateQuery = if (options.isCascadeTruncate.isDefined) {
  dialect.getTruncateQuery(options.table, options.isCascadeTruncate)
} else {
  dialect.getTruncateQuery(options.table)
}
statement.executeUpdate(truncateQuery)
```


---




[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...

2018-02-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20057#discussion_r168969101
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala ---
@@ -85,15 +85,24 @@ private object PostgresDialect extends JdbcDialect {
 s"SELECT 1 FROM $table LIMIT 1"
   }
 
+  override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
+
   /**
-  * The SQL query used to truncate a table. For Postgres, the default 
behaviour is to
-  * also truncate any descendant tables. As this is a (possibly unwanted) 
side-effect,
-  * the Postgres dialect adds 'ONLY' to truncate only the table in question
-  * @param table The name of the table.
-  * @return The SQL query to use for truncating a table
-  */
-  override def getTruncateQuery(table: String): String = {
-s"TRUNCATE TABLE ONLY $table"
+   * The SQL query used to truncate a table. For Postgres, the default 
behaviour is to
+   * also truncate any descendant tables. As this is a (possibly unwanted) 
side-effect,
+   * the Postgres dialect adds 'ONLY' to truncate only the table in 
question
+   * @param table The table to truncate
+   * @param cascade Whether or not to cascade the truncation. Default 
value is the value of
+   *isCascadingTruncateTable()
+   * @return The SQL query to use for truncating a table
+   */
+  override def getTruncateQuery(
+  table: String,
+  cascade: Option[Boolean] = isCascadingTruncateTable): String = {
+cascade match {
+  case Some(true) => s"TRUNCATE TABLE ONLY $table CASCADE"
--- End diff --

Yep. Thank you for testing. Could you add more description about this into 
the parameter description of `cascade`?
```
   * @param cascade Whether or not to cascade the truncation. Default value 
is the value of
   *isCascadingTruncateTable()
```


---




[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-02-18 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20629
  
Yes, I agree with you @MLnick. I'd also like to ping @jkbradley and @hhbyyh, who drove the PR which introduced `computeCost`.

For moving the same metric to `ClusteringEvaluator`, I think what is done in the ongoing PR #20600, where `DistanceMeasure` is moved to its own file and a `cost` method is added to it, is very useful. So maybe the metric can be added there after that PR. What do you think?

PS: if you agree, could you please also help review that PR?

Thanks.


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168961174
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -640,4 +740,55 @@ private object RandomForestSuite {
 val (indices, values) = map.toSeq.sortBy(_._1).unzip
 Vectors.sparse(size, indices.toArray, values.toArray)
   }
+
+  /** Generate a label. */
+  private def generateLabel(rnd: Random, numClasses: Int): Double = {
+rnd.nextInt(numClasses)
+  }
+
+  /** Generate a numeric value in the range [numericMin, numericMax]. */
+  private def generateNumericValue(rnd: Random, numericMin: Double, 
numericMax: Double) : Double = {
+rnd.nextDouble() * (numericMax- numericMin) + numericMin
+  }
+
+  /** Generate a binary value. */
+  private def generateBinaryValue(rnd: Random) : Double = if 
(rnd.nextBoolean()) 1 else 0
--- End diff --

I have removed _generateBinaryValue_ and used _nextInt_ in place of 
"_nextBoolean + if_".

I have also in-lined the label generation as it was a similar case.

However, I would keep _generateNumericValue_ as a separate method, as I am afraid inlining it would harm readability.


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168961183
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -640,4 +740,55 @@ private object RandomForestSuite {
 val (indices, values) = map.toSeq.sortBy(_._1).unzip
 Vectors.sparse(size, indices.toArray, values.toArray)
   }
+
+  /** Generate a label. */
+  private def generateLabel(rnd: Random, numClasses: Int): Double = {
+rnd.nextInt(numClasses)
+  }
+
+  /** Generate a numeric value in the range [numericMin, numericMax]. */
+  private def generateNumericValue(rnd: Random, numericMin: Double, 
numericMax: Double) : Double = {
+rnd.nextDouble() * (numericMax- numericMin) + numericMin
+  }
+
+  /** Generate a binary value. */
+  private def generateBinaryValue(rnd: Random) : Double = if 
(rnd.nextBoolean()) 1 else 0
+
+  /** Generate an array of binary values of length numBinary. */
+  private def generateBinaryArray(rnd: Random, numBinary: Int): 
Array[Double] = {
+Range.apply(0, numBinary).map(_ => generateBinaryValue(rnd)).toArray
--- End diff --

I have updated the generation of both binary and numeric values accordingly; it is indeed simpler this way.


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168961088
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -303,26 +303,6 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(split.threshold < 2020)
   }
 
-  test("Multiclass classification stump with 10-ary (ordered) categorical 
features") {
--- End diff --

I am sorry, two out of three of the adapted tests were erroneously not included in the PR despite being referred to in the description; my bad for not realising it until now.

I will describe what I did for those tests that I have added to my latest 
commit:

- "mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala"
  1. "Multiclass classification stump with 10-ary (ordered) categorical 
features"
  2. "Multiclass classification tree with 10-ary (ordered) categorical 
features with just enough bins"

I had to "adapt" the tests because they were checking properties related to the "learning phase" on the DecisionTree itself, but on nodes that are no longer there (in those test trees) due to the pruning.

So, mimicking what is done extensively in the existing test suite, I have extracted the needed information using the "internal" methods of RandomForest (e.g., _RandomForest.findBestSplits_) and checked it directly.

So the tests have been moved to ".../ml/tree/impl/RandomForestSuite.scala" 
and adapted to call the internal methods of the "ml" version of 
RandomForest/DecisionTree.

Internally, the _mllib_ version of _DecisionTree_ calls the _ml_ version anyway. By looking at several other tests in ".../ml/tree/impl/RandomForestSuite.scala" (the "Avoid aggregation on the last level" test, for instance), I noticed that they were ports as well, so I have followed their style.

To double-check the adaptation I did a debugging pass for comparing the 
values of the relevant objects in the two cases, and they were identical. 

The only removed test was "Use soft prediction for binary classification 
with ordered categorical features" 
in 
"mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala".

This test was a duplicate of the test with the same name already located in ".../ml/tree/impl/RandomForestSuite.scala". Given that _DecisionTree_ is only a wrapper for _RandomForest_ with _numTree=1_, they were doing exactly the same thing (modulo the call inside the wrapper), so the coverage of the test cases is left untouched by its removal.

"Use soft prediction for binary classification with ordered categorical 
features" in ".../ml/tree/impl/RandomForestSuite.scala" had to be adapted too 
for the same reasons (it was checking information in nodes that are now pruned).

PS: I didn't check extensively as it was out of scope for this PR, but I remember that this was not the only case; there might be other redundant tests. 
If you think it could be a good idea to remove them, I can re-check and open a 
JIRA ticket with the gathered information.



---




[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-02-18 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/20629
  
Sorry, I meant putting the metric in the evaluator and then also deprecating computeCost.



---




[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-02-18 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/20629
  
Right - so while it's perhaps a lower-quality metric, it is different. So I wonder if deprecation is the right approach (vs. say putting the within-cluster sum of squares into ClusteringEvaluator).




---




[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-02-18 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20629
  
Thanks for taking a look at this @MLnick. No, it doesn't, in the sense that it returns a different result: this is the sum of the squared Euclidean distances between each point and the centroid of the cluster it is assigned to, while the silhouette metric is the average of the silhouette coefficients. So they are completely different formulas.

The semantics are a bit different too. Silhouette measures both cohesion and separation of the clusters, while `computeCost` as it is measures only cohesion.

Nonetheless, of course both of them can be used to evaluate the result of a clustering algorithm, even though the silhouette is much better for this purpose.
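For reference, a rough sketch (assuming an existing `dataset` with a "features" vector column) contrasting the two APIs being discussed:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// `dataset` is assumed to be an existing DataFrame with a "features" column.
val model = new KMeans().setK(3).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// Sum of squared distances from each point to its assigned centroid (cohesion only).
val wssse = model.computeCost(dataset)

// Mean silhouette coefficient, which also accounts for separation between clusters.
val silhouette = new ClusteringEvaluator().evaluate(predictions)
```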


---




[GitHub] spark issue #20571: [SPARK-23383][Build][Minor]Make a distribution should ex...

2018-02-18 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20571
  
Yes @yaooqinn, it's easy enough to read the logic, but my question is _why_ treat these cases differently? Why not go all the way to failing on any unexpected input?


---




[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

2018-02-18 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/20629
  
Just want to check - does `computeCost` do the same thing as the silhouette 
metric?


---




[GitHub] spark issue #20571: [SPARK-23383][Build][Minor]Make a distribution should ex...

2018-02-18 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/20571
  
Sorry for not replying in time. The logic here is that, firstly, unrecognized `--` options show a warning message and usage information; secondly, `-` options end the while loop and treat the current and following options as Maven parameters, which will be handled later by Maven; and finally, all other unrecognized options fail the process immediately.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87537/
Test PASSed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20619
  
**[Test build #87537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87537/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20611
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87536/
Test FAILed.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20611
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20611
  
**[Test build #87536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87536/testReport)** for PR 20611 at commit [`51a0415`](https://github.com/apache/spark/commit/51a0415decd634dbc7d338886ee93ddabb629909).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168956209
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -640,4 +740,55 @@ private object RandomForestSuite {
 val (indices, values) = map.toSeq.sortBy(_._1).unzip
 Vectors.sparse(size, indices.toArray, values.toArray)
   }
+
+  /** Generate a label. */
+  private def generateLabel(rnd: Random, numClasses: Int): Double = {
+rnd.nextInt(numClasses)
+  }
+
+  /** Generate a numeric value in the range [numericMin, numericMax]. */
+  private def generateNumericValue(rnd: Random, numericMin: Double, 
numericMax: Double) : Double = {
+rnd.nextDouble() * (numericMax- numericMin) + numericMin
+  }
+
+  /** Generate a binary value. */
+  private def generateBinaryValue(rnd: Random) : Double = if 
(rnd.nextBoolean()) 1 else 0
+
+  /** Generate an array of binary values of length numBinary. */
+  private def generateBinaryArray(rnd: Random, numBinary: Int): 
Array[Double] = {
+Range.apply(0, numBinary).map(_ => generateBinaryValue(rnd)).toArray
--- End diff --

`Array.fill(numBinary)(generateBinaryValue(rnd))` is simpler


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168956144
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -640,4 +740,55 @@ private object RandomForestSuite {
 val (indices, values) = map.toSeq.sortBy(_._1).unzip
 Vectors.sparse(size, indices.toArray, values.toArray)
   }
+
+  /** Generate a label. */
+  private def generateLabel(rnd: Random, numClasses: Int): Double = {
+rnd.nextInt(numClasses)
+  }
+
+  /** Generate a numeric value in the range [numericMin, numericMax]. */
+  private def generateNumericValue(rnd: Random, numericMin: Double, 
numericMax: Double) : Double = {
+rnd.nextDouble() * (numericMax- numericMin) + numericMin
+  }
+
+  /** Generate a binary value. */
+  private def generateBinaryValue(rnd: Random) : Double = if 
(rnd.nextBoolean()) 1 else 0
--- End diff --

These 1-line single-use methods strike me as overhead you don't need; I 
don't feel strongly about it but inlining it would be at least as readable 
IMHO. This could also be `rnd.nextInt(2)`


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168956108
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -303,26 +303,6 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(split.threshold < 2020)
   }
 
-  test("Multiclass classification stump with 10-ary (ordered) categorical 
features") {
--- End diff --

Can these tests be adapted, or are they really irrelevant? I want to make sure this isn't leaving some cases uncovered that tests covered before, if possible.


---




[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...

2018-02-18 Thread ivoson
Github user ivoson commented on the issue:

https://github.com/apache/spark/pull/20244
  
thank you for reviewing this @squito 


---




[GitHub] spark issue #20635: [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization ...

2018-02-18 Thread ivoson
Github user ivoson commented on the issue:

https://github.com/apache/spark/pull/20635
  
cc @squito 


---




[GitHub] spark issue #20635: [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization ...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20635
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #20635: [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization ...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20635
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #20635: [SPARK-23053][CORE][BRANCH-2.1] taskBinarySeriali...

2018-02-18 Thread ivoson
GitHub user ivoson opened a pull request:

https://github.com/apache/spark/pull/20635

[SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization and task partitions 
calculate in DagScheduler.submitMissingTasks should keep the same RDD 
checkpoint status

## What changes were proposed in this pull request?
This PR backports [#20244](https://github.com/apache/spark/pull/20244).

Suppose we run concurrent jobs using the same RDD that is marked for checkpointing. If one job has finished and the RDD.doCheckpoint process has started while another job is submitted, then submitStage and submitMissingTasks will be called. [submitMissingTasks](https://github.com/apache/spark/blob/branch-2.1/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L932) serializes taskBinaryBytes and calculates the task partitions, both of which are affected by the checkpoint status. If the former is computed before doCheckpoint has finished while the latter is computed after it has finished, then when the task runs and rdd.compute is called, RDDs with a particular partition type, such as [UnionRDD](https://github.com/apache/spark/blob/branch-2.1/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala), which cast the partition type, will get a ClassCastException because the partition is actually a CheckpointRDDPartition.
This error occurs because rdd.doCheckpoint runs in the same thread that called sc.runJob, while the task serialization happens in the DAGScheduler's event loop.
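A rough sketch (assumed driver code, not part of this PR) of the usage pattern that can trigger the race:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// `sc` is an existing SparkContext (assumed).
sc.setCheckpointDir("/tmp/checkpoints")
val rdd = sc.parallelize(1 to 1000).union(sc.parallelize(1 to 1000))
rdd.checkpoint()

// Job 1: when it finishes, its calling thread runs rdd.doCheckpoint() and rewrites the lineage.
val job1 = Future { rdd.count() }
// Job 2: may be submitted while that checkpoint is in flight, so taskBinary serialization and
// partition calculation in submitMissingTasks can observe different checkpoint states.
val job2 = Future { rdd.collect() }
```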

## How was this patch tested?
The existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ivoson/spark branch-2.1-23053

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20635.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20635


commit bd88903aca841e6ad55144127e4c11e9844ef6ce
Author: huangtengfei 
Date:   2018-02-13T15:59:21Z

[SPARK-23053][CORE] taskBinarySerialization and task partitions calculate 
in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

When we run concurrent jobs using the same rdd which is marked to do 
checkpoint. If one job has finished running the job, and start the process of 
RDD.doCheckpoint, while another job is submitted, then submitStage and 
submitMissingTasks will be called. In 
[submitMissingTasks](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961),
 will serialize taskBinaryBytes and calculate task partitions which are both 
affected by the status of checkpoint, if the former is calculated before 
doCheckpoint finished, while the latter is calculated after doCheckpoint 
finished, when run task, rdd.compute will be called, for some rdds with 
particular partition type such as 
[UnionRDD](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala)
 who will do partition type cast, will get a ClassCastException because the 
part params is actually a CheckpointRDDPartition.
This error occurs  because rdd.doCheckpoint occurs in the same thread that 
called sc.runJob, while the task serialization occurs in the DAGSchedulers 
event loop.

The existing UTs, plus a test case added in DAGSchedulerSuite to show the exception case.

Author: huangtengfei 

Closes #20244 from ivoson/branch-taskpart-mistype.

Change-Id: I634009d51ae40336e9d0717d061213ff7e36e71f




---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20619
  
**[Test build #87537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87537/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369).


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/955/
Test PASSed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20619
  
retest this please.


---




[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20611
  
**[Test build #87536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87536/testReport)** for PR 20611 at commit [`51a0415`](https://github.com/apache/spark/commit/51a0415decd634dbc7d338886ee93ddabb629909).


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168946503
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -640,4 +689,96 @@ private object RandomForestSuite {
 val (indices, values) = map.toSeq.sortBy(_._1).unzip
 Vectors.sparse(size, indices.toArray, values.toArray)
   }
+
+  /** Generate a label. */
+  private def generateLabel(rnd: Random, numClasses: Int): Double = {
+rnd.nextInt(numClasses)
+  }
+
+  /** Generate a numeric value in the range [numericMin, numericMax]. */
+  private def generateNumericValue(rnd: Random, numericMin: Double, 
numericMax: Double) : Double = {
+rnd.nextDouble() * (Math.abs(numericMax) + Math.abs(numericMin)) + 
numericMin
+  }
+
+  /** Generate a binary value. */
+  private def generateBinaryValue(rnd: Random) : Double = if 
(rnd.nextBoolean()) 1 else 0
+
+  /** Generate an array of binary values of length numBinary. */
+  private def generateBinaryArray(rnd: Random, numBinary: Int): 
Array[Double] = {
+Range.apply(0, numBinary).map(_ => generateBinaryValue(rnd)).toArray
+  }
+
+  /** Generate an array of binary values of length numNumeric in the range
+* [numericMin, numericMax]. */
+  private def generateNumericArray(rnd: Random,
+   numNumeric: Int,
+   numericMin: Double,
+   numericMax: Double) : Array[Double] = {
+Range.apply(0, numNumeric).map(_ => generateNumericValue(rnd, 
numericMin, numericMax)).toArray
+  }
+
+  /** Generate a LabeledPoint with numNumeric numeric values followed by 
numBinary binary values. */
+  private def generatePoint(rnd: Random,
+numClasses: Int,
+numNumeric: Int = 10,
+numericMin: Double = -100,
+numericMax: Double = 100,
+numBinary: Int = 10): LabeledPoint = {
+val label = generateLabel(rnd, numClasses)
+val numericArray = generateNumericArray(rnd, numNumeric, numericMin, 
numericMax)
+val binaryArray = generateBinaryArray(rnd, numBinary)
+val vector = Vectors.dense(numericArray ++ binaryArray)
+
+new LabeledPoint(label, vector)
+  }
+
+  /** Data for tree redundancy tests which produces a non-trivial tree. */
+  private def generateRedundantPoints(sc: SparkContext,
+  numClasses: Int,
+  numPoints: Int,
+  rnd: Random): RDD[LabeledPoint] = {
+sc.parallelize(Range.apply(1, numPoints)
+  .map(_ => generatePoint(rnd, numClasses = numClasses, numNumeric = 
0, numBinary = 3)))
+  }
+
+  /**
+* Returns true iff the decision tree has at least one subtree that can 
be pruned
+* (i.e., all its leaves share the same prediction).
+* @param tree the tree to be tested
+* @return true iff the decision tree has at least one subtree that can 
be pruned.
+*/
+  private def isRedundant(tree: DecisionTreeModel): Boolean = 
_isRedundant(tree.rootNode)
+
+  /**
+* Returns true iff the node has at least one subtree that can be pruned
+* (i.e., all its leaves share the same prediction).
+* @param n the node to be tested
+* @return true iff the node has at least one subtree that can be 
pruned.
+*/
+  private def _isRedundant(n: Node): Boolean = n match {
+case n: InternalNode =>
+  _isRedundant(n.leftChild) || _isRedundant(n.leftChild) || 
canBePruned(n)
+case _ => false
+  }
+
+  /**
+* Returns true iff the subtree rooted at the given node can be pruned
+* (i.e., all its leaves share the same prediction).
+* @param n the node to be tested
+* @return returns true iff the subtree rooted at the given node can be 
pruned.
+*/
+  private def canBePruned(n: Node): Boolean = n match {
+case n: InternalNode =>
+  (leafPredictions(n.leftChild) ++ leafPredictions(n.rightChild)).size 
== 1
+case _ => false
+  }
+
+  /**
+* Given a node, the method returns the set of predictions appearing in 
the subtree rooted at it.
+* @return the set of predictions appearing in the subtree rooted at 
the given node.
+*/
+  private def leafPredictions(n: Node): Set[Double] = n match {
--- End diff --

I have updated the code to explicitly test against the expected tree structure.
I have removed all the auxiliary methods; it actually saved a lot of space and the tests are more explicit.

---

[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168946478
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
@@ -287,6 +292,41 @@ private[tree] class LearningNode(
 }
   }
 
+  /**
+   * Method testing whether a node is a leaf.
+   * @return true iff a node is a leaf.
+   */
+  private def isLeafNode(): Boolean = leftChild.isEmpty && 
rightChild.isEmpty
+
+  /** True iff the node should be a leaf. */
+  private lazy val shouldBeLeaf: Boolean = leafPredictions.size == 1
+
+  /**
+   * Returns the set of (leaf) predictions appearing in the subtree rooted 
at the considered node.
+   * @return the set of (leaf) predictions appearing in the subtree rooted 
at the given node.
+   */
+  private def leafPredictions: Set[Double] = {
--- End diff --

By construction we need the distinct predictions, as "redundancy" is non-monotonic: it can hold on both subtrees yet fail on their parent. The left and right subtrees of a node can each be redundant (the left predicts only 0, the right only 1) while the node itself is not (it can predict 0 or 1, depending on the split). This can be ascertained only by comparing the prediction values; you cannot derive it from the two subtrees being redundant alone (as they can be redundant w.r.t. different values).

Consider this example:
1
--1.1
1.1.1 (pred 0)
1.1.2 (pred 0)
--1.2
1.2.1 (pred 1)
1.2.2 (pred 1)

This can be pruned to 
1
--1.1 (pred 0)
--1.2 (pred 1)

As 1.1 and 1.2 are both redundant, but 1 is not (and to verify this, you 
need their prediction values).

It is a different story if we go for storing "isRedundant" (the shouldBeLeaf of the original PR), which can instead be safely used to decide whether to prune.

As you pointed out, this extra information is stored only in LearningNode, so I think it doesn't make a significant difference whether we "store" the set of predictions or shouldBeLeaf, as these objects will be disposed of anyway at the end of the translation; what matters is to save computation time and avoid multiple explorations (as you originally suggested).

I would summarise it as follows, we either:
1) store the set of predictions, or
2) store shouldBeLeaf.

(1) requires more memory but doesn't introduce extra variables (less code to understand); (2) has the dual pros and cons.

Does this make sense to you? Which option would you suggest?
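To make the distinction concrete, a small standalone sketch (not the PR's code) of option (1), where the distinct leaf predictions drive the pruning decision:

```scala
sealed trait SimpleNode
case class Leaf(prediction: Double) extends SimpleNode
case class Internal(left: SimpleNode, right: SimpleNode) extends SimpleNode

// Distinct predictions reachable below a node; a subtree is "redundant" iff this set has size 1.
def leafPredictions(n: SimpleNode): Set[Double] = n match {
  case Leaf(p)        => Set(p)
  case Internal(l, r) => leafPredictions(l) ++ leafPredictions(r)
}

// Bottom-up pruning: both children may be redundant w.r.t. different values,
// so the parent can be collapsed only when the union of their predictions is a singleton.
def prune(n: SimpleNode): SimpleNode = n match {
  case Internal(l, r) =>
    val (pl, pr) = (prune(l), prune(r))
    val preds = leafPredictions(pl) ++ leafPredictions(pr)
    if (preds.size == 1) Leaf(preds.head) else Internal(pl, pr)
  case leaf => leaf
}
```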


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87535/
Test FAILed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20619
  
**[Test build #87535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87535/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

2018-02-18 Thread asolimando
Github user asolimando commented on a diff in the pull request:

https://github.com/apache/spark/pull/20632#discussion_r168946358
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
@@ -287,6 +292,41 @@ private[tree] class LearningNode(
 }
   }
 
+  /**
+   * Method testing whether a node is a leaf.
+   * @return true iff a node is a leaf.
+   */
+  private def isLeafNode(): Boolean = leftChild.isEmpty && 
rightChild.isEmpty
+
+  /** True iff the node should be a leaf. */
+  private lazy val shouldBeLeaf: Boolean = leafPredictions.size == 1
+
+  /**
+   * Returns the set of (leaf) predictions appearing in the subtree rooted 
at the considered node.
+   * @return the set of (leaf) predictions appearing in the subtree rooted 
at the given node.
+   */
+  private def leafPredictions: Set[Double] = {
+
+val predBuffer = new scala.collection.mutable.HashSet[Double]
+
+// collect the (leaf) predictions in the left subtree, if any
+if (leftChild.isDefined) {
--- End diff --

Ok, now I get what you mean. In this case though, we keep the if to apply 
"short-circuiting" as suggested below.


---




[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...

2018-02-18 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20621#discussion_r168945339
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -407,6 +407,34 @@ object PartitioningUtils {
   Literal(bigDecimal)
 }
 
+val dateTry = Try {
+  // try and parse the date, if no exception occurs this is a 
candidate to be resolved as
+  // DateType
+  DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+  // SPARK-23436: Casting the string to date may still return null if 
a bad Date is provided.
+  // This can happen since DateFormat.parse  may not use the entire 
text of the given string:
+  // so if there are extra-characters after the date, it returns 
correctly.
+  // We need to check that we can cast the raw string since we later 
can use Cast to get
+  // the partition values with the right DataType (see
+  // 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+  val dateValue = Cast(Literal(raw), DateType).eval()
+  // Disallow DateType if the cast returned null
+  require(dateValue != null)
+  Literal.create(dateValue, DateType)
+}
+
+val timestampTry = Try {
+  val unescapedRaw = unescapePathName(raw)
+  // try and parse the date, if no exception occurs this is a 
candidate to be resolved as
+  // TimestampType
+  
DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
--- End diff --

Yes, you are right. The only change introduced is that some values which were previously wrongly inferred as dates will now be inferred as strings. Everything else works as before.
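A tiny illustration (hypothetical values, using plain `java.text` rather than Spark's `DateTimeUtils` helpers) of why parsing alone is not enough and the extra cast check is needed:

```scala
import java.text.SimpleDateFormat

val fmt = new SimpleDateFormat("yyyy-MM-dd")
// parse(String) only consumes as much of the input as it can interpret, so a partition
// value with trailing characters still "parses" even though it is not a valid date.
println(fmt.parse("2018-02-18invalid"))  // succeeds, returning Feb 18 2018
// The additional check via Cast(Literal(raw), DateType) returning null is what rejects
// such values, so they fall back to being inferred as strings.
```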


---




[GitHub] spark pull request #20553: [SPARK-23285][K8S] Add a config property for spec...

2018-02-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/20553#discussion_r168943784
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
 ---
@@ -85,6 +85,12 @@ private[spark] object Config extends Logging {
   .stringConf
   .createOptional
 
+  val KUBERNETES_EXECUTOR_CORES =
+ConfigBuilder("spark.kubernetes.executor.cores")
+  .doc("Specify the CPU core request for each executor pod")
--- End diff --

Nit: the other configs use lower-case "cpu" in their docs, and don't say "core".


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20619
  
**[Test build #87535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87535/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369).


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/954/
Test PASSed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...

2018-02-18 Thread danielvdende
Github user danielvdende commented on a diff in the pull request:

https://github.com/apache/spark/pull/20057#discussion_r168943159
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala ---
@@ -85,15 +85,24 @@ private object PostgresDialect extends JdbcDialect {
 s"SELECT 1 FROM $table LIMIT 1"
   }
 
+  override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
+
   /**
-  * The SQL query used to truncate a table. For Postgres, the default 
behaviour is to
-  * also truncate any descendant tables. As this is a (possibly unwanted) 
side-effect,
-  * the Postgres dialect adds 'ONLY' to truncate only the table in question
-  * @param table The name of the table.
-  * @return The SQL query to use for truncating a table
-  */
-  override def getTruncateQuery(table: String): String = {
-s"TRUNCATE TABLE ONLY $table"
+   * The SQL query used to truncate a table. For Postgres, the default 
behaviour is to
+   * also truncate any descendant tables. As this is a (possibly unwanted) 
side-effect,
+   * the Postgres dialect adds 'ONLY' to truncate only the table in 
question
+   * @param table The table to truncate
+   * @param cascade Whether or not to cascade the truncation. Default 
value is the value of
+   *isCascadingTruncateTable()
+   * @return The SQL query to use for truncating a table
+   */
+  override def getTruncateQuery(
+  table: String,
+  cascade: Option[Boolean] = isCascadingTruncateTable): String = {
+cascade match {
+  case Some(true) => s"TRUNCATE TABLE ONLY $table CASCADE"
--- End diff --

Sure, I made a quick example; as you can see, using `TRUNCATE TABLE ONLY $table CASCADE` will truncate the foreign-keyed table but leave the inheritance relationship intact:
```
postgres=# CREATE SCHEMA test;
CREATE SCHEMA
postgres=# CREATE TABLE parent(a INT);
CREATE TABLE
postgres=# ALTER TABLE parent ADD CONSTRAINT some_constraint PRIMARY KEY(a);
ALTER TABLE
postgres=# CREATE TABLE child(b INT) INHERITS (parent);
CREATE TABLE
postgres=# CREATE TABLE forkey(c INT);
CREATE TABLE
postgres=# ALTER TABLE forkey ADD FOREIGN KEY(c) REFERENCES parent(a);
ALTER TABLE

postgres=# INSERT INTO parent VALUES(1);
INSERT 0 1
postgres=# select * from parent;
 a
---
 1
(1 row)

postgres=# select * from child;
 a | b
---+---
(0 rows)

postgres=# INSERT INTO child VALUES(2);
INSERT 0 1
postgres=# select * from child;
 a | b
---+---
 2 |
(1 row)

postgres=# select * from parent;
 a
---
 1
 2
(2 rows)

postgres=# INSERT INTO forkey VALUES(1);
INSERT 0 1
postgres=# select * from forkey;
 c
---
 1
(1 row)

postgres=# TRUNCATE TABLE ONLY parent CASCADE;
NOTICE:  truncate cascades to table "forkey"
TRUNCATE TABLE
postgres=# select * from parent;
 a
---
 2
(1 row)

postgres=# select * from child;
 a | b
---+---
 2 |
(1 row)

postgres=# select * from forkey;
 c
---
(0 rows)
```


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/20619
  
retest this please


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87534/
Test FAILed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20619
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...

2018-02-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20619
  
**[Test build #87534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87534/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
