[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Sorry, I gave a wrong answer at the beginning. Next time, I will review it more carefully before leaving a comment. Thank you for your work! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14660: [SPARK-17071][SQL] Fetch Parquet schema without another ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14660 **[Test build #63828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63828/consoleFull)** for PR 14660 at commit [`e1214d5`](https://github.com/apache/spark/commit/e1214d50035441fb96551683cf38ae3e49f07b7d).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 :) I thought about this issue again. At this stage, could you make a PR for this? I think you're the best person to do that. You made this optimizer and found the correct fix. This was a nice chance for me to investigate this optimizer and nullability propagation. Thank you for reviewing this, @gatorsmile. I'll close this PR soon.
[GitHub] spark pull request #14660: [SPARK-17071][SQL] Fetch Parquet schema without a...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/14660 [SPARK-17071][SQL] Fetch Parquet schema without another Spark job when it is a single file to touch

## What changes were proposed in this pull request?

It seems Spark always executes another job to figure out the schema ([ParquetFileFormat#L739-L778](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L739-L778)). However, this seems to be a bit of overhead when there is only a single file to touch. I ran a benchmark with the code below:

```scala
test("Benchmark for JSON writer") {
  withTempPath { path =>
    Seq((1, 2D, 3L, "4")).toDF("a", "b", "c", "d")
      .write.format("parquet").save(path.getAbsolutePath)
    val benchmark = new Benchmark("Parquet - read schema", 1)
    benchmark.addCase("Parquet - read schema", 10) { _ =>
      spark.read.format("parquet").load(path.getCanonicalPath).schema
    }
    benchmark.run()
  }
}
```

with the results as below:

- **Before**

```
Parquet - read schema:        Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
Parquet - read schema                 47 /  49           0.0     46728419.0        1.0X
```

- **After**

```
Parquet - read schema:        Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
Parquet - read schema                  2 /   3           0.0      1811673.0        1.0X
```

It seems this became about 20X faster (although it is a small portion of the total job run-time). As a reference, ORC does this on the driver side; see [OrcFileOperator.scala#L74-L83](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L83).

## How was this patch tested?
Existing tests should cover this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17071

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14660.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14660

commit 614abbc6b7a03ff0d3e505697c0bbfec3b330c2b
Author: hyukjinkwon
Date: 2016-08-16T05:42:29Z

    Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

commit e1214d50035441fb96551683cf38ae3e49f07b7d
Author: hyukjinkwon
Date: 2016-08-16T05:46:12Z

    Fix modifier
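The `Benchmark` class used in the PR description above is Spark's internal micro-benchmark helper. The measurement pattern it follows (run a body N times, report best and average elapsed time) can be sketched in plain Scala; this is a hypothetical, simplified harness for illustration, not Spark's actual implementation:

```scala
// Minimal best/average timing harness, in the spirit of Spark's internal
// Benchmark utility (hypothetical simplification; no warm-up, no rate column).
object MicroBench {
  // Run `body` `iters` times; return (best, average) elapsed time in nanoseconds.
  def measure(iters: Int)(body: => Unit): (Long, Long) = {
    require(iters > 0, "need at least one iteration")
    val times = (1 to iters).map { _ =>
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }
    (times.min, times.sum / times.length)
  }
}

// Example: time a trivial computation ten times, as the PR's benchmark
// does for schema reads.
val (best, avg) = MicroBench.measure(10) { (1 to 10000).sum }
```

The "Before"/"After" numbers in the PR body are exactly such best/average pairs, so a ~20X change in the best time is what the comparison reports.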
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/13796 @sethah Thank you for the great work. I'll make another pass tomorrow.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 One more try:

```scala
val splitConjunctiveConditions: Seq[Expression] = splitConjunctivePredicates(filter.condition)
val conditions = splitConjunctiveConditions ++ filter.constraints
val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```

Does this have a hole?
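The `canFilterOutNull` check discussed here rests on SQL's three-valued logic: a predicate can eliminate the null-extended rows an outer join produces only if it cannot evaluate to TRUE when the attribute is NULL. A minimal, self-contained sketch of that idea (a hypothetical simplified model using `Option` for SQL NULL, not Catalyst's actual `Expression` API):

```scala
// Three-valued SQL truth values; Unknown models a NULL result.
sealed trait Tri
case object True extends Tri
case object False extends Tri
case object Unknown extends Tri

// IS NOT NULL evaluates to FALSE (not UNKNOWN) on a NULL input.
def isNotNull(v: Option[Int]): Tri = if (v.isDefined) True else False

// A comparison against NULL evaluates to UNKNOWN.
def gt(lit: Int)(v: Option[Int]): Tri =
  v.map(x => if (x > lit) True else False).getOrElse(Unknown)

// A condition filters out null-extended rows iff it is not TRUE on NULL input.
def canFilterOutNull(pred: Option[Int] => Tri): Boolean = pred(None) != True
```

Under this model both `IsNotNull(a)` and `a > 3` filter out null-extended rows, which is why adding the filter's `IsNotNull` constraints to the condition list strengthens the check without changing its semantics.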
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14616 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63821/ Test PASSed.
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14616 Merged build finished. Test PASSed.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Another version. : )

```scala
val splitConjunctiveConditions: Seq[Expression] = splitConjunctivePredicates(filter.condition)
val conditions = splitConjunctiveConditions ++ filter.constraints.filter(_.isInstanceOf[IsNotNull])
val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14616 **[Test build #63821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63821/consoleFull)** for PR 14616 at commit [`db84e25`](https://github.com/apache/spark/commit/db84e259749e6b339367fd42305f92a224407399).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63827/ Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392 Merged build finished. Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63827/consoleFull)** for PR 14392 at commit [`05afe23`](https://github.com/apache/spark/commit/05afe2342648160165722f483cd69251826cb68e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 How about another version?

```scala
val leftConditions = (splitConjunctiveConditions ++ filter.constraints.filter(_.isInstanceOf[IsNotNull]))
  .filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = (splitConjunctiveConditions ++ filter.constraints.filter(_.isInstanceOf[IsNotNull]))
  .filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Oh, that would be a perfect fix.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Merged build finished. Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63826/ Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63826/consoleFull)** for PR 14182 at commit [`fa69bc6`](https://github.com/apache/spark/commit/fa69bc6a045322de52e55666bcc2a04cd8486b36).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 How about this fix?

```scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => expr.references.subsetOf(join.left.outputSet) && canFilterOutNull(expr))
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => expr.references.subsetOf(join.right.outputSet) && canFilterOutNull(expr))
```
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13796 @dbtsai Thanks for taking the time to review this! Major items right now:
* Adding the derivation to the aggregator doc (this is mostly finished, just fighting Scala doc with LaTeX)
* Deciding whether to add an initial model and tests in this PR or as a follow-up
* Refactoring the logistic regression helper classes to a separate file

Let me know if you see anything else.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63825/consoleFull)** for PR 14392 at commit [`cc708b5`](https://github.com/apache/spark/commit/cc708b549455ad1d850e86198a84060086d30386).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63825/ Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392 Merged build finished. Test PASSed.
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13796#discussion_r74876946

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---
@@ -0,0 +1,626 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter
+    with HasFitIntercept with HasTol with HasStandardization with HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the probability of
+   * predicting each class. Array must have length equal to the number of classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+    set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+    $(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+    @Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+    MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+    with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
+
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Merged build finished. Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63824/ Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63824/consoleFull)** for PR 14182 at commit [`8844961`](https://github.com/apache/spark/commit/884496153f9aa512bc437c1c23361479b6b2bc7b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14359 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63822/ Test PASSed.
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14359 Merged build finished. Test PASSed.
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14359 **[Test build #63822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63827/consoleFull)** for PR 14392 at commit [`05afe23`](https://github.com/apache/spark/commit/05afe2342648160165722f483cd69251826cb68e).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 A better fix is to use `nullable` in `Expression` when collecting the `IsNotNull` constraints: `filter.constraints.filter(_.isInstanceOf[IsNotNull])`
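To illustrate the idea (a hedged sketch with simplified stand-ins, not Spark's actual `Expression` classes; `Attr` and the toy `Coalesce` here are invented for illustration): a nullability-aware check matters because `Coalesce` is nullable only when *every* child is nullable, so `IsNotNull(coalesce(b, c))` must not be misread as a per-column non-null guarantee.

```scala
// Minimal stand-in for an expression tree with nullability (illustrative only).
sealed trait Expr { def nullable: Boolean; def references: Set[String] }

case class Attr(name: String, nullable: Boolean) extends Expr {
  def references: Set[String] = Set(name)
}

case class Coalesce(children: Seq[Expr]) extends Expr {
  // coalesce(...) is NULL only when all of its children are NULL
  def nullable: Boolean = children.forall(_.nullable)
  def references: Set[String] = children.flatMap(_.references).toSet
}

val b = Attr("b", nullable = true)
val c = Attr("c", nullable = true)

// IsNotNull(coalesce(b, c)) references both b and c, yet it guarantees
// nothing about either column individually: the expression stays nullable
// as long as every child is nullable.
val expr = Coalesce(Seq(b, c))
println(expr.nullable)    // true: each child may still be NULL on its own
println(expr.references)
```

A reference-intersection test on such a constraint would "see" both columns, which is exactly the over-approximation discussed in this thread.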
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 `canFilterOutNull` will cover almost all the cases. Sorry, I did not read the plan until you asked me to write a test case. Then I realized that the implementation of natural/using joins just uses `coalesce`. As @hvanhovell and @nsyca said, it is just syntactic sugar.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63826/consoleFull)** for PR 14182 at commit [`fa69bc6`](https://github.com/apache/spark/commit/fa69bc6a045322de52e55666bcc2a04cd8486b36).
[GitHub] spark pull request #14506: [SPARK-16916][SQL] serde/storage properties shoul...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14506
[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14659 Can one of the admins verify this patch?
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Please let me think more on this issue.
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
GitHub user Sherry302 opened a pull request: https://github.com/apache/spark/pull/14659 [SPARK-16757] Set up Spark caller context to HDFS ## What changes were proposed in this pull request? 1. Pass `jobId` to Task. 2. Invoke Hadoop APIs. A new function `setCallerContext` is added in `Utils`. It invokes the APIs of `org.apache.hadoop.ipc.CallerContext` to set up Spark caller contexts, which will be written into `hdfs-audit.log`. For applications in Yarn client mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and the Yarn `Client`. For applications in Yarn cluster mode, it is called in `Task` and the `ApplicationMaster`. The Spark caller contexts written into `hdfs-audit.log` are the application's name `{spark.app.name}` and `JobID_stageID_stageAttemptId_taskID_attemptNumber`. ## How was this patch tested? Manual tests against some Spark applications in Yarn client mode and Yarn cluster mode, checking that the Spark caller contexts are written into HDFS `hdfs-audit.log` successfully. For example, run SparkKMeans in Yarn client mode: `./bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5` Before: there is no Spark caller context in the records of `hdfs-audit.log`. After: Spark caller contexts appear in the records of `hdfs-audit.log`.
(_Note: the Spark caller context below appears because the Hadoop caller context API was invoked in the Yarn Client_) `2016-07-21 13:52:30,802 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SparkKMeans running on Spark` (_Note: the Spark caller context below appears because the Hadoop caller context API was invoked in a Task_) `2016-07-21 13:52:35,584 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=JobId_0_StageID_0_stageAttemptId_0_taskID_0_attemptNumber_0` You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sherry302/spark callercontextSubmit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14659.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14659 commit ec6833d32ef14950b2d81790bc908992f6288815 Author: Weiqing Yang Date: 2016-08-16T04:11:41Z [SPARK-16757] Set up Spark caller context to HDFS
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Yep, I agree. `Expr` could be anything. However, this will greatly reduce the scope of this optimization. Is that okay with you?
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14506 Merged build finished. Test PASSed.
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63818/ Test PASSed.
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14506 Thanks. Merging to master.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 If that is not applicable, I agree with @gatorsmile .
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 That only resolves one specific case. The expressions could be much more complex; `Coalesce` can appear at a very deep layer of the expression tree.
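To see why a top-level type test is not enough: a constraint such as `IsNotNull(Abs(Coalesce(b, c)))` hides the `Coalesce` one level down, so `expr.isInstanceOf[Coalesce]` on the constraint's child misses it. A recursive scan would be needed (a hedged sketch over a toy expression tree; `Attr` and `Abs` here are invented stand-ins, not Spark's `TreeNode` API):

```scala
// Toy expression tree (illustrative stand-ins, not Spark classes).
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
case class Abs(child: Expr) extends Expr { def children: Seq[Expr] = Seq(child) }
case class Coalesce(args: Seq[Expr]) extends Expr { def children: Seq[Expr] = args }

// True if any node in the tree is a Coalesce, however deeply nested.
def containsCoalesce(e: Expr): Boolean =
  e.isInstanceOf[Coalesce] || e.children.exists(containsCoalesce)

val hidden = Abs(Coalesce(Seq(Attr("b"), Attr("c"))))
// A shallow check misses the nested Coalesce; the recursive one finds it.
println(hidden.isInstanceOf[Coalesce])  // false
println(containsCoalesce(hidden))       // true
```

Even this scan only handles `Coalesce` specifically; other null-tolerant expressions would need the same treatment, which is gatorsmile's point about complexity.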
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14506 **[Test build #63818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63818/consoleFull)** for PR 14506 at commit [`3042af2`](https://github.com/apache/spark/commit/3042af2f0e9ae82e40d14e950a1036b9e417dbc9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 What about this, if we could exclude those functions?
```scala
 val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
   filter.constraints.filter(_.isInstanceOf[IsNotNull])
-    .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
+    .exists(expr => !expr.isInstanceOf[Coalesce] &&
+      leftOuterAttributeSet.intersect(expr.references).nonEmpty)
 val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
   filter.constraints.filter(_.isInstanceOf[IsNotNull])
-    .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
+    .exists(expr => !expr.isInstanceOf[Coalesce] &&
+      rightOuterAttributeSet.intersect(expr.references).nonEmpty)
```
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14447 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63820/ Test PASSed.
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14447 Merged build finished. Test PASSed.
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14447 **[Test build #63820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63820/consoleFull)** for PR 14447 at commit [`7c94e2b`](https://github.com/apache/spark/commit/7c94e2ba11655cbd9275793f6c069ab3ba844238). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74874929
--- Diff: R/pkg/R/functions.R ---
@@ -1143,7 +1139,7 @@ setMethod("minute",
 #' @export
 #' @examples \dontrun{select(df, monotonically_increasing_id())}
 setMethod("monotonically_increasing_id",
-          signature(x = "missing"),
+          signature(),
--- End diff --
Automatic generation of S4 methods is not desirable. I hope this case can be handled better by roxygen. For now, I agree that (b) is a good solution.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 The right fix is to change the following statements
```Scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
to the following ones:
```Scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Sorry, my above description was not clear. `isnotnull(coalesce(b#227, c#238))` does not filter out `NULL` values of `b#227` and `c#238` individually: `coalesce(b#227, c#238)` returns `NULL` only when both `b#227` and `c#238` are `NULL`. Thus, we are unable to use the following two statements to conclude whether the left or right side has non-null predicates.
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
```
and
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
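A quick way to see this semantics (a hedged model, not Spark's implementation: SQL NULLs represented as Scala `Option`, `coalesce` as first-defined-wins):

```scala
// coalesce over nullable values, modeled with Option:
// returns the first non-NULL argument, or NULL if all are NULL.
def coalesce[A](xs: Option[A]*): Option[A] = xs.flatten.headOption

val b: Option[Int] = None    // b IS NULL
val c: Option[Int] = Some(1) // c IS NOT NULL

// The row survives a filter isnotnull(coalesce(b, c)) even though b IS NULL,
// so the constraint implies nothing about b (or c) alone.
println(coalesce(b, c).isDefined)        // true
// Only when every argument is NULL does coalesce return NULL.
println(coalesce(None, None).isDefined)  // false
```

This is exactly why an `IsNotNull` constraint over a `coalesce` must not be treated as a null-filtering predicate on its referenced columns.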
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63825/consoleFull)** for PR 14392 at commit [`cc708b5`](https://github.com/apache/spark/commit/cc708b549455ad1d850e86198a84060086d30386).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14580 Can you explain `isnotnull(coalesce(b#227, c#238)) does not filter out NULL!!!`?
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14359 Btw, to give back-of-the-envelope estimates, we can look at 2 numbers: (1) How many nodes will be split on each iteration? (2) How big is the forest which is serialized and sent to workers on each iteration?
For (1), here's an example:
* 1000 features, each with 50 bins -> ~50 possible splits
* set maxMemoryInMB = 256 (default)
* regression => 3 Double values per possible split
* 256 * 10^6 / (1000 * 50 * 3 * 8) ≈ 213 nodes/iteration
This implies that for trees of depth > 8 or so, many iterations will only split nodes from 1 or 2 trees. I.e., we should avoid communicating most trees.
For (2), the forest can be pretty expensive to send:
* Each node:
  * leaf node: 5 Doubles
  * internal node: ~8 Doubles/references + Split
  * Split: O(# categories) or 2 values for continuous, say 3 Doubles on average
  * => say 8 Doubles/node on average
* 100 trees of depth 8 => 25600 nodes => 1.6MB
* 100 trees of depth 14 => 105MB
* I've heard of many cases of users wanting to fit 500-1000 trees and use trees of depth 18-20.
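These estimates can be reproduced mechanically (a sketch under the assumed interpretation of the figures above: per node, 1000 features x 50 candidate splits x 3 Doubles x 8 bytes of sufficient statistics, and roughly 2^depth nodes per tree at ~8 Doubles each; the helper names are invented here):

```scala
// Nodes whose split statistics fit in the memory budget per iteration.
val maxMemoryBytes = 256L * 1000 * 1000      // maxMemoryInMB = 256
val statsPerNode   = 1000L * 50 * 3 * 8      // features * splits * Doubles * bytes
val nodesPerIter   = maxMemoryBytes / statsPerNode
println(nodesPerIter)                        // 213

// Serialized forest size: ~2^depth nodes per tree, ~8 Doubles (64 bytes) per node.
def forestBytes(numTrees: Int, depth: Int): Long =
  numTrees.toLong * (1L << depth) * 8 * 8

println(forestBytes(100, 8) / 1e6)           // ~1.6 MB
println(forestBytes(100, 14) / 1e6)          // ~105 MB
```

At depth 14 the forest alone is two orders of magnitude larger than the per-iteration split budget, which is the communication cost this PR targets.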
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63824/consoleFull)** for PR 14182 at commit [`8844961`](https://github.com/apache/spark/commit/884496153f9aa512bc437c1c23361479b6b2bc7b).
[GitHub] spark issue #14658: [WIP][SPARK-5928] Remote Shuffle Blocks cannot be more t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14658 **[Test build #63823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63823/consoleFull)** for PR 14658 at commit [`443aa91`](https://github.com/apache/spark/commit/443aa91cfc2490be9733c78b7cd911f09bedfac6).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580
```Scala
val df12 = df1.join(df2, $"df1.a" === $"df2.a", "fullouter")
  .select(coalesce($"df1.b", $"df2.c").as("a"), $"df1.b", $"df2.c")
df12.join(df3, "a").explain(true)
```
This is an example to show that we should not eliminate the outer join, even if `isnotnull(coalesce(b#227, c#238))` contains attributes that are not in the join conditions.
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74874081
--- Diff: R/pkg/R/SQLContext.R ---
@@ -181,7 +181,7 @@ getDefaultSqlSource <- function() {
 #' @method createDataFrame default
 #' @note createDataFrame since 1.4.0
 # TODO(davies): support sampling and infer type from NA
-createDataFrame.default <- function(data, schema = NULL, samplingRatio = 1.0) {
+createDataFrame.default <- function(data, schema = NULL) {
--- End diff --
Oh yes... Thanks!
[GitHub] spark pull request #14658: [WIP][SPARK-5928] Remote Shuffle Blocks cannot be...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/14658 [WIP][SPARK-5928] Remote Shuffle Blocks cannot be more than 2 GB ## What changes were proposed in this pull request? Add a class `ChunkFetchInputStream`, which has the following effects: 1. flow control [WIP] 2. reduced memory usage [WIP] 3. unlimited size [WIP] ## How was this patch tested? WIP You can merge this pull request into a Git repository by running: $ git pull https://github.com/witgo/spark SPARK-5928_Shuffle_Blocks_2G Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14658.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14658 commit 443aa91cfc2490be9733c78b7cd911f09bedfac6 Author: Guoqiang Li Date: 2016-08-16T04:00:10Z Remote Shuffle Blocks cannot be more than 2 GB
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74873932
--- Diff: R/pkg/R/generics.R ---
@@ -1279,6 +1279,13 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
 #' @export
 setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })
+#' @rdname spark.gaussianMixture
+#' @export
+setGeneric("spark.gaussianMixture",
+           function(data, formula, ...) {
+             standardGeneric("spark.gaussianMixture")
--- End diff --
It cannot fit on one line, since `lint-r` requires that lines be no more than 100 characters.
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74873867
--- Diff: R/pkg/R/mllib.R ---
@@ -298,14 +304,15 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
 #' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make
 #' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
 #'
-#' @param data SparkDataFrame for training
-#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
 #'operators are supported, including '~', '.', ':', '+', and '-'.
 #'Note that the response variable of formula is empty in spark.kmeans.
-#' @param k Number of centers
-#' @param maxIter Maximum iteration number
-#' @param initMode The initialization algorithm choosen to fit the model
-#' @return \code{spark.kmeans} returns a fitted k-means model
+#' @param ... additional argument(s) passed to the method.
--- End diff --
Yeah agreed.
[GitHub] spark issue #14628: [SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14628 @holdenk I think a depth of 2 is enough to handle large RDDs, and a bigger depth may add cost. I'll append the test results later. Thanks!
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14359 **[Test build #63822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 None of us is right. :( `isnotnull(coalesce(b#227, c#238))` does not filter out `NULL`!!! Thus, the right fix is to remove the second condition:

```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
```

and

```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
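To see why a predicate like `isnotnull(coalesce(b, c))` cannot be treated as an `IsNotNull` constraint on either join side alone, here is a plain-Python sketch of the SQL semantics, modeling NULL as `None` (this is illustrative only, not Spark code):

```python
def coalesce(*args):
    # SQL COALESCE: return the first non-NULL argument, or NULL if all are NULL.
    for a in args:
        if a is not None:
            return a
    return None

# (A.B, B.C) pairs as a full outer join might produce them: each row has at
# least one non-NULL side, so the filter keeps every row even though A.B
# (and B.C) is NULL in some of them. The predicate therefore never forces
# one side alone to be non-null, and must not justify eliminating the
# outer join.
rows = [(2, None), (None, 4), (3, 5)]
kept = [r for r in rows if coalesce(*r) is not None]
```

Here `kept` equals `rows`: no row is filtered out, which is exactly the case the optimizer was getting wrong.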
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14359 Sorry for the long delay; I've been swamped by other things for a while. Re-emerging... I switched to Stack and then realized Stack has been deprecated in Scala 2.11, so I reverted to the original NodeQueue. But I renamed NodeQueue to NodeStack to be a bit clearer. @hhbyyh Any luck testing this at scale?
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 I found the root cause. None of us is right. :(
[GitHub] spark issue #14647: [WIP][Test only][DEMO][SPARK-6235]Address various 2G lim...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/14647 @hvanhovell I will submit some small PRs and provide a higher-level description of them.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 You are right. I missed that `UnsafeArrayData` is a subclass of `ArrayData`. We can pass `UnsafeArrayData` to a projection.

I have one question. When we directly generate `UnsafeArrayData` from a primitive array and copy it into an `InternalRow` (`serializefromobject_result`), the following two operations are required:

1. Copy from a primitive array to `UnsafeArrayData`
2. Copy from `UnsafeArrayData` into `InternalRow` at line 102

On the other hand, this PR requires only one operation:

0. (No copy happens at line 086 since this PR just stores a reference to a primitive array in `GenericArrayData`)
1. Copy from a primitive array to `InternalRow` ([this PR](https://github.com/apache/spark/pull/13911) performs `Platform.copyMemory` without any iteration)

Can we avoid the additional copy in step 2 when we directly generate `UnsafeArrayData` from a primitive array?
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14616 **[Test build #63821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63821/consoleFull)** for PR 14616 at commit [`db84e25`](https://github.com/apache/spark/commit/db84e259749e6b339367fd42305f92a224407399).
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14649 Also, if my understanding is correct, we are picking up only a single file to read the footer (see [ParquetFileFormat.scala#L217-L225](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L217-L225)) unless we merge schemas. So it seems that, for this reason, writing `_metadata` or `_common_metadata` is disabled (see https://issues.apache.org/jira/browse/SPARK-15719).
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74872775

```
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -423,6 +425,54 @@ class ParquetFileFormat
       sqlContext.sessionState.newHadoopConf(),
       options)
   }
+
+  override def filterPartitions(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      allFiles: Seq[FileStatus],
+      root: Path,
+      partitions: Seq[Partition]): Seq[Partition] = {
+    // Read the "_metadata" file if available, contains all block headers. On S3 better to grab
+    // all of the footers in a batch rather than having to read every single file just to get its
+    // footer.
+    allFiles.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE).map { stat =>
+      val metadata = ParquetFileReader.readFooter(conf, stat, ParquetMetadataConverter.NO_FILTER)
+      partitions.map { part =>
+        filterByMetadata(
+          filters,
+          schema,
+          conf,
+          root,
+          metadata,
+          part)
+      }.filterNot(_.files.isEmpty)
+    }.getOrElse(partitions)
+  }
+
+  private def filterByMetadata(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      root: Path,
+      metadata: ParquetMetadata,
+      partition: Partition): Partition = {
+    val blockMetadatas = metadata.getBlocks.asScala
+    val parquetSchema = metadata.getFileMetaData.getSchema
+    val conjunctiveFilter = filters
+      .flatMap(ParquetFilters.createFilter(schema, _))
+      .reduceOption(FilterApi.and)
+    conjunctiveFilter.map { conjunction =>
+      val filteredBlocks = RowGroupFilter.filterRowGroups(
```

--- End diff --

Do you mind if I ask a question, please? If my understanding is correct, Parquet already filters row groups in both the normal reader and the vectorized reader (https://github.com/apache/spark/pull/13701). Is this doing the same thing on the Spark side?
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74872795 (commenting on the same `filterPartitions`/`filterByMetadata` diff hunk in ParquetFileFormat.scala quoted above)

--- End diff --

Also, doesn't this try to touch many files on the driver side?
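As background for the row-group filtering questions: Parquet can skip a whole row group when the min/max statistics in the footer show the filter cannot match. A minimal Python sketch of that idea, assuming made-up stats for a single integer column (the dicts below are hypothetical, not the real footer API):

```python
# Hypothetical per-row-group statistics; in reality these come from the
# ParquetMetadata footer (block metadata), here modeled for one column "x".
row_groups = [
    {"file": "part-0", "min": 0,  "max": 9},
    {"file": "part-1", "min": 10, "max": 19},
    {"file": "part-2", "min": 20, "max": 29},
]

def may_match_eq(rg, value):
    # An equality filter x = value can only match rows in a group whose
    # [min, max] range contains the value; other groups are skipped
    # without reading any data pages.
    return rg["min"] <= value <= rg["max"]

kept = [rg["file"] for rg in row_groups if may_match_eq(rg, 15)]
```

With a filter `x = 15`, only `part-1` survives. The PR under review applies this pruning on the driver using the consolidated `_metadata` file, which is why the reviewer asks whether it duplicates the per-reader row-group filtering Parquet already does.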
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14447 **[Test build #63820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63820/consoleFull)** for PR 14447 at commit [`7c94e2b`](https://github.com/apache/spark/commit/7c94e2ba11655cbd9275793f6c069ab3ba844238).
[GitHub] spark issue #14626: [SPARK-16519][SPARKR] Handle SparkR RDD generics that cr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14626 Merged build finished. Test PASSed.
[GitHub] spark issue #14626: [SPARK-16519][SPARKR] Handle SparkR RDD generics that cr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63819/
[GitHub] spark issue #14626: [SPARK-16519][SPARKR] Handle SparkR RDD generics that cr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14626 **[Test build #63819 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63819/consoleFull)** for PR 14626 at commit [`2723eca`](https://github.com/apache/spark/commit/2723ecadcec4baad697639023fba6aafa373f7d6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptro...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/14447#discussion_r74871845

```
--- Diff: R/pkg/R/mllib.R ---
@@ -414,6 +421,94 @@ setMethod("predict", signature(object = "KMeansModel"),
             return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
           })
 
+#' Multilayer Perceptron Classification Model
+#'
+#' \code{spark.mlp} fits a multi-layer perceptron neural network model against a SparkDataFrame.
+#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
+#' Only categorical data is supported.
+#' For more details, see
+#' \href{http://spark.apache.org/docs/latest/ml-classification-regression.html
+#' #multilayer-perceptron-classifier}{Multilayerperceptron classifier}.
+#'
+#' @param data A \code{SparkDataFrame} of observations and labels for model fitting
+#' @param blockSize BlockSize parameter
+#' @param layers Integer vector containing the number of nodes for each layer
+#' @param solver Solver parameter, supported options: "gd" (minibatch gradient descent) or "l-bfgs"
+#' @param maxIter Maximum iteration number
+#' @param tol Convergence tolerance of iterations
+#' @param stepSize StepSize parameter
+#' @param seed Seed parameter for weights initialization
+#' @return \code{spark.mlp} returns a fitted Multilayer Perceptron Classification Model
+#' @rdname spark.mlp
+#' @aliases spark.mlp,SparkDataFrame-method
+#' @name spark.mlp
+#' @seealso \link{read.ml}
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
+#'
+#' # fit a Multilayer Perceptron Classification Model
+#' model <- spark.mlp(df, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs",
+#'                    maxIter = 100, tol = 0.5, stepSize = 1, seed = 1)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mlp since 2.1.0
+setMethod("spark.mlp", signature(data = "SparkDataFrame"),
+          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
+                   tol = 0.5, stepSize = 1, seed = 1, ...) {
+            jobj <- callJStatic("org.apache.spark.ml.r.MultilayerPerceptronClassifierWrapper",
+                                "fit", data@sdf, as.integer(blockSize), as.array(layers),
+                                as.character(solver), as.integer(maxIter), as.numeric(tol),
+                                as.numeric(stepSize), as.integer(seed))
+            return(new("MultilayerPerceptronClassificationModel", jobj = jobj))
+          })
+
+# Makes predictions from a model produced by spark.mlp().
+
+#' @param newData A SparkDataFrame for testing
+#' @return \code{predict} returns a SparkDataFrame containing predicted labeled in a column named
+#'         "prediction"
+#' @rdname spark.mlp
+#' @aliases predict,MultilayerPerceptronClassificationModel-method
+#' @export
+#' @note predict(MultilayerPerceptronClassificationModel) since 2.1.0
+setMethod("predict", signature(object = "MultilayerPerceptronClassificationModel"),
+          function(object, newData) {
+            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
+          })
+
+# Returns the summary of a Multilayer Perceptron Classification Model produced by \code{spark.mlp}
+
+#' @param object A Multilayer Perceptron Classification Model fitted by \code{spark.mlp}
```

--- End diff --

ok, fixing it now. thanks Felix
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Thank you, @nsyca!
[GitHub] spark issue #13428: [SPARK-12666][CORE] SparkSubmit packages fix for when 'd...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/13428 Thanks for the review @JoshRosen, I made the requested changes and tested it out once more. I think it is low risk because it is pretty well isolated to this particular issue and only improves on how it was before.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Hmm.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Yep. Here is the output.

```scala
scala> val a = Seq((1,2),(2,3)).toDF("a","b").createOrReplaceTempView("A")

scala> val b = Seq((2,5),(3,4)).toDF("a","c").createOrReplaceTempView("B")

scala> sql("select A.A,B.A,A.B,B.C from A full join B on A.A=B.A").show
+----+----+----+----+
|   A|   A|   B|   C|
+----+----+----+----+
|   1|null|   2|null|
|null|   3|null|   4|
|   2|   2|   3|   5|
+----+----+----+----+

scala> sql("select A.A,B.A from A full join B on A.A=B.A where coalesce(A.B,B.C) is not null").show
+---+---+
|  A|  A|
+---+---+
|  2|  2|
+---+---+
```
[GitHub] spark pull request #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptro...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14447#discussion_r74870955 (commenting on the `#' @param newData A SparkDataFrame for testing` line of the same R/pkg/R/mllib.R spark.mlp diff quoted above)

--- End diff --

oops. add @param for object
[GitHub] spark pull request #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptro...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14447#discussion_r74870995 (commenting on the `#' @param object A Multilayer Perceptron Classification Model fitted by \code{spark.mlp}` line of the same R/pkg/R/mllib.R spark.mlp diff quoted above)

--- End diff --

add `@param ... Currently not used` for CRAN check
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14580 @dongjoon-hyun, could you please try this on your PR?

```Scala
val a = Seq((1,2),(2,3)).toDF("a","b").createOrReplaceTempView("A")
val b = Seq((2,5),(3,4)).toDF("a","c").createOrReplaceTempView("B")
sql("select A.A,B.A,A.B,B.C from A full join B on A.A=B.A").show
sql("select A.A,B.A from A full join B on A.A=B.A where coalesce(A.B,B.C) is not null").show
```

How many rows do you get from the last and the second-to-last statements?
[GitHub] spark pull request #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptro...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14447#discussion_r74870693 --- Diff: R/pkg/R/mllib.R --- @@ -414,6 +421,94 @@ setMethod("predict", signature(object = "KMeansModel"), return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) }) +#' Multilayer Perceptron Classification Model +#' +#' \code{spark.mlp} fits a multi-layer perceptron neural network model against a SparkDataFrame. +#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' Only categorical data is supported. +#' For more details, see +#' \href{http://spark.apache.org/docs/latest/ml-classification-regression.html +#' #multilayer-perceptron-classifier}{Multilayerperceptron classifier}. +#' +#' @param data A \code{SparkDataFrame} of observations and labels for model fitting +#' @param blockSize BlockSize parameter +#' @param layers Integer vector containing the number of nodes for each layer +#' @param solver Solver parameter, supported options: "gd" (minibatch gradient descent) or "l-bfgs" +#' @param maxIter Maximum iteration number +#' @param tol Convergence tolerance of iterations +#' @param stepSize StepSize parameter +#' @param seed Seed parameter for weights initialization +#' @return \code{spark.mlp} returns a fitted Multilayer Perceptron Classification Model +#' @rdname spark.mlp +#' @aliases spark.mlp,SparkDataFrame-method +#' @name spark.mlp +#' @seealso \link{read.ml} +#' @export +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm") +#' +#' # fit a Multilayer Perceptron Classification Model +#' model <- spark.mlp(df, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs", +#'maxIter = 100, tol = 0.5, stepSize = 1, seed = 1) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' 
predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.mlp since 2.1.0 +setMethod("spark.mlp", signature(data = "SparkDataFrame"), + function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100, + tol = 0.5, stepSize = 1, seed = 1, ...) { --- End diff -- We are working on the others in PR #14558; if it's not needed, I think we should remove `...` for now, since the CRAN check will flag it. We can always add parameters later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14626: [SPARK-16519][SPARKR] Handle SparkR RDD generics that cr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14626 **[Test build #63819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63819/consoleFull)** for PR 14626 at commit [`2723eca`](https://github.com/apache/spark/commit/2723ecadcec4baad697639023fba6aafa373f7d6).
[GitHub] spark issue #14641: [Minor] [SparkR] spark.glm weightCol should in the signa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14641 I think the tests are only passing strings, but we should coerce this to be safe. LGTM
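A minimal R sketch of the coercion suggested above. The helper name `coerceWeightCol` is illustrative only, not part of SparkR; the point is simply that `as.character` makes a plain string out of whatever the caller passed:

```r
# Illustrative helper: coerce whatever the caller passed as weightCol to a
# plain character string before handing it to the JVM backend.
coerceWeightCol <- function(weightCol) {
  if (is.null(weightCol)) {
    return(NULL)            # no weight column supplied
  }
  as.character(weightCol)   # a bare symbol or factor becomes a plain string
}

coerceWeightCol("weight")       # already a string, unchanged
coerceWeightCol(quote(weight))  # a bare symbol is coerced to "weight"
```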
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74870049 --- Diff: R/pkg/R/mllib.R --- @@ -632,3 +659,110 @@ setMethod("predict", signature(object = "AFTSurvivalRegressionModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) }) + +#' Multivariate Gaussian Mixture Model (GMM) +#' +#' Fits multivariate gaussian mixture model against a Spark DataFrame, similarly to R's +#' mvnormalmixEM(). Users can call \code{summary} to print a summary of the fitted model, +#' \code{predict} to make predictions on new data, and \code{write.ml}/\code{read.ml} +#' to save/load fitted models. +#' +#' @param data SparkDataFrame for training +#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#'Note that the response variable of formula is empty in spark.gaussianMixture. +#' @param k Number of independent Gaussians in the mixture model. 
+#' @param maxIter Maximum iteration number +#' @param tol The convergence tolerance +#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method +#' @return \code{spark.gaussianMixture} returns a fitted multivariate gaussian mixture model +#' @rdname spark.gaussianMixture +#' @name spark.gaussianMixture +#' @seealso mixtools: \url{https://cran.r-project.org/web/packages/mixtools/} +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' library(mvtnorm) +#' set.seed(100) +#' a <- rmvnorm(4, c(0, 0)) +#' b <- rmvnorm(6, c(3, 4)) +#' data <- rbind(a, b) +#' df <- createDataFrame(as.data.frame(data)) +#' model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2) +#' summary(model) +#' +#' # fitted values on training data +#' fitted <- predict(model, df) +#' head(select(fitted, "V1", "prediction")) +#' +#' # save fitted model to input path +#' path <- "path/to/model" +#' write.ml(model, path) +#' +#' # can also read back the saved model and print +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.gaussianMixture since 2.1.0 +#' @seealso \link{predict}, \link{read.ml}, \link{write.ml} +setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, k = 2, maxIter = 100, tol = 0.01) { +formula <- paste(deparse(formula), collapse = "") +jobj <- callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf, +formula, as.integer(k), as.integer(maxIter), tol) +return(new("GaussianMixtureModel", jobj = jobj)) + }) + +# Get the summary of a multivariate gaussian mixture model + +#' @param object A fitted gaussian mixture model +#' @return \code{summary} returns the model's lambda, mu, sigma and posterior +#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method +#' @rdname spark.gaussianMixture +#' @export +#' @note summary(GaussianMixtureModel) since 2.1.0 +setMethod("summary", signature(object = "GaussianMixtureModel"), + function(object, ...) 
{ +jobj <- object@jobj +is.loaded <- callJMethod(jobj, "isLoaded") +lambda <- unlist(callJMethod(jobj, "lambda")) +muList <- callJMethod(jobj, "mu") +sigmaList <- callJMethod(jobj, "sigma") +k <- callJMethod(jobj, "k") +dim <- callJMethod(jobj, "dim") +mu <- c() +for (i in 1 : k) { + start <- (i - 1) * dim + 1 + end <- i * dim + mu[[i]] <- unlist(muList[start : end]) +} +sigma <- c() +for (i in 1 : k) { + start <- (i - 1) * dim * dim + 1 + end <- i * dim * dim + sigma[[i]] <- t(matrix(sigmaList[start : end], ncol = dim)) +} +posterior <- if (is.loaded) { + NULL +} else { + dataFrame(callJMethod(jobj, "posterior")) +} +return(list(lambda = lambda, mu = mu, sigma = sigma, + posterior = posterior, is.loaded = is.loaded)) + }) + +# Predicted values based on a gaussian mixture model + +#' @param newData SparkDataFrame for testing +#' @return \code{predict} returns a SparkDataFrame containing predicted labels in a column named +#' "prediction" +#' @return \code{predict} returns the predicted values
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74869983 --- Diff: R/pkg/R/mllib.R --- @@ -632,3 +659,110 @@ setMethod("predict", signature(object = "AFTSurvivalRegressionModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) }) + +#' Multivariate Gaussian Mixture Model (GMM) +#' +#' Fits multivariate gaussian mixture model against a Spark DataFrame, similarly to R's +#' mvnormalmixEM(). Users can call \code{summary} to print a summary of the fitted model, +#' \code{predict} to make predictions on new data, and \code{write.ml}/\code{read.ml} +#' to save/load fitted models. +#' +#' @param data SparkDataFrame for training +#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#'Note that the response variable of formula is empty in spark.gaussianMixture. +#' @param k Number of independent Gaussians in the mixture model. 
+#' @param maxIter Maximum iteration number +#' @param tol The convergence tolerance +#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method +#' @return \code{spark.gaussianMixture} returns a fitted multivariate gaussian mixture model +#' @rdname spark.gaussianMixture +#' @name spark.gaussianMixture +#' @seealso mixtools: \url{https://cran.r-project.org/web/packages/mixtools/} +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' library(mvtnorm) +#' set.seed(100) +#' a <- rmvnorm(4, c(0, 0)) +#' b <- rmvnorm(6, c(3, 4)) +#' data <- rbind(a, b) +#' df <- createDataFrame(as.data.frame(data)) +#' model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2) +#' summary(model) +#' +#' # fitted values on training data +#' fitted <- predict(model, df) +#' head(select(fitted, "V1", "prediction")) +#' +#' # save fitted model to input path +#' path <- "path/to/model" +#' write.ml(model, path) +#' +#' # can also read back the saved model and print +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.gaussianMixture since 2.1.0 +#' @seealso \link{predict}, \link{read.ml}, \link{write.ml} +setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, k = 2, maxIter = 100, tol = 0.01) { + formula <- paste(deparse(formula), collapse = "") + jobj <- callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf, + formula, as.integer(k), as.integer(maxIter), tol) --- End diff -- add `as.numeric(tol)` if we could, since tol is not in the signature
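A quick R illustration of why the `as.numeric(tol)` coercion matters: since `tol` is not constrained by the S4 signature, nothing guarantees it is a double before the JVM call. The values below are for demonstration only:

```r
# An integer literal passes is.numeric() but is stored as an R integer,
# which can surface on the JVM side as an Integer rather than a Double.
tol <- 1L
typeof(tol)              # "integer"
typeof(as.numeric(tol))  # "double" -- safe to pass where a Double is expected
```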
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74869872 --- Diff: R/pkg/R/mllib.R --- @@ -526,6 +533,24 @@ setMethod("write.ml", signature(object = "KMeansModel", path = "character"), invisible(callJMethod(writer, "save", path)) }) +# Save fitted MLlib model to the input path + +#' @param path The directory where the model is saved +#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE +#' which means throw exception if the output path exists. +#' +#' @rdname spark.gaussianMixture --- End diff -- let's add `@aliases`
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74869843 --- Diff: R/pkg/R/generics.R --- @@ -1279,6 +1279,13 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s #' @export setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") }) +#' @rdname spark.gaussianMixture +#' @export +setGeneric("spark.gaussianMixture", + function(data, formula, ...) { + standardGeneric("spark.gaussianMixture") --- End diff -- does it fit on one line, like the others?
[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14229#discussion_r74869802 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala --- @@ -0,0 +1,207 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.r + +import scala.collection.mutable + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.SparkException +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.ml.clustering.{LDA, LDAModel} +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover} +import org.apache.spark.ml.linalg.VectorUDT +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.StringType + + +private[r] class LDAWrapper private ( + val pipeline: PipelineModel, + val logLikelihood: Double, + val logPerplexity: Double, + val vocabulary: Array[String]) extends MLWritable { + + import LDAWrapper._ + + private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel] + private val preprocessor: PipelineModel = + new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1)) + + def transform(data: Dataset[_]): DataFrame = { + pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL) + } + + def computeLogPerplexity(data: Dataset[_]): Double = { + lda.logPerplexity(preprocessor.transform(data)) + } + + lazy val topicIndices: DataFrame = lda.describeTopics(10) --- End diff -- I think we could add an additional parameter if it could be useful.
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14506 **[Test build #63818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63818/consoleFull)** for PR 14506 at commit [`3042af2`](https://github.com/apache/spark/commit/3042af2f0e9ae82e40d14e950a1036b9e417dbc9).
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13758 You can take a look at `GenerateUnsafeProjection`: if the `ArrayData` is already an unsafe array, we copy it directly, so no iteration is needed.
[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14229#discussion_r74869681 --- Diff: R/pkg/R/mllib.R --- @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula return(new("AFTSurvivalRegressionModel", jobj = jobj)) }) +#' Latent Dirichlet Allocation +#' +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' +#' @param data A SparkDataFrame for training +#' @param features Features column name, default "features". Either Vector format column or String --- End diff -- could you link to libSVM's ml.Vector-format?
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/8880 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63817/ Test FAILed.
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/8880 **[Test build #63817 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63817/consoleFull)** for PR 8880 at commit [`f5af081`](https://github.com/apache/spark/commit/f5af08147ffcfcd883b758219b40baf0eb2e4e16). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/8880 Merged build finished. Test FAILed.
[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14229#discussion_r74869578 --- Diff: R/pkg/R/mllib.R --- @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula return(new("AFTSurvivalRegressionModel", jobj = jobj)) }) +#' Latent Dirichlet Allocation +#' +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' +#' @param data A SparkDataFrame for training +#' @param features Features column name, default "features". Either Vector format column or String +#'format column are accepted. +#' @param k Number of topics, default 10 +#' @param maxIter Maximum iterations, default 20 +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online" +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in +# each iteration of mini-batch gradient descent, in range (0, 1], default 0.05 +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for +#'the prior placed on topic distributions over terms, default -1 to set automatically on the +#'Spark side. Use \code{summary} to retrieve the effective topicConcentration. +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the +#'prior placed on documents distributions over topics (\code{theta}), default -1 to set +#'automatically on the Spark side. Use \code{summary} to retrieve the effective +#'docConcentration. +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected +#'given training data with string format column. 
--- End diff -- right, I think "affects" is the right word to use