date:20160107

[GitHub] spark pull request: [SPARK-12604] [CORE] Addendum - use casting vs...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10641#issuecomment-169831711
  
**[Test build #2351 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2351/consoleFull)**
 for PR 10641 at commit 
[`377fb49`](https://github.com/apache/spark/commit/377fb49a677f7f81699a7a9c05195cec9503af2b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10472#issuecomment-169836993
  
**[Test build #48977 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48977/consoleFull)**
 for PR 10472 at commit 
[`860861c`](https://github.com/apache/spark/commit/860861cb613a2d00a70e4eb699c25b2375c86eda).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12507][Streaming][Document]Expose close...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10453#issuecomment-169837010
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48980/
Test PASSed.



[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10652#issuecomment-169837014
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10472#issuecomment-169837133
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-12507][Streaming][Document]Expose close...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10453#issuecomment-169837009
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10652#issuecomment-169837015
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48981/
Test FAILed.



[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49140391
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.train(sc.parallelize(data), k=4)
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0]))
+    True
+    >>> model.k
+    4
+    >>> model.computeCost(array([0.0, 0.0]))
+    0.0
+    >>> model.k == len(model.clusterCenters)
+    True
+    >>> model = bskm.train(sc.parallelize(data), k=2)
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.k
+    2
+
+    .. versionadded:: 2.0.0
+    """
+
+    @property
+    @since('2.0.0')
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    @since('2.0.0')
+    def k(self):
+        """Get the number of clusters"""
+        return self.call("k")
+
+    @since('2.0.0')
+    def predict(self, x):
+        """
+        Find the cluster to which x belongs in this model.
+
+        :param x: Either the point to determine the cluster for or an RDD of points to determine
+        the clusters for.
+        """
+        if isinstance(x, RDD):
+            return x.map(self.predict(x))
--- End diff --

Ah yes, it should be. I'll add a docstring test for this method.
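
A minimal sketch of the issue on the last line of the diff, runnable without Spark. `FakeRDD` and `ToyModel` are hypothetical stand-ins invented for illustration, not PySpark classes. The sketch assumes the intended call is `x.map(self.predict)`: passing `self.predict(x)` instead would call `predict` on the whole RDD again before `map` ever runs, so it would recurse without end.

```python
class FakeRDD:
    """Hypothetical stand-in for an RDD; only map() and collect() are modeled."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        # Apply f to every element, as an RDD's map would.
        return FakeRDD(f(item) for item in self.items)

    def collect(self):
        return self.items


class ToyModel:
    """Hypothetical model: assigns a 1-D point to the nearest of two centers."""

    centers = [0.0, 10.0]

    def predict(self, x):
        if isinstance(x, FakeRDD):
            # Pass the bound method itself, not the result of calling it.
            return x.map(self.predict)
        return min(range(len(self.centers)),
                   key=lambda i: abs(x - self.centers[i]))


predictions = ToyModel().predict(FakeRDD([1.0, 9.0, 4.0])).collect()
```

A docstring test that calls `predict` on an RDD, as proposed above, would exercise exactly this branch.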



[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10653#issuecomment-169841436
  
**[Test build #48986 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48986/consoleFull)**
 for PR 10653 at commit 
[`ade6f5d`](https://github.com/apache/spark/commit/ade6f5d354985f3778e0c8c2da80679c76495f0a).



[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10652#issuecomment-169842106
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48985/
Test FAILed.



[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10635#issuecomment-169843491
  
**[Test build #48988 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48988/consoleFull)**
 for PR 10635 at commit 
[`8bdd481`](https://github.com/apache/spark/commit/8bdd48189f96a45db54bc8d11e16107b0d15318f).



[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169833544
  
**[Test build #48979 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48979/consoleFull)**
 for PR 10609 at commit 
[`4e4e9a1`](https://github.com/apache/spark/commit/4e4e9a136ffae30665979df7307a6175188690f7).



[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread davies

GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/10653

[SPARK-12700] [SQL] embed condition into SMJ and BroadcastHashJoin

Currently, SortMergeJoin and BroadcastHashJoin do not support a join condition;
they need a Filter after them for that. The result projection that generates
UnsafeRow can be very expensive when the join produces many rows, most of which
are then filtered out by the condition.

This PR adds support for a condition to SortMergeJoin and BroadcastHashJoin,
just like other outer joins do.

This could improve the performance of Q72 by 7x (from 120s to 16.5s).
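
A rough illustration of why this helps, as a plain-Python sketch rather than the Spark implementation. All function names here are hypothetical. It assumes a simple hash join over lists of dicts and counts how many output rows each strategy builds: a join followed by a separate filter builds a row for every key match, while a join that checks the condition first builds only the rows that survive.

```python
from collections import defaultdict


def build_index(rows, key):
    """Group rows by their join key."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def join_then_filter(left, right, key, condition):
    """Join on key, build every matching output row, then filter."""
    index = build_index(right, key)
    joined = [(l, r, {**l, **r}) for l in left for r in index.get(l[key], [])]
    output = [row for l, r, row in joined if condition(l, r)]
    return output, len(joined)


def join_with_condition(left, right, key, condition):
    """Join on key, but check the condition before building the output row."""
    index = build_index(right, key)
    output = [{**l, **r}
              for l in left
              for r in index.get(l[key], [])
              if condition(l, r)]
    return output, len(output)


left = [{"k": 1, "a": i} for i in range(20)]
right = [{"k": 1, "b": j} for j in range(20)]


def same_value(l, r):
    return l["a"] == r["b"]


rows_filtered, built_filtered = join_then_filter(left, right, "k", same_value)
rows_embedded, built_embedded = join_with_condition(left, right, "k", same_value)
```

Both strategies return the same rows; only the number of rows built along the way differs, which is the cost the description attributes to the result projection.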

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark filter_join

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10653.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10653


commit a38d623d7d57709f2f26b1189ff699c02bd0ca57
Author: Davies Liu 
Date:   2016-01-07T23:05:50Z

embed condition into SMJ and BroadcastHashJoin





[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10653#issuecomment-169838565
  
**[Test build #48984 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48984/consoleFull)**
 for PR 10653 at commit 
[`a38d623`](https://github.com/apache/spark/commit/a38d623d7d57709f2f26b1189ff699c02bd0ca57).



[GitHub] spark pull request: [SPARK-12510][Streaming]Refactor ActorReceiver...

2016-01-07 Thread zsxwing

Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/10457#issuecomment-169838489
  
@tdas forgot to merge? I'm merging it now.



[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...

2016-01-07 Thread thunterdb

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9936#discussion_r49143685
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -371,6 +378,103 @@ def test_fit_maximize_metric(self):
         self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1")
 
 
+class RegressorTest(PySparkTestCase):
+
+    def setupData(self):
+        try:
+            self.df
+        except AttributeError:
+            from pyspark.mllib.linalg import Vectors
+            sqlContext = SQLContext(self.sc)
+            self.df = sqlContext.createDataFrame([
+                (1.0, Vectors.dense(1.0)),
+                (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
+
+    def test_linear_regression(self):
+        self.setupData()
+        lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal")
+        model = lr.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_decision_tree_regressor(self):
+        self.setupData()
+        dt = DecisionTreeRegressor(maxDepth=2)
+        model = dt.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_random_forest_regressor(self):
+        self.setupData()
+        rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
+        model = rf.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_gbt_regressor(self):
+        self.setupData()
+        gbt = GBTRegressor(maxIter=5, maxDepth=2)
+        model = gbt.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+
+class ClassificationTest(PySparkTestCase):
+
+    def setupData(self):
+        try:
+            self.df
+        except AttributeError:
+            from pyspark.mllib.linalg import Vectors
--- End diff --

Is there any reason for putting the import in the code?



[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...

2016-01-07 Thread thunterdb

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9936#discussion_r49143733
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -371,6 +378,103 @@ def test_fit_maximize_metric(self):
         self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1")
 
 
+class RegressorTest(PySparkTestCase):
+
+    def setupData(self):
+        try:
+            self.df
+        except AttributeError:
+            from pyspark.mllib.linalg import Vectors
+            sqlContext = SQLContext(self.sc)
+            self.df = sqlContext.createDataFrame([
+                (1.0, Vectors.dense(1.0)),
+                (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
+
+    def test_linear_regression(self):
+        self.setupData()
+        lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal")
+        model = lr.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_decision_tree_regressor(self):
+        self.setupData()
+        dt = DecisionTreeRegressor(maxDepth=2)
+        model = dt.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_random_forest_regressor(self):
+        self.setupData()
+        rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
+        model = rf.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_gbt_regressor(self):
+        self.setupData()
+        gbt = GBTRegressor(maxIter=5, maxDepth=2)
+        model = gbt.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+
+class ClassificationTest(PySparkTestCase):
+
+    def setupData(self):
+        try:
+            self.df
+        except AttributeError:
+            from pyspark.mllib.linalg import Vectors
+            sqlContext = SQLContext(self.sc)
+            self.df = sqlContext.createDataFrame([
+                (1.0, Vectors.dense(1.0, 0.0)),
+                (0.0, Vectors.sparse(2, [1], [1.0]))], ["label", "features"])
+
+    def test_logistic_regression(self):
+        self.setupData()
+        lr = LogisticRegression(maxIter=5, regParam=0.01)
+        model = lr.fit(self.df)
+        self.assertEqual(2, model.numFeatures)
+
+    def test_decision_tree_classifier(self):
+        from pyspark.ml.feature import StringIndexer
--- End diff --

Same thing here.



[GitHub] spark pull request: [SPARK-9835] [ML] IterativelyReweightedLeastSq...

2016-01-07 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r49144625
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/GLMFamilies.scala ---
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.rdd.RDD
+
+/**
+ * A description of the error distribution and link function to be used in 
the model.
+ * @param link a link function instance
+ */
+private[ml] abstract class Family(val link: Link) extends Serializable {
+
+  /**
+   * Starting value for mu in the IRLS algorithm.
+   */
+  def startingMu(y: Double, yMean: Double): Double = (y + yMean) / 2.0
+
+  /**
+   * Deviance of (y, mu) pair.
+   * Deviance is usually defined as twice the loglikelihood ratio.
+   */
+  def deviance(y: RDD[Double], mu: RDD[Double]): Double
+
+  /** Weights for IRLS steps. */
+  def weights(mu: Double): Double
+
+  /** The working dependent variable. */
+  def z(y: Double, mu: Double, eta: Double): Double
+}
+
+/**
+ * Binomial exponential family distribution.
+ * The default link for the Binomial family is the logit link.
+ * @param link a link function instance
+ */
+private[ml] class Binomial(link: Link = new Logit) extends Family(link) {
+
+  override def startingMu(y: Double, yMean: Double): Double = (y + 0.5) / 
2.0
+
+  override def deviance(y: RDD[Double], mu: RDD[Double]): Double = {
+mu.zip(y).map { case (mu, y) =>
+  val my = 1.0 - y
+  y * math.log(math.max(y, 1.0) / mu) +
+my * math.log(math.max(my, 1.0) / (1.0 - mu))
+}.sum() * 2
+  }
+
+  override def weights(mu: Double): Double = {
+mu * (1 - mu)
+  }
+
+  override def z(y: Double, mu: Double, eta: Double): Double = {
+eta + (y - mu) * link.deriv(mu)
+  }
+}
+
+/**
+ * Poisson exponential family.
+ * The default link for the Poisson family is the log link.
+ * @param link a link function instance
+ */
+private[ml] class Poisson(link: Link = new Logit) extends Family(link) {
--- End diff --

I believe the link function here should default to `Log`, not `Logit`.
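
A small sketch of why the default matters, in plain Python rather than the Scala classes from the diff. The function names are hypothetical. The log link is the canonical link for the Poisson family and is defined for any positive mean, while the logit link is only defined for means strictly between 0 and 1, so it breaks on an ordinary Poisson mean.

```python
import math


def log_link(mu):
    """Log link: eta = log(mu), defined for any mu > 0."""
    return math.log(mu)


def logit_link(mu):
    """Logit link: eta = log(mu / (1 - mu)), defined only for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))


poisson_mean = 3.5  # a perfectly valid Poisson mean, but outside (0, 1)

eta = log_link(poisson_mean)

try:
    logit_link(poisson_mean)
    logit_defined = True
except ValueError:
    # math.log rejects the negative ratio mu / (1 - mu).
    logit_defined = False
```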



[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10650#issuecomment-169847124
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48989/
Test FAILed.



[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10650#issuecomment-169847116
  
**[Test build #48989 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48989/consoleFull)**
 for PR 10650 at commit 
[`87fc0ff`](https://github.com/apache/spark/commit/87fc0ffb67e6538b2b850e0fd36ba6e2c63fc549).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10650#issuecomment-169847122
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10615#issuecomment-169850361
  
**[Test build #48975 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48975/consoleFull)**
 for PR 10615 at commit 
[`319e0ed`](https://github.com/apache/spark/commit/319e0edb17d02eb994bc1cd104a29df8c47a9c59).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10238#issuecomment-169851002
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48964/
Test FAILed.



[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10238#issuecomment-169851001
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

2016-01-07 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/10615#discussion_r49147677
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.csv
+
+import java.nio.charset.UnsupportedCharsetException
+import java.io.File
+import java.sql.Timestamp
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.{DataFrame, QueryTest, Row}
+import org.apache.spark.sql.test.{SQLTestUtils, SharedSQLContext}
+import org.apache.spark.sql.types._
+
+class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
+  private val carsFile = "cars.csv"
+  private val carsFile8859 = "cars_iso-8859-1.csv"
+  private val carsTsvFile = "cars.tsv"
+  private val carsAltFile = "cars-alternative.csv"
+  private val carsUnbalancedQuotesFile = "cars-unbalanced-quotes.csv"
+  private val carsNullFile = "cars-null.csv"
+  private val emptyFile = "empty.csv"
+  private val commentsFile = "comments.csv"
+  private val disableCommentsFile = "disable_comments.csv"
+
+  private def testFile(fileName: String): String = {
+
Thread.currentThread().getContextClassLoader.getResource(fileName).toString
+  }
+
+  /** Verifies data and schema. */
+  private def verifyCars(
+  df: DataFrame,
+  withHeader: Boolean,
+  numCars: Int = 3,
+  numFields: Int = 5,
+  checkHeader: Boolean = true,
+  checkValues: Boolean = true,
+  checkTypes: Boolean = false): Unit = {
+
+val numColumns = numFields
+val numRows = if (withHeader) numCars else numCars + 1
+// schema
+assert(df.schema.fieldNames.length === numColumns)
+assert(df.collect().length === numRows)
+
+if (checkHeader) {
+  if (withHeader) {
+assert(df.schema.fieldNames === Array("year", "make", "model", 
"comment", "blank"))
+  } else {
+assert(df.schema.fieldNames === Array("C0", "C1", "C2", "C3", 
"C4"))
+  }
+}
+
+if (checkValues) {
+  val yearValues = List("2012", "1997", "2015")
+  val actualYears = if (!withHeader) "year" :: yearValues else 
yearValues
+  val years = if (withHeader) df.select("year").collect() else 
df.select("C0").collect()
+
+  years.zipWithIndex.foreach { case (year, index) =>
+if (checkTypes) {
+  assert(year === Row(actualYears(index).toInt))
+} else {
+  assert(year === Row(actualYears(index)))
+}
+  }
+}
+  }
+
+  test("simple csv test") {
+val cars = sqlContext
+  .read
+  .format("csv")
+  .option("header", "false")
+  .load(testFile(carsFile))
+
+verifyCars(cars, withHeader = false, checkTypes = false)
+  }
+
+  test("simple csv test with type inference") {
+val cars = sqlContext
+  .read
+  .format("csv")
+  .option("header", "true")
+  .option("inferSchema", "true")
+  .load(testFile(carsFile))
+
+verifyCars(cars, withHeader = true, checkTypes = true)
+  }
+
+  test("test with alternative delimiter and quote") {
+val cars = sqlContext.read
+  .format("csv")
+  .options(Map("quote" -> "\'", "delimiter" -> "|", "header" -> "true"))
+  .load(testFile(carsAltFile))
+
+verifyCars(cars, withHeader = true)
+  }
+
+  test("bad encoding name") {
+val exception = intercept[UnsupportedCharsetException] {
+  sqlContext
+.read
+.format("csv")
+.option("charset", "1-9588-osi")
+.load(testFile(carsFile8859))
+}
+
+assert(exception.getMessage.contains("1-9588-osi"))
+  }
+
+  ignore("test different encoding") {
+

[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

2016-01-07 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/10615#discussion_r49147704
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.csv
+
+import java.nio.charset.UnsupportedCharsetException
+import java.io.File
+import java.sql.Timestamp
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.{DataFrame, QueryTest, Row}
+import org.apache.spark.sql.test.{SQLTestUtils, SharedSQLContext}
+import org.apache.spark.sql.types._
+
+class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
+  private val carsFile = "cars.csv"
+  private val carsFile8859 = "cars_iso-8859-1.csv"
+  private val carsTsvFile = "cars.tsv"
+  private val carsAltFile = "cars-alternative.csv"
+  private val carsUnbalancedQuotesFile = "cars-unbalanced-quotes.csv"
+  private val carsNullFile = "cars-null.csv"
+  private val emptyFile = "empty.csv"
+  private val commentsFile = "comments.csv"
+  private val disableCommentsFile = "disable_comments.csv"
+
+  private def testFile(fileName: String): String = {
+Thread.currentThread().getContextClassLoader.getResource(fileName).toString
+  }
+
+  /** Verifies data and schema. */
+  private def verifyCars(
+  df: DataFrame,
+  withHeader: Boolean,
+  numCars: Int = 3,
+  numFields: Int = 5,
+  checkHeader: Boolean = true,
+  checkValues: Boolean = true,
+  checkTypes: Boolean = false): Unit = {
+
+val numColumns = numFields
+val numRows = if (withHeader) numCars else numCars + 1
+// schema
+assert(df.schema.fieldNames.length === numColumns)
+assert(df.collect().length === numRows)
+
+if (checkHeader) {
+  if (withHeader) {
+assert(df.schema.fieldNames === Array("year", "make", "model", "comment", "blank"))
+  } else {
+assert(df.schema.fieldNames === Array("C0", "C1", "C2", "C3", "C4"))
+  }
+}
+
+if (checkValues) {
+  val yearValues = List("2012", "1997", "2015")
+  val actualYears = if (!withHeader) "year" :: yearValues else yearValues
+  val years = if (withHeader) df.select("year").collect() else df.select("C0").collect()
+
+  years.zipWithIndex.foreach { case (year, index) =>
+if (checkTypes) {
+  assert(year === Row(actualYears(index).toInt))
+} else {
+  assert(year === Row(actualYears(index)))
+}
+  }
+}
+  }
+
+  test("simple csv test") {
+val cars = sqlContext
+  .read
+  .format("csv")
+  .option("header", "false")
+  .load(testFile(carsFile))
+
+verifyCars(cars, withHeader = false, checkTypes = false)
+  }
+
+  test("simple csv test with type inference") {
+val cars = sqlContext
+  .read
+  .format("csv")
+  .option("header", "true")
+  .option("inferSchema", "true")
+  .load(testFile(carsFile))
+
+verifyCars(cars, withHeader = true, checkTypes = true)
+  }
+
+  test("test with alternative delimiter and quote") {
+val cars = sqlContext.read
+  .format("csv")
+  .options(Map("quote" -> "\'", "delimiter" -> "|", "header" -> "true"))
+  .load(testFile(carsAltFile))
+
+verifyCars(cars, withHeader = true)
+  }
+
+  test("bad encoding name") {
+val exception = intercept[UnsupportedCharsetException] {
+  sqlContext
+.read
+.format("csv")
+.option("charset", "1-9588-osi")
+.load(testFile(carsFile8859))
+}
+
+assert(exception.getMessage.contains("1-9588-osi"))
+  }
+
+  ignore("test different encoding") {
+

[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

2016-01-07 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/10615#discussion_r49147610
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala
 ---
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.csv
+
+import java.io.{OutputStreamWriter, ByteArrayOutputStream, StringReader}
+
+import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, CsvParser, CsvWriter}
+
+import org.apache.spark.Logging
+
+/**
+  * Read and parse CSV-like input
+  *
+  * @param params Parameters object
+  * @param headers headers for the columns
+  */
+private[sql] abstract class CsvReader(params: CSVParameters, headers: Seq[String]) {
+
+  protected lazy val parser: CsvParser = {
+val settings = new CsvParserSettings()
+val format = settings.getFormat
+format.setDelimiter(params.delimiter)
+format.setLineSeparator(params.rowSeparator)
+format.setQuote(params.quote)
+format.setQuoteEscape(params.escape)
+format.setComment(params.comment)
+settings.setIgnoreLeadingWhitespaces(params.ignoreLeadingWhiteSpaceFlag)
+settings.setIgnoreTrailingWhitespaces(params.ignoreTrailingWhiteSpaceFlag)
+settings.setReadInputOnSeparateThread(false)
+settings.setInputBufferSize(params.inputBufferSize)
+settings.setMaxColumns(params.maxColumns)
+settings.setNullValue(params.nullValue)
+settings.setMaxCharsPerColumn(params.maxCharsPerColumn)
+if (headers != null) settings.setHeaders(headers: _*)
+
+new CsvParser(settings)
+  }
+}
+
+/**
+  * Converts a sequence of string to CSV string
+  *
+  * @param params Parameters object for configuration
+  * @param headers headers for columns
+  */
+private[sql] class LineCsvWriter(params: CSVParameters, headers: Seq[String]) extends Logging {
+  private val writerSettings = new CsvWriterSettings
+  private val format = writerSettings.getFormat
+
+  format.setDelimiter(params.delimiter)
+  format.setLineSeparator(params.rowSeparator)
+  format.setQuote(params.quote)
+  format.setQuoteEscape(params.escape)
+  format.setComment(params.comment)
+
+  writerSettings.setNullValue(params.nullValue)
+  writerSettings.setEmptyValue(params.nullValue)
+  writerSettings.setSkipEmptyLines(true)
+  writerSettings.setQuoteAllFields(false)
+  writerSettings.setHeaders(headers: _*)
+
+  def writeRow(row: Seq[String], includeHeader: Boolean): String = {
+val buffer = new ByteArrayOutputStream()
+val outputWriter = new OutputStreamWriter(buffer)
+val writer = new CsvWriter(outputWriter, writerSettings)
+
+if (includeHeader) {
+  writer.writeHeaders()
+}
+writer.writeRow(row.toArray: _*)
+writer.close()
+buffer.toString.stripLineEnd
+  }
+}
+
+/**
+  * Parser for parsing a line at a time. Not efficient for bulk data.
+  *
+  * @param params Parameters object
+  */
+private[sql] class LineCsvReader(params: CSVParameters)
+  extends CsvReader(params, null) {
+  /**
+* parse a line
+*
+* @param line a String with no newline at the end
+* @return array of strings where each string is a field in the CSV record
+*/
+  def parseLine(line: String): Array[String] = {
+parser.beginParsing(new StringReader(line))
+val parsed = parser.parseNext()
+parser.stopParsing()
+parsed
+  }
+}
+
+/**
+  * Parser for parsing lines in bulk. Use this when efficiency is desired.
+  *
+  * @param iter iterator over lines in the file
+  * @param params Parameters object
+  * @param headers headers for the columns
+  */
+private[sql] class BulkCsvReader(
+iter: Iterator[String],
+params: CSVParameters,

[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49149092
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
 'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

Ah, it seems that the JavaModelWrapper call method being used won't work on the
workers. I'll have to port the predict method over.
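
As a rough illustration of what a ported predict could look like (a sketch only, not the code from this PR): once the cluster centers are available as plain Python data, assigning a point to a cluster needs no JVM call, so the function can be shipped to the workers inside a map. The names `squared_distance`, `nearest_center` and `centers` are hypothetical, and the flat nearest-center search is an assumption; the actual bisecting k-means model may walk its cluster tree instead.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(centers, point):
    """Return the index of the center closest to point (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i], point))


# Hypothetical centers, chosen as the means of the two obvious groups in the docstring data.
centers = [[0.5, 0.5], [8.5, 8.5]]
points = [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]]

# With an RDD, the equivalent would pass a function to map, e.g.
# rdd.map(lambda p: nearest_center(centers, p)), rather than the result of a call.
assignments = [nearest_center(centers, p) for p in points]
print(assignments)  # [0, 0, 1, 1]
```

Note that the quoted line `x.map(self.predict(x))` passes the result of calling `predict` to `map` instead of a function, so it would need to change regardless of where the prediction runs.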



[GitHub] spark pull request: [SPARK-12688][SQL] Fix spill size metric in un...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10634#issuecomment-169857794
  
**[Test build #48992 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48992/consoleFull)**
 for PR 10634 at commit 
[`416d73d`](https://github.com/apache/spark/commit/416d73d954155ebff8f5f75c99cbfc61a24ad818).



[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10650#issuecomment-169857832
  
**[Test build #48991 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48991/consoleFull)**
 for PR 10650 at commit 
[`87fc0ff`](https://github.com/apache/spark/commit/87fc0ffb67e6538b2b850e0fd36ba6e2c63fc549).



[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...

2016-01-07 Thread felixcheung

Github user felixcheung commented on the pull request:

https://github.com/apache/spark/pull/10652#issuecomment-169861623
  
jenkins, retest this please



[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...

2016-01-07 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10635#issuecomment-169861548
  
Does this actually let you use one source to compile against both versions 
of Spark?



[GitHub] spark pull request: [SPARK-12507][Streaming][Document]Expose close...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10453#issuecomment-16984
  
**[Test build #48980 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48980/consoleFull)**
 for PR 10453 at commit 
[`28a750d`](https://github.com/apache/spark/commit/28a750d61c058e537a8ca44babb3ff0f4b54f3b3).



[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

2016-01-07 Thread iyounus

Github user iyounus commented on a diff in the pull request:

https://github.com/apache/spark/pull/10274#discussion_r49140607
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -94,8 +110,7 @@ private[ml] class WeightedLeastSquares(
   if (standardizeFeatures) {
 lambda *= aVar(j - 2)
   }
-  if (standardizeLabel) {
-// TODO: handle the case when bStd = 0
+  if (standardizeLabel && bStd != 0) {
--- End diff --

@dbtsai The problem here is that for regularized regression in R, I need to 
use `glmnet`. But for this specific case (constant label, no intercept, and no 
regularization) the results from `glmnet` do not match those from `lm`, so I see 
a discrepancy within R itself. Have a look at the following R code:

```
library(glmnet)

A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b <- c(17, 17, 17, 17)
w <- c(1, 2, 3, 4)
df <- as.data.frame(cbind(A, b))

# Weighted least squares without an intercept
lm.model <- lm(b ~ . -1, data=df, weights=w)
print(as.vector(coef(lm.model)))
# [1] -9.221298  3.394343

# Same problem via glmnet with no regularization (lambda=0)
glm.model <- glmnet(A, b, weights=w, intercept=FALSE, lambda=0,
                    standardize=FALSE, alpha=0, thresh=1E-14)
print(as.vector(coef(glm.model)))
# [1] 0 0 0
```

Note that in this example I expect the same results from both `lm` and 
`glmnet`, because I've set `lambda=0` in `glmnet`. (BTW, `standardize` has no 
effect here.) It seems to me that `glmnet` just sets all coefficients to zero 
if the label is constant and the intercept is not included. This is true even 
if I include regularization.

Right now `WeightedLeastSquares` (without regularization) matches `lm`, and I 
think this is the correct behaviour given my understanding of the normal 
equation. With regularization, it should still give some non-zero 
coefficients, which it does. I don't know why `glmnet` behaves differently, but 
I don't think we should try to match that in this particular case.
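
The `lm` coefficients quoted above can be checked by hand from the weighted normal equation `(A^T W A) x = A^T W b`. The sketch below is plain Python, independent of both Spark and R; the helper name `weighted_normal_equation_2d` is made up for illustration, and it handles only the two-feature, no-intercept, no-regularization case by solving the 2x2 system with Cramer's rule.

```python
def weighted_normal_equation_2d(rows, labels, weights):
    """Solve (A^T W A) x = A^T W b for two features and no intercept."""
    # Entries of the symmetric 2x2 matrix A^T W A.
    s11 = sum(w * r[0] * r[0] for r, w in zip(rows, weights))
    s12 = sum(w * r[0] * r[1] for r, w in zip(rows, weights))
    s22 = sum(w * r[1] * r[1] for r, w in zip(rows, weights))
    # Entries of the vector A^T W b.
    t1 = sum(w * r[0] * y for r, y, w in zip(rows, labels, weights))
    t2 = sum(w * r[1] * y for r, y, w in zip(rows, labels, weights))
    # Cramer's rule for the 2x2 system.
    det = s11 * s22 - s12 * s12
    return ((t1 * s22 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Same data as the R snippet; matrix() fills column by column.
rows = [(0, 5), (1, 7), (2, 11), (3, 13)]
labels = [17, 17, 17, 17]
weights = [1, 2, 3, 4]

coefficients = weighted_normal_equation_2d(rows, labels, weights)
print(coefficients)  # approximately (-9.221298, 3.394343)
```

The result agrees with the `lm` output to the printed precision, which supports the point that the non-zero coefficients are what the normal equation gives for a constant label without an intercept.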



[GitHub] spark pull request: [SPARK-11923][ML] Python API for ml.feature.Ch...

2016-01-07 Thread thunterdb

Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10186#issuecomment-169840549
  
LGTM cc @jkbradley 



[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10653#issuecomment-169842587
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48984/



[GitHub] spark pull request: [SPARK-10873] Support column sort and search f...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10648#issuecomment-169842729
  
**[Test build #48976 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48976/consoleFull)**
 for PR 10648 at commit 
[`4322851`](https://github.com/apache/spark/commit/4322851fa7a253e7422c8f910d96a0f99a3728cd).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...

2016-01-07 Thread maropu

Github user maropu commented on the pull request:

https://github.com/apache/spark/pull/10635#issuecomment-169842632
  
retest this please



[GitHub] spark pull request: [SPARK-12576][SQL] Enable expression parsing i...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10649#issuecomment-169846366
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48983/



[GitHub] spark pull request: [SPARK-12576][SQL] Enable expression parsing i...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10649#issuecomment-169846245
  
**[Test build #48983 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48983/consoleFull)**
 for PR 10649 at commit 
[`c2b35b7`](https://github.com/apache/spark/commit/c2b35b7efdd80ab4930b46a437bb9289c87b5206).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12576][SQL] Enable expression parsing i...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10649#issuecomment-169846362
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

2016-01-07 Thread HyukjinKwon

Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/10615#issuecomment-169855191
  
Cool!



[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10653#issuecomment-169859848
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-12604] [CORE] Addendum - use casting vs...

2016-01-07 Thread rxin

Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10641#issuecomment-169859912
  
Alright merging this. Thanks.




[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10653#issuecomment-169859850
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48986/



[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10651#issuecomment-169836663
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48982/



[GitHub] spark pull request: [SPARK-12507][Streaming][Document]Expose close...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10453#issuecomment-169836865
  
**[Test build #48980 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48980/consoleFull)**
 for PR 10453 at commit 
[`28a750d`](https://github.com/apache/spark/commit/28a750d61c058e537a8ca44babb3ff0f4b54f3b3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10651#issuecomment-169836657
  
**[Test build #48982 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48982/consoleFull)**
 for PR 10651 at commit 
[`9582e49`](https://github.com/apache/spark/commit/9582e49a5a5a5de2aed3c56adbd6ec54651115b4).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread thunterdb

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49140159
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
 'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

Also, maybe you can add a test for this case in the docstring.



[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread thunterdb

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49140117
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

I am not sure I understand this line, shouldn't it be `x.map(self.predict)`?
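
To illustrate the difference, here is a minimal sketch that runs without Spark. `FakeRDD` and `ToyModel` are hypothetical stand-ins (not PySpark classes) that only mimic the dispatch pattern in `predict`. Passing the bound method `self.predict` lets `map` call it once per element, whereas `self.predict(x)` would call `predict` again on the whole RDD and recurse without end.

```python
class FakeRDD:
    """Hypothetical stand-in for an RDD; only map() and collect() are modeled."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        # Apply f to each element, as RDD.map does conceptually.
        return FakeRDD(f(item) for item in self.items)

    def collect(self):
        return list(self.items)


class ToyModel:
    """Hypothetical 1-D model: predicts the index of the nearest center."""

    def __init__(self, centers):
        self.centers = centers

    def predict(self, x):
        if isinstance(x, FakeRDD):
            # Pass the function itself; map() calls it per element.
            # Writing x.map(self.predict(x)) would re-enter this branch forever.
            return x.map(self.predict)
        return min(range(len(self.centers)),
                   key=lambda i: abs(self.centers[i] - x))


model = ToyModel([0.0, 10.0])
labels = model.predict(FakeRDD([1.0, 9.0, 4.0])).collect()
print(labels)
```

A docstring test for the RDD case could follow the same shape: predict on a small parallelized collection and compare the collected labels.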



[GitHub] spark pull request: [SPARK-12638] [API DOC] Parameter explanation ...

2016-01-07 Thread Wenpei

Github user Wenpei commented on the pull request:

https://github.com/apache/spark/pull/10587#issuecomment-169839655
  
@srowen it passes the tests now and is ready for merge.

Thanks for the review.

Wenpei



[GitHub] spark pull request: [SPARK-12510][Streaming]Refactor ActorReceiver...

2016-01-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10457



[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...

2016-01-07 Thread ehsanmok

Github user ehsanmok commented on the pull request:

https://github.com/apache/spark/pull/9916#issuecomment-169842052
  
@mengxr @srowen @jkbradley  Why is reviewing this simple change, which is 
important for my application, taking so long?



[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10652#issuecomment-169842103
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....

2016-01-07 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10650#issuecomment-169844423
  
test this please



[GitHub] spark pull request: [SPARK-9835] [ML] IterativelyReweightedLeastSq...

2016-01-07 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r49144794
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.linalg.BLAS._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+val coefficients: DenseVector,
+val intercept: Double) extends Serializable
+
+/**
+ * Fits a generalized linear model (GLM) for a given family using
+ * iteratively reweighted least squares (IRLS).
+ */
+private[ml] class IterativelyReweightedLeastSquares(
+val family: Family,
+val fitIntercept: Boolean,
+val regParam: Double,
+val standardizeFeatures: Boolean,
+val standardizeLabel: Boolean,
+val maxIter: Int,
+val tol: Double) extends Logging with Serializable {
+
+  def fit(instances: RDD[Instance]): 
IterativelyReweightedLeastSquaresModel = {
+
+val y = instances.map(_.label).persist(StorageLevel.MEMORY_AND_DISK)
+val yMean = y.reduce(_ + _) / y.count()
+var mu = y.map { yi => family.startingMu(yi, yMean) }
+var eta = mu.map { mu => family.link.link(mu) }
--- End diff --

Pre-computing `eta` here seems unnecessary since it is re-assigned within 
the while loop before it is used.



[GitHub] spark pull request: [SPARK-12510][Streaming]Refactor ActorReceiver...

2016-01-07 Thread tdas

Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/10457#issuecomment-169855024
  
I was having trouble with setting up the Exceeded Github API rate limit. 
Thanks for merging.



[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....

2016-01-07 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10650#issuecomment-169854996
  
test this please



[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169855045
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48979/
Test PASSed.



[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169855043
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169854898
  
**[Test build #48979 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48979/consoleFull)**
 for PR 10609 at commit 
[`4e4e9a1`](https://github.com/apache/spark/commit/4e4e9a136ffae30665979df7307a6175188690f7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10653#issuecomment-169859598
  
**[Test build #48986 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48986/consoleFull)**
 for PR 10653 at commit 
[`ade6f5d`](https://github.com/apache/spark/commit/ade6f5d354985f3778e0c8c2da80679c76495f0a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12692][BUILD] Scala style: check no whi...

2016-01-07 Thread sarutak

Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/10643#issuecomment-169861163
  
Warnings are displayed as follows.

```
[warn] 
/home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberConverter.scala:125:29:
 Space before token ,
[warn] 
/home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:52:20:
 Space before token :
[warn] 
/home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:119:23:
 Space before token :
[warn] 
/home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:389:22:
 Space before token :
[warn] 
/home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala:206:39:
 Space before token ,
```



[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10635#issuecomment-169861160
  
**[Test build #48988 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48988/consoleFull)**
 for PR 10635 at commit 
[`8bdd481`](https://github.com/apache/spark/commit/8bdd48189f96a45db54bc8d11e16107b0d15318f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...

2016-01-07 Thread tgravescs

Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/10651#issuecomment-169833274
  
Jenkins, test this please



[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10472#issuecomment-169837135
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48977/
Test PASSed.



[GitHub] spark pull request: [SPARK-12576][SQL] Enable expression parsing i...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10649#issuecomment-169838962
  
**[Test build #48983 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48983/consoleFull)**
 for PR 10649 at commit 
[`c2b35b7`](https://github.com/apache/spark/commit/c2b35b7efdd80ab4930b46a437bb9289c87b5206).



[GitHub] spark pull request: [SPARK-10873] Support column sort and search f...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10648#issuecomment-169842836
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48976/
Test FAILed.



[GitHub] spark pull request: [SPARK-10873] Support column sort and search f...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10648#issuecomment-169842834
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10650#issuecomment-169845842
  
**[Test build #48989 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48989/consoleFull)**
 for PR 10650 at commit 
[`87fc0ff`](https://github.com/apache/spark/commit/87fc0ffb67e6538b2b850e0fd36ba6e2c63fc549).



[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2016-01-07 Thread gracew

Github user gracew commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r49144128
  
--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,53 @@
   Finer-grained cache persistence levels.
 
 """
+import os
+import re
+import sys
+
+from os.path import isfile, join
+
+import xml.etree.ElementTree as ET
+
+if os.environ.get("SPARK_HOME") is None:
+raise ImportError("Environment variable SPARK_HOME is undefined.")
+
+spark_home = os.environ['SPARK_HOME']
+pom_xml_file_path = join(spark_home, 'pom.xml')
+snapshot_version = None
+
+if isfile(pom_xml_file_path):
+try:
+tree = ET.parse(pom_xml_file_path)
+root = tree.getroot()
+version_tag = root[4].text
+snapshot_version = version_tag[:5]
+except:
+raise ImportError("Could not read the spark version, because 
pom.xml file" +
+  " could not be read.")
+else:
+try:
+lib_file_path = join(spark_home, "lib")
--- End diff --

@alope107, would you mind updating this PR to remove the pom_xml_file_path 
branch? Thanks!



[GitHub] spark pull request: [SPARK-9835] [ML] IterativelyReweightedLeastSq...

2016-01-07 Thread sethah

Github user sethah commented on the pull request:

https://github.com/apache/spark/pull/10639#issuecomment-169848523
  
@yanboliang Could you post a link to a reference paper? I find 
documentation on IRLS scattered, so it would be nice to have something concrete 
to point to.



[GitHub] spark pull request: [SPARK-12701] [CORE] FileAppender should use j...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10654#issuecomment-169852147
  
**[Test build #48990 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48990/consoleFull)**
 for PR 10654 at commit 
[`d937d09`](https://github.com/apache/spark/commit/d937d09f3f5aab96361cee93d0a376c25c72).



[GitHub] spark pull request: [SPARK-12604] [CORE] Addendum - use casting vs...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10641#issuecomment-169853728
  
**[Test build #2351 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2351/consoleFull)**
 for PR 10641 at commit 
[`377fb49`](https://github.com/apache/spark/commit/377fb49a677f7f81699a7a9c05195cec9503af2b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10635#issuecomment-169861338
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48988/
Test PASSed.



[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10635#issuecomment-169861337
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-9819][Streaming][Documentation] Clarify...

2016-01-07 Thread tdas

Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/8103#issuecomment-169832635
  
Sorry, I forgot about this PR completely. Just one more nit that I commented 
on.



[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10651#issuecomment-169832596
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-9819][Streaming][Documentation] Clarify...

2016-01-07 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/8103#discussion_r49138016
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala
 ---
@@ -336,7 +336,8 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, 
V)])(
* However, it is applicable to only "invertible reduce functions".
* Hash partitioning is used to generate the RDDs with Spark's default 
number of partitions.
* @param reduceFunc associative reduce function
-   * @param invReduceFunc inverse function
+   * @param invReduceFunc inverse function; such that for all x, 
invertible y:
+   *  `invReduceFunc(reduceFunc(x, y), y) = x`
--- End diff --

Why not
reduceFunc("x", "y") = "xy"   ... y is always added to right
inverseReduceFunc("xy", "x") = "y"... x is always removed from left
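
As a rough illustration of that string example, here is a sketch of an incrementally maintained sliding window over batches. It is plain Python, not Spark Streaming code; `reduce_func`, `inv_reduce_func`, and `sliding_windows` are hypothetical names, and the sketch assumes each new batch is appended on the right while the oldest batch is removed from the left.

```python
def reduce_func(acc, new):
    # The new batch is always added on the right.
    return acc + new


def inv_reduce_func(acc, old):
    # The oldest batch is always removed from the left.
    assert acc.startswith(old), "old value must be a prefix of the accumulator"
    return acc[len(old):]


def sliding_windows(batches, window):
    """Return the reduced value of each full window, updated incrementally."""
    results = []
    acc = None
    for i, batch in enumerate(batches):
        # Fold in the batch that just entered the window.
        acc = batch if acc is None else reduce_func(acc, batch)
        # Undo the batch that just left the window.
        if i >= window:
            acc = inv_reduce_func(acc, batches[i - window])
        if i >= window - 1:
            results.append(acc)
    return results


windows = sliding_windows(["a", "b", "c", "d"], window=2)
print(windows)
```

String concatenation is associative but not commutative, so the example only works because the side on which values are added and removed is fixed.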




[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...

2016-01-07 Thread felixcheung

GitHub user felixcheung opened a pull request:

https://github.com/apache/spark/pull/10652

[SPARK-12699][SPARKR] R driver process should start in a clean state

Currently the R worker process is launched with the `--vanilla` option, which 
brings it up in a clean state (without init profile or workspace data; see 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html). However, 
the R process for the Spark driver is not.

We should do the same for the driver because:
1. It would make the driver consistent with the worker process in R. For 
instance, a library would no longer be loaded in the driver but not in the worker.
2. SparkR depends on `.libPaths()` and `.First()`, so it could be broken by 
something in the user workspace, for example.

Here are the changes proposed:
1. When starting `sparkR` shell (except: allow save/restore workspace, 
since the driver/shell is local)
2. When launching R driver in cluster mode
3. In cluster mode, when calling R to install shipped R package

This is discussed in PR #10171
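
As a rough illustration of the three cases listed above, the sketch below builds an R command line in Python. It relies on the description in `R --help` that `--vanilla` combines `--no-save`, `--no-restore`, `--no-site-file`, `--no-init-file` and `--no-environ`. The function name `r_command` and the exact flags chosen for the shell case are assumptions made for illustration, not necessarily what the patch does.

```python
# Flags that R's --vanilla option stands for, per `R --help`.
VANILLA_FLAGS = [
    "--no-save",
    "--no-restore",
    "--no-site-file",
    "--no-init-file",
    "--no-environ",
]


def r_command(executable, interactive_shell):
    """Build a hypothetical argv for starting R in a clean state."""
    if interactive_shell:
        # Assumed shell behaviour: skip profile and environment files,
        # but leave workspace save/restore enabled for the local user.
        flags = [f for f in VANILLA_FLAGS if f not in ("--no-save", "--no-restore")]
    else:
        # Cluster-mode driver and package installation: fully clean start.
        flags = ["--vanilla"]
    return [executable] + flags


print(r_command("R", interactive_shell=True))
print(r_command("Rscript", interactive_shell=False))
```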

@shivaram @sun-rui 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/felixcheung/spark rvanilla

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10652.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10652


commit c3488c9eda1f731c24769f20eb570d97e4aa5939
Author: felixcheung 
Date:   2016-01-07T09:13:54Z

add R command line options

commit 24fee57e42beec3315979b8db4d817474bcd4baa
Author: felixcheung 
Date:   2016-01-07T22:40:50Z

allow save/restore user workspace when running shell





[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10651#issuecomment-169832598
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48978/
Test FAILed.



[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10651#issuecomment-169836393
  
**[Test build #48982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48982/consoleFull)** for PR 10651 at commit [`9582e49`](https://github.com/apache/spark/commit/9582e49a5a5a5de2aed3c56adbd6ec54651115b4).



[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-07 Thread thunterdb

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10570#discussion_r49139498
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala ---
@@ -57,8 +57,8 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks {
     (result, expected) match {
       case (result: Array[Byte], expected: Array[Byte]) =>
         java.util.Arrays.equals(result, expected)
-      case (result: Double, expected: Spread[Double]) =>
-        expected.isWithin(result)
+      case (result: Double, expected: Spread[_]) => // Can't use Spread[Double] b/c of erasure
--- End diff --

I see. Sadly, I think this is not going to work here without extra work, 
and even then it is not going to do what you want. This version of scalatest 
uses a manifest to encode type information, and you would have to define it 
manually in this context:
```scala
implicit val x: Manifest[Int] = ???
stream shouldBe a [ReceiverInputDStream[Int @unchecked]]
```
but then the scalatest library is not aware of the `unchecked` annotation, 
and still emits a warning. Let's just have `_` in the suite file.



[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10651#issuecomment-169836662
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...

2016-01-07 Thread thunterdb

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10602#discussion_r49140273
  
--- Diff: python/pyspark/mllib/fpm.py ---
@@ -130,15 +133,21 @@ def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=320
 """
 Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
 
-:param data: The input data set, each element contains a sequnce of itemsets.
-:param minSupport: the minimal support level of the sequential pattern, any pattern appears
-more than  (minSupport * size-of-the-dataset) times will be output (default: `0.1`)
-:param maxPatternLength: the maximal length of the sequential pattern, any pattern appears
-less than maxPatternLength will be output. (default: `10`)
-:param maxLocalProjDBSize: The maximum number of items (including delimiters used in
-the internal storage format) allowed in a projected database before local
-processing. If a projected database exceeds this size, another
-iteration of distributed prefix growth is run. (default: `3200`)
+:param data:
+  The input data set, each element contains a sequnce of itemsets.
+:param minSupport:
+  The minimal support level of the sequential pattern, any pattern appears more than
--- End diff --

the lines below have indentation issues



[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...

2016-01-07 Thread thunterdb

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10602#discussion_r49140295
  
--- Diff: python/pyspark/mllib/recommendation.py ---
@@ -239,6 +239,17 @@ def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative
 product of two lower-rank matrices of a given rank (number of features). To solve for these
 features, we run a given number of iterations of ALS. This is done using a level of
 parallelism given by `blocks`.
+   
+   :param iterations:
--- End diff --

indentation issues?



[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...

2016-01-07 Thread felixcheung

Github user felixcheung commented on the pull request:

https://github.com/apache/spark/pull/10652#issuecomment-169839126
  
jenkins, retest this please



[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-07 Thread zsxwing

Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169840672
  
By the way, I will send a separate PR for branch 1.6 because of conflicts in 
MimaExcludes.scala.



[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10653#issuecomment-169842586
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10653#issuecomment-169842552
  
**[Test build #48984 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48984/consoleFull)** for PR 10653 at commit [`a38d623`](https://github.com/apache/spark/commit/a38d623d7d57709f2f26b1189ff699c02bd0ca57).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169843683
  
Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169843686
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48987/
Test FAILed.



[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...

2016-01-07 Thread zsxwing

Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/10609#issuecomment-169843942
  
retest this please




[GitHub] spark pull request: [SPARK-12701] [CORE] FileAppender should use j...

2016-01-07 Thread BryanCutler

GitHub user BryanCutler opened a pull request:

https://github.com/apache/spark/pull/10654

[SPARK-12701] [CORE]  FileAppender should use join to ensure writing thread 
completion

Changed the logging FileAppender to use join in `awaitTermination` to ensure 
that the writing thread has properly finished before returning.
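
The idea can be sketched in Python with `threading.Thread.join()`. This is only an analogue of the pattern described above, not the Scala code in the patch: the class `LineAppender` and its method names are hypothetical.

```python
import threading


class LineAppender:
    """Hypothetical appender that copies lines from a source to a sink
    on a background thread."""

    def __init__(self, source, sink):
        self._source = source
        self._sink = sink
        self._thread = threading.Thread(target=self._append_all, daemon=True)
        self._thread.start()

    def _append_all(self):
        # Runs on the writing thread until the source is exhausted.
        for line in self._source:
            self._sink.append(line)

    def await_termination(self):
        # join() returns only after the writing thread has exited, so every
        # line has been appended by the time the caller continues.
        self._thread.join()


sink = []
appender = LineAppender(iter(["first\n", "second\n"]), sink)
appender.await_termination()
print(sink)
```

Joining the thread ties the caller's return directly to the thread's exit, rather than to a separately maintained flag signalled with wait/notifyAll.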

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/BryanCutler/spark 
fileAppender-join-thread-SPARK-12701

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10654.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10654


commit d937d09f3f5aab96361cee93d0a376c25c72
Author: Bryan Cutler 
Date:   2016-01-08T00:19:47Z

[SPARK-12701] Changed FileAppender to use join to sync thread completion 
instead of wait/notifyAll





[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...

2016-01-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10238#issuecomment-169850939
  
**[Test build #48964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48964/consoleFull)** for PR 10238 at commit [`123d958`](https://github.com/apache/spark/commit/123d958ba05a36aebb2548f04418153979d243ed).
 * This patch **fails from timeout after a configured wait of `250m`**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10615#issuecomment-169850601
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48975/
Test PASSed.



[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

2016-01-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10615#issuecomment-169850598
  
Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...

2016-01-07 Thread mohitjaggi

Github user mohitjaggi commented on the pull request:

https://github.com/apache/spark/pull/10615#issuecomment-169852430
  
this is great...thanks @falaki 



[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...

2016-01-07 Thread Lewuathe

Github user Lewuathe commented on a diff in the pull request:

https://github.com/apache/spark/pull/9936#discussion_r49147215
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -371,6 +378,103 @@ def test_fit_maximize_metric(self):
         self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1")
 
 
+class RegressorTest(PySparkTestCase):
+
+    def setupData(self):
+        try:
+            self.df
+        except AttributeError:
+            from pyspark.mllib.linalg import Vectors
+            sqlContext = SQLContext(self.sc)
+            self.df = sqlContext.createDataFrame([
+                (1.0, Vectors.dense(1.0)),
+                (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
+
+    def test_linear_regression(self):
+        self.setupData()
+        lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal")
+        model = lr.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_decision_tree_regressor(self):
+        self.setupData()
+        dt = DecisionTreeRegressor(maxDepth=2)
+        model = dt.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_random_forest_regressor(self):
+        self.setupData()
+        rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
+        model = rf.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_gbt_regressor(self):
+        self.setupData()
+        gbt = GBTRegressor(maxIter=5, maxDepth=2)
+        model = gbt.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+
+class ClassificationTest(PySparkTestCase):
+
+    def setupData(self):
+        try:
+            self.df
+        except AttributeError:
+            from pyspark.mllib.linalg import Vectors
--- End diff --

`Vectors` and `StringIndexer` are not used anywhere else. In my view, it is 
better not to widen their scope.



[GitHub] spark pull request: [SPARK-12604] [CORE] Addendum - use casting vs...

2016-01-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10641


