[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...

2015-01-02 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3861#discussion_r22417818
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -998,7 +998,7 @@ class SparkContext(config: SparkConf) extends Logging 
with ExecutorAllocationCli
*/
   @DeveloperApi
   override def requestExecutors(numAdditionalExecutors: Int): Boolean = {
-assert(master.contains("yarn") || dynamicAllocationTesting,
+assert(master.contains("mesos") || master.contains("yarn") || dynamicAllocationTesting,
   "Requesting executors is currently only supported in YARN mode")
--- End diff --

Change this message to be "... only supported in YARN or Mesos modes", and 
the message below





[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...

2015-01-02 Thread elmer-garduno
Github user elmer-garduno commented on the pull request:

https://github.com/apache/spark/pull/3874#issuecomment-68535926
  
I tried that before using `spark.files.userClassPathFirst`, but it resulted 
in a `java.lang.NoClassDefFoundError: org/apache/spark/Partition` ([full stack 
trace](https://gist.github.com/elmer-garduno/e65e3d992357253c6111)), which 
seemed bad enough not to go that way, but maybe someone else here knows the 
correct way to achieve it.
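
For reference, a minimal sketch of how that setting is applied, assuming a standalone driver program on Spark 1.x; the app name is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: prefer jars shipped with the app over Spark's bundled classes.
// As the stack trace above shows, this can also shadow Spark's own classes
// (e.g. org.apache.spark.Partition), so use with care.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("guava-conflict-demo") // hypothetical name
  .set("spark.files.userClassPathFirst", "true")
val sc = new SparkContext(conf)
```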

   





[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3876#issuecomment-68549947
  
  [Test build #24997 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24997/consoleFull)
 for   PR 3876 at commit 
[`e0cf9ef`](https://github.com/apache/spark/commit/e0cf9ef44a7c5b324158325d59acbea7236f9203).
 * This patch merges cleanly.





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-01-02 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/3879#issuecomment-68549708
  
Hi @hxfeng, did you mean to send this in?  I don't see any code change, just 
an empty merge commit.  Would you mind closing this pull request if it was sent 
accidentally?

Thanks!





[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...

2015-01-02 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/3875#issuecomment-68549854
  
Matches error message from 20 lines up, so LGTM





[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md

2015-01-02 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/3876#issuecomment-68549893
  
Jenkins this is ok to test





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3879#issuecomment-68526639
  
Can one of the admins verify this patch?





[GitHub] spark pull request: Branch 1.2

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3880#issuecomment-68527633
  
Can one of the admins verify this patch?





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68572495
  
  [Test build #24998 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24998/consoleFull)
 for   PR 3871 at commit 
[`b4415ea`](https://github.com/apache/spark/commit/b4415ea70055e8ca2c0444cf964b696f0e1e410d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Minor] make-distribution.sh using build/mvn

2015-01-02 Thread brennonyork
Github user brennonyork commented on the pull request:

https://github.com/apache/spark/pull/3867#issuecomment-68568947
  
Looks good to me. As an aside, I remember @pwendell mentioning on the dev 
mailing list that all PRs *should* have an associated JIRA ticket. Is there 
one for this? If not, it might be something you should add and link to. Not sure 
if they'll be closing future PRs without associated JIRAs.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68572497
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24998/
Test PASSed.





[GitHub] spark pull request: [SPARK-3325][Streaming] Add a parameter to the...

2015-01-02 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/3237#issuecomment-68571707
  
The other PR #3865 has been merged. Mind closing this PR? Thanks for all 
the effort!





[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md

2015-01-02 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/3876#issuecomment-68571771
  
Good catch. Merging this. Thanks!





[GitHub] spark pull request: [SPARK-3325][Streaming] Add a parameter to the...

2015-01-02 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/3865#issuecomment-68571673
  
I have merged this. Thanks all!





[GitHub] spark pull request: [SPARK-5058] Updated broken links

2015-01-02 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/3877#issuecomment-68575090
  
Jenkins, this is ok to test.





[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md

2015-01-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3876





[GitHub] spark pull request: [SPARK-3325][Streaming] Add a parameter to the...

2015-01-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3865





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68580983
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25001/
Test PASSed.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22429027
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,69 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution. In
+ * the event that the covariance matrix is singular, the density will be 
computed in a
+ * reduced dimensional subspace under which the distribution is supported.
+ * (see 
http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case)
+ * 
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
 val mu: DBV[Double], 
 val sigma: DBM[Double]) extends Serializable {
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * 
math.pow(det(sigma), -0.5)
-
+
+  /**
+   * Compute distribution dependent constants:
+   *sigmaInv2 = (-1/2) * inv(sigma)
+   *u = (2*pi)^(-k/2) * det(sigma)^(-1/2) 
+   */
+  private val (sigmaInv2: DBM[Double], u: Double) = 
calculateCovarianceConstants
+  
   /** Returns density of this multivariate Gaussian at given point, x */
   def pdf(x: DBV[Double]): Double = {
 val delta = x - mu
-val deltaTranspose = new Transpose(delta)
-U * math.exp(deltaTranspose * sigmaInv2 * delta)
+u * math.exp(delta.t * sigmaInv2 * delta)
+  }
+  
+  /**
+   * Calculate distribution dependent components used for the density 
function:
+   *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * 
(x-mu).t * inv(sigma) * (x-mu) )
+   * where k is length of the mean vector.
+   * 
+   * We here compute distribution-fixed parts 
+   *  (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   * and
+   *  (-1/2) * inv(sigma)
+   *  
+   * Both the determinant and the inverse can be computed from the 
singular value decomposition
+   * of sigma.  Noting that covariance matrices are always symmetric and 
positive semi-definite,
+   * we can use the eigendecomposition.
+   * 
+   * To guard against singular covariance matrices, this method computes 
both the 
+   * pseudo-determinant and the pseudo-inverse (Moore-Penrose).  Singular 
values are considered
+   * to be non-zero only if they exceed a tolerance based on machine 
precision, matrix size, and
+   * relation to the maximum singular value (same tolerance used by, e.g., 
Octave).
+   */
+  private def calculateCovarianceConstants: (DBM[Double], Double) = {
+val eigSym.EigSym(d, u) = eigSym(sigma) // sigma = u * diag(d) * u.t
+
+// For numerical stability, values are considered to be non-zero only 
if they exceed tol.
+// This prevents any inverted value from exceeding (eps * n * 
max(d))^-1
+val tol = MLUtils.EPSILON * max(d) * d.length
+
+// pseudo-determinant is product of all non-zero eigenvalues
+val pdetSigma = d.activeValuesIterator.filter(_ > tol).reduce(_ * _)
--- End diff --

If all singular values are <= tol, then this will throw an 
UnsupportedOperationException.  Could you perhaps catch it and throw a more 
meaningful error if that happens?
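
A minimal sketch of the kind of guard being requested, reusing the diff's Breeze types and assuming `d` holds the eigenvalues and `tol` the cutoff; the helper name and message are hypothetical:

```scala
import breeze.linalg.{DenseVector => DBV}

// Hypothetical guard: surface a meaningful error instead of the bare
// UnsupportedOperationException that reduce throws on an empty iterator
// when no eigenvalue exceeds the tolerance.
def pseudoDeterminant(d: DBV[Double], tol: Double): Double = {
  val nonZero = d.activeValuesIterator.filter(_ > tol).toSeq
  require(nonZero.nonEmpty,
    s"Covariance matrix is numerically singular: no eigenvalue exceeds tol = $tol")
  nonZero.product // pseudo-determinant: product of the non-zero eigenvalues
}
```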





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22429029
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussianSuite.scala
 ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.impl
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{Vectors, Matrices}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class MultivariateGaussianSuite extends FunSuite with 
MLlibTestSparkContext {
+  test("univariate") {
+val x = Vectors.dense(0.0).toBreeze.toDenseVector
+
+val mu = Vectors.dense(0.0).toBreeze.toDenseVector
+var sigma = Matrices.dense(1, 1, Array(1.0)).toBreeze.toDenseMatrix
+var dist = new MultivariateGaussian(mu, sigma)
+assert(dist.pdf(x) ~== 0.39894 absTol 1E-5)
+
+sigma = Matrices.dense(1, 1, Array(4.0)).toBreeze.toDenseMatrix
+dist = new MultivariateGaussian(mu, sigma)
+assert(dist.pdf(x) ~== 0.19947 absTol 1E-5)
+  }
+  
+  test("multivariate") {
+val x = Vectors.dense(0.0, 0.0).toBreeze.toDenseVector
+
+val mu = Vectors.dense(0.0, 0.0).toBreeze. toDenseVector
--- End diff --

typo: space between . and toDenseVector





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22429025
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,69 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution. In
+ * the event that the covariance matrix is singular, the density will be 
computed in a
+ * reduced dimensional subspace under which the distribution is supported.
+ * (see 
http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case)
+ * 
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
 val mu: DBV[Double], 
 val sigma: DBM[Double]) extends Serializable {
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * 
math.pow(det(sigma), -0.5)
-
+
+  /**
+   * Compute distribution dependent constants:
+   *sigmaInv2 = (-1/2) * inv(sigma)
+   *u = (2*pi)^(-k/2) * det(sigma)^(-1/2) 
+   */
+  private val (sigmaInv2: DBM[Double], u: Double) = 
calculateCovarianceConstants
+  
   /** Returns density of this multivariate Gaussian at given point, x */
   def pdf(x: DBV[Double]): Double = {
 val delta = x - mu
-val deltaTranspose = new Transpose(delta)
-U * math.exp(deltaTranspose * sigmaInv2 * delta)
+u * math.exp(delta.t * sigmaInv2 * delta)
+  }
+  
+  /**
+   * Calculate distribution dependent components used for the density 
function:
+   *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * 
(x-mu).t * inv(sigma) * (x-mu) )
+   * where k is length of the mean vector.
+   * 
+   * We here compute distribution-fixed parts 
+   *  (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   * and
+   *  (-1/2) * inv(sigma)
+   *  
+   * Both the determinant and the inverse can be computed from the 
singular value decomposition
+   * of sigma.  Noting that covariance matrices are always symmetric and 
positive semi-definite,
+   * we can use the eigendecomposition.
+   * 
+   * To guard against singular covariance matrices, this method computes 
both the 
+   * pseudo-determinant and the pseudo-inverse (Moore-Penrose).  Singular 
values are considered
+   * to be non-zero only if they exceed a tolerance based on machine 
precision, matrix size, and
+   * relation to the maximum singular value (same tolerance used by, e.g., 
Octave).
+   */
+  private def calculateCovarianceConstants: (DBM[Double], Double) = {
+val eigSym.EigSym(d, u) = eigSym(sigma) // sigma = u * diag(d) * u.t
+
+// For numerical stability, values are considered to be non-zero only 
if they exceed tol.
+// This prevents any inverted value from exceeding (eps * n * 
max(d))^-1
+val tol = MLUtils.EPSILON * max(d) * d.length
+
+// pseudo-determinant is product of all non-zero eigenvalues
--- End diff --

eigenvalues --> singular values (here and in the next comment on line 79)





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22429030
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussianSuite.scala
 ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.impl
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{Vectors, Matrices}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class MultivariateGaussianSuite extends FunSuite with 
MLlibTestSparkContext {
+  test("univariate") {
+val x = Vectors.dense(0.0).toBreeze.toDenseVector
+
+val mu = Vectors.dense(0.0).toBreeze.toDenseVector
+var sigma = Matrices.dense(1, 1, Array(1.0)).toBreeze.toDenseMatrix
+var dist = new MultivariateGaussian(mu, sigma)
+assert(dist.pdf(x) ~== 0.39894 absTol 1E-5)
+
+sigma = Matrices.dense(1, 1, Array(4.0)).toBreeze.toDenseMatrix
+dist = new MultivariateGaussian(mu, sigma)
+assert(dist.pdf(x) ~== 0.19947 absTol 1E-5)
+  }
+  
+  test("multivariate") {
+val x = Vectors.dense(0.0, 0.0).toBreeze.toDenseVector
+
+val mu = Vectors.dense(0.0, 0.0).toBreeze. toDenseVector
+var sigma = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 
1.0)).toBreeze.toDenseMatrix
+var dist = new MultivariateGaussian(mu, sigma)
+assert(dist.pdf(x) ~== 0.15915 absTol 1E-5)
+
+sigma = Matrices.dense(2, 2, Array(4.0, -1.0, -1.0, 
2.0)).toBreeze.toDenseMatrix
+dist = new MultivariateGaussian(mu, sigma)
+assert(dist.pdf(x) ~== 0.060155 absTol 1E-5)
+  }
+  
+  test("multivariate degenerate") {
+val x = Vectors.dense(0.0, 0.0).toBreeze.toDenseVector
+
+val mu = Vectors.dense(0.0, 0.0).toBreeze. toDenseVector
--- End diff --

typo: space between . and toDenseVector





[GitHub] spark pull request: [SPARK-4631] unit test for MQTT

2015-01-02 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/3844#discussion_r22429050
  
--- Diff: 
external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala
 ---
@@ -17,31 +17,114 @@
 
 package org.apache.spark.streaming.mqtt
 
-import org.scalatest.FunSuite
+import java.net.{URI, ServerSocket}
 
-import org.apache.spark.streaming.{Seconds, StreamingContext}
+import org.apache.activemq.broker.{TransportConnector, BrokerService}
+import org.apache.spark.util.Utils
+import org.scalatest.{BeforeAndAfter, FunSuite}
+import org.scalatest.concurrent.Eventually
+import scala.concurrent.duration._
+import org.apache.spark.streaming.{Milliseconds, StreamingContext}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.dstream.ReceiverInputDStream
+import org.eclipse.paho.client.mqttv3._
+import org.eclipse.paho.client.mqttv3.persist.MqttDefaultFilePersistence
 
-class MQTTStreamSuite extends FunSuite {
-
-  val batchDuration = Seconds(1)
+class MQTTStreamSuite extends FunSuite with Eventually with BeforeAndAfter 
{
 
+  private val batchDuration = Milliseconds(500)
   private val master: String = "local[2]"
-
   private val framework: String = this.getClass.getSimpleName
+  private val freePort = findFreePort()
+  private val brokerUri = "//localhost:" + freePort
+  private val topic = "def"
+  private var ssc: StreamingContext = _
+  private val persistenceDir = Utils.createTempDir()
+  private var broker: BrokerService = _
+  private var connector: TransportConnector = _
 
-  test("mqtt input stream") {
-val ssc = new StreamingContext(master, framework, batchDuration)
-val brokerUrl = "abc"
-val topic = "def"
+  before {
+ssc = new StreamingContext(master, framework, batchDuration)
+setupMQTT
+  }
 
-// tests the API, does not actually test data receiving
-val test1: ReceiverInputDStream[String] = MQTTUtils.createStream(ssc, 
brokerUrl, topic)
-val test2: ReceiverInputDStream[String] =
-  MQTTUtils.createStream(ssc, brokerUrl, topic, 
StorageLevel.MEMORY_AND_DISK_SER_2)
+  after {
+if (ssc != null) {
+  ssc.stop()
+  ssc = null
+}
+Utils.deleteRecursively(persistenceDir)
+tearDownMQTT
+  }
 
-// TODO: Actually test receiving data
+  test("mqtt input stream") {
+val sendMessage = "MQTT demo for spark streaming"
+val receiveStream: ReceiverInputDStream[String] =
+  MQTTUtils.createStream(ssc, "tcp:" + brokerUri, topic, StorageLevel.MEMORY_ONLY)
+var receiveMessage: List[String] = List()
+receiveStream.foreachRDD { rdd =>
+  if (rdd.collect.length > 0) {
+receiveMessage = receiveMessage ::: List(rdd.first)
+receiveMessage
+  }
+}
+ssc.start()
+publishData(sendMessage)
+eventually(timeout(1 milliseconds), interval(100 milliseconds)) {
+  assert(sendMessage.equals(receiveMessage(0)))
+}
 ssc.stop()
   }
+
+  private def setupMQTT() {
+broker = new BrokerService()
+connector = new TransportConnector()
+connector.setName("mqtt")
+connector.setUri(new URI("mqtt:" + brokerUri))
+broker.addConnector(connector)
+broker.start()
+  }
+
+  private def tearDownMQTT() {
+if (broker != null) {
+  broker.stop()
+  broker = null
+}
+if (connector != null) {
+  connector.stop()
+  connector = null
+}
+  }
+
+  private def findFreePort(): Int = {
+Utils.startServiceOnPort(23456, (trialPort: Int) => {
+  val socket = new ServerSocket(trialPort)
+  socket.close()
+  (null, trialPort)
+})._2
+  }
+
+  def publishData(data: String): Unit = {
+var client: MqttClient = null
+try {
+  val persistence: MqttClientPersistence = new 
MqttDefaultFilePersistence(persistenceDir.getAbsolutePath)
+  client = new MqttClient("tcp:" + brokerUri, MqttClient.generateClientId(), persistence)
+  client.connect()
+  if (client.isConnected) {
+val msgTopic: MqttTopic = client.getTopic(topic)
+val message: MqttMessage = new MqttMessage(data.getBytes("utf-8"))
+message.setQos(1)
+message.setRetained(true)
+for (i <- 0 to 10)
+  msgTopic.publish(message)
+  }
+} catch {
+  case e: MqttException => println("Exception Caught: " + e)
--- End diff --

Why can there be an exception? And if there is an exception, why is it 
being ignored? Printing and not doing anything is essentially ignoring if the 
unit test 
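
A minimal sketch of the alternative implied here; the wrapper and its message are hypothetical, and the point is only that a publish failure should surface rather than be printed:

```scala
import org.eclipse.paho.client.mqttv3.MqttException

// Hypothetical wrapper: rethrow instead of println, so a failed publish
// fails the test rather than being silently ignored.
def publishOrFail(publish: () => Unit): Unit =
  try publish() catch {
    case e: MqttException =>
      throw new RuntimeException("publishData failed: " + e.getMessage, e)
  }
```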

[GitHub] spark pull request: [SPARK-4835] Disable validateOutputSpecs for S...

2015-01-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3832#issuecomment-68580794
  
@tdas I've updated this PR and added a test case.  My test case uses calls 
inside of a `transform()` call to emulate what Streaming's `saveAsHadoopFiles` 
operation does.  Is this a valid use of `transform()` or am I breaking rules by 
having actions in my transform function?  My gut says that we shouldn't endorse 
/ recommend this for the same reason that we advise against using accumulators 
inside of map() tasks: the transform call might get evaluated multiple times if 
caching isn't used, which makes it possible to write programs whose behavior 
changes depending on whether caching is enabled.
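
To make the hazard concrete, a hedged sketch (not the PR's actual test) of an action run for its side effect inside `transform()`; the helper and path scheme are hypothetical:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Sketch only: saveAsTextFile is an action, so this transform carries a side
// effect. If the transform is re-evaluated, the side effect repeats, the same
// hazard as updating accumulators inside map().
def saveEachBatch(stream: DStream[String], dir: String): DStream[String] =
  stream.transform { rdd: RDD[String] =>
    rdd.saveAsTextFile(dir + "/" + System.currentTimeMillis)
    rdd
  }
```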

I wasn't able to get the existing "recovery with saveAsNewAPIHadoopFiles 
operation" test to fail, though, even though I discovered this bug while 
refactoring that test in my other PR.  I think that the issue is that the 
failed `saveAsNewAPIHadoopFiles` jobs failed but did not trigger a failure of 
the other actions / transformations in that batch, so we still got the correct 
output even though the batch completion event wasn't posted to the listener 
bus.  The current tests rely on wall-clock time to detect when batches have 
been processed and hence didn't detect that the batch completion event was 
missing.  I noticed that the StreamingListener API doesn't really have any 
events for job / batch failures, but that's a topic for a separate PR.

I was about to write that this bug might not actually affect users who 
don't use `transform` but it still impacts users in the partial-failure case 
where they've used PairDStreamFunctions.saveAsHadoopFiles() but a batch fails 
with partially-written output: an individual output _partition_ might be 
atomically committed to the output directory (e.g. if the file exists, then it 
has the right contents), but I think we can still wind up in a scenario where 
only a subset of the partitions are written and the non-empty output directory 
prevents the recovery from recomputing the missing partitions.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22429041
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,69 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
--- End diff --

Use ```/**```





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68580981
  
  [Test build #25001 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25001/consoleFull)
 for   PR 3871 at commit 
[`d448137`](https://github.com/apache/spark/commit/d448137b739691c152dd981f136cef62b65d4e50).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4835] Disable validateOutputSpecs for S...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3832#issuecomment-68580537
  
  [Test build #25003 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25003/consoleFull)
 for   PR 3832 at commit 
[`6485cf8`](https://github.com/apache/spark/commit/6485cf880465cf7bd8e501dc861869be58029995).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68581031
  
@tgaloppo Thanks for the updates.  Sure, the log-space computation could be 
in another PR.

Just to make sure: Did you compute the PDF values in the tests using other 
software?
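
For what it's worth, the univariate and identity-covariance values can be checked by hand against the closed-form density; a quick sketch of the constants the suite asserts:

```scala
// pdf(0)   for N(0, 1):   1 / sqrt(2*pi)    = 0.39894...
// pdf(0)   for N(0, 4):   1 / sqrt(2*pi*4)  = 0.19947...
// pdf(0,0) for N(0, I_2): 1 / (2*pi)        = 0.15915...
val p1 = 1.0 / math.sqrt(2.0 * math.Pi)       // 0.3989422804014327
val p2 = 1.0 / math.sqrt(2.0 * math.Pi * 4.0) // 0.19947114020071635
val p3 = 1.0 / (2.0 * math.Pi)                // 0.15915494309189535
```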





[GitHub] spark pull request: [SPARK-4631] unit test for MQTT

2015-01-02 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/3844#issuecomment-68581068
  
This is almost looking good. A few more comments and we are ready. :)





[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...

2015-01-02 Thread WangTaoTheTonic
GitHub user WangTaoTheTonic opened a pull request:

https://github.com/apache/spark/pull/3875

[SPARK-5057]Add more details in log when using actor to get infos

https://issues.apache.org/jira/browse/SPARK-5057

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WangTaoTheTonic/spark SPARK-5057

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3875.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3875


commit 706c8a7d02a07bfc6b096221777f44eabc36467b
Author: WangTaoTheTonic barneystin...@aliyun.com
Date:   2015-01-02T10:20:41Z

log more messages







[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md

2015-01-02 Thread akhld
GitHub user akhld opened a pull request:

https://github.com/apache/spark/pull/3876

Fixed typos in streaming-kafka-integration.md

Changed "projrect" to "project" :)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/akhld/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3876.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3876


commit e0cf9ef44a7c5b324158325d59acbea7236f9203
Author: Akhil Das ak...@darktech.ca
Date:   2015-01-02T10:32:12Z

Fixed typos in streaming-kafka-integration.md

Changed "projrect" to "project" :)







[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...

2015-01-02 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3874#issuecomment-68514968
  
You're right that I think this is too broad. I think I misspoke earlier. 
Isn't the theory here that you can bring a later version of Optional with you 
in your app?  Spark barely uses its API. If your copy of Optional hides the one 
in Spark, which is only there to keep the signature the same, is that OK?





[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3741#issuecomment-68517965
  
  [Test build #24992 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24992/consoleFull)
 for   PR 3741 at commit 
[`46ad71e`](https://github.com/apache/spark/commit/46ad71ed44df4f1dbea7614ae2057ab1d6207ab4).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3741#issuecomment-68517724
  
  [Test build #24991 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24991/consoleFull)
 for   PR 3741 at commit 
[`1b047e6`](https://github.com/apache/spark/commit/1b047e6cefb652e8ce4d2cf0cbd57bcc84654370).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3875#issuecomment-68518118
  
  [Test build #24993 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24993/consoleFull)
 for   PR 3875 at commit 
[`706c8a7`](https://github.com/apache/spark/commit/706c8a7d02a07bfc6b096221777f44eabc36467b).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3741#issuecomment-68517726
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24991/
Test PASSed.





[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3741#issuecomment-68517967
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24992/
Test PASSed.





[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3876#issuecomment-68518565
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...

2015-01-02 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3861#discussion_r22418308
  
--- Diff: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedMesosExecutorBackend.scala
 ---
@@ -0,0 +1,212 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.executor
+
+import org.apache.spark.{SparkConf, Logging, SecurityManager}
+import org.apache.mesos.{Executor => MesosExecutor, ExecutorDriver, MesosExecutorDriver, MesosNativeLibrary}
+import org.apache.spark.util.{Utils, SignalLogger}
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.mesos.Protos._
+import org.apache.spark.deploy.worker.StandaloneWorkerShuffleService
+import scala.collection.JavaConversions._
+import scala.io.Source
+import java.io.{File, PrintWriter}
+
+/**
+ * The Coarse grained Mesos executor backend is responsible for launching 
the shuffle service
+ * and the CoarseGrainedExecutorBackend actor.
+ * This is assuming the scheduler detected that the shuffle service is 
enabled and launches
+ * this class instead of CoarseGrainedExecutorBackend directly.
+ */
+private[spark] class CoarseGrainedMesosExecutorBackend(val sparkConf: 
SparkConf)
+  extends MesosExecutor
+  with Logging {
+
+  private var shuffleService: StandaloneWorkerShuffleService = null
+  private var driver: ExecutorDriver = null
+  private var executorProc: Process = null
+  private var taskId: TaskID = null
+  @volatile var killed = false
+
+  override def registered(
+  driver: ExecutorDriver,
+  executorInfo: ExecutorInfo,
+  frameworkInfo: FrameworkInfo,
+  slaveInfo: SlaveInfo) {
+this.driver = driver
+logInfo("Coarse Grain Mesos Executor '" + executorInfo.getExecutorId.getValue +
--- End diff --

Grained





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22420650
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+val h1 = if (x == null) 0 else x.hashCode()
+val h2 = if (y == null) 0 else y.hashCode()
+if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
--- End diff --

What is the expected scenario in which a `KeyValueOrdering` is called for 
with `B` unordered? You're setting up `KeyValueOrdering` to be more general 
than your needs for its only current usage in `OrderedValueRDDFunctions`, but 
I'm not quite grasping how and where else you are expecting `KeyValueOrdering` 
to be used.

It's seeming to me that `KeyValueOrdering` should have two ctors: 
```scala
KeyValueOrdering[A, B](keyOrdering: Ordering[A], valueOrdering: Ordering[B])

...

def this(valueOrdering: Ordering[B]) = this(new HashOrdering[A], valueOrdering)
``` 





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22422723
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+val h1 = if (x == null) 0 else x.hashCode()
+val h2 = if (y == null) 0 else y.hashCode()
+if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
+
+  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
+val c1 = ord1.compare(x._1, y._1)
+if (c1 != 0) c1 else ord2.compare(x._2, y._2)
--- End diff --

What happens when `ord1` is `HashOrdering` and `c1 == 0` but `x._1 != 
y._1`?  More generally, what happens when `ord1` isn't actually a full 
ordering? 
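
For a concrete illustration of the hazard (the string literals and values below 
are mine, not from the patch): "Aa" and "BB" are distinct strings with the same 
`String.hashCode`, so `HashOrdering` reports them as equal and the value 
ordering takes over:
```scala
val ord = new KeyValueOrdering[String, Int](None, Some(Ordering.Int))

// Hypothetical data: "Aa".hashCode == "BB".hashCode, so ord1 returns 0 even
// though the keys differ, and the records end up ordered purely by value:
val sorted = Seq(("Aa", 1), ("BB", 2), ("Aa", 3)).sorted(ord)
// sorted == Seq(("Aa", 1), ("BB", 2), ("Aa", 3)): the records for "Aa" are not
// contiguous, i.e. the keys are interleaved in the output.
```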





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22421802
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+val h1 = if (x == null) 0 else x.hashCode()
+val h2 = if (y == null) 0 else y.hashCode()
+if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
--- End diff --

`ExternalSorter#keyComparator` should be refactored to use 
`spark.util.HashOrdering`.
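
A hedged sketch of that refactor (the `ExternalSorter` member below is 
paraphrased from memory, not quoted from the file):
```scala
// Ordering[K] extends java.util.Comparator[K], so the shared utility class can
// be dropped in where ExternalSorter currently builds its hash comparator inline:
private val keyComparator: Comparator[K] = ordering.getOrElse(new HashOrdering[K])
```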





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22422719
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
--- End diff --

This isn't actually true.  The `compare` method only produces a partial 
ordering.  `ExternalSorter#keyComparator` gets away with the `Ordering[K]` 
falsehood only because later passes resolve hash collisions.





[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3876#issuecomment-68557428
  
  [Test build #24997 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24997/consoleFull)
 for   PR 3876 at commit 
[`e0cf9ef`](https://github.com/apache/spark/commit/e0cf9ef44a7c5b324158325d59acbea7236f9203).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3876#issuecomment-68557433
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24997/
Test PASSed.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68562386
  
@tgaloppo Could you please add a description?  It can be based off of the 
JIRA, just enough to cover the main points of the PR.  Thanks!





[GitHub] spark pull request: Branch 1.2

2015-01-02 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/3880#issuecomment-68550117
  
Hi @hxfeng I think this might be an accidental pull request -- merging 1.2 
back into master would be a huge change!

Would you mind closing this PR?  Thanks!





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22428452
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+val h1 = if (x == null) 0 else x.hashCode()
+val h2 = if (y == null) 0 else y.hashCode()
+if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
+
+  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
+val c1 = ord1.compare(x._1, y._1)
+if (c1 != 0) c1 else ord2.compare(x._2, y._2)
--- End diff --

I see two options:
1) Do something similar to what happens in 
ExternalSorter.mergeWithAggregation, where in groupByKeyAndSortValues I am 
aware that I might be processing multiple keys (with the same hashCode) at 
once and check for key equality. This increases the memory requirements
(all values for all keys with the same hashCode have to fit in memory, as 
opposed to all values for a single key).
2) Require an ordering for K which can be used as a tie breaker when the 
hashCodes of the keys are the same, so that I have a total ordering for K (see 
the sketch below).

Thoughts?

I will add a unit test with multiple keys that share the same hashCode.
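
A minimal sketch of option 2 (the class name and wiring here are mine, not part 
of the patch):
```scala
// Hypothetical tie-breaking ordering: hashCode first, then a caller-supplied
// total ordering on K, so equal hashCodes no longer imply "equal keys".
private[spark] class HashThenKeyOrdering[K](tieBreaker: Ordering[K])
  extends Ordering[K] {
  override def compare(x: K, y: K): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1 else if (h1 > h2) 1 else tieBreaker.compare(x, y)
  }
}
```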





[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...

2015-01-02 Thread alexbaretta
Github user alexbaretta commented on a diff in the pull request:

https://github.com/apache/spark/pull/3882#discussion_r22428318
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -269,6 +269,43 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 path, ScalaReflection.attributesFor[A], allowExisting, conf, this))
   }
 
+
+  /**
+   * :: Experimental ::
+   * Creates an empty parquet file with the provided schema. The parquet 
file thus created
+   * can be registered as a table, which can then be used as the target of 
future
+   * `insertInto` operations.
+   *
+   * {{{
+   *   val sqlContext = new SQLContext(...)
+   *   import sqlContext._
+   *
+   *   val schema = StructType(List(StructField("name", 
StringType), StructField("age", IntegerType)))
+   *   createParquetFile(schema, 
"path/to/file.parquet").registerTempTable("people")
+   *   sql("INSERT INTO people SELECT 'michael', 29")
+   * }}}
+   *
+   * @param schema StructType describing the records to be stored in the 
Parquet file.
+   * @param path The path where the directory containing parquet metadata 
should be created.
+   * Data inserted into this table will also be stored at this 
location.
+   * @param allowExisting When false, an exception will be thrown if this 
directory already exists.
+   * @param conf A Hadoop configuration object that can be used to specify 
options to the parquet
+   * output format.
+   *
+   * @group userf
+   */
+  @Experimental
+  def createParquetFile(
--- End diff --

Andrew,

OK, but keep in mind that my patch overloads an existing method. If you
think createParquetFile should be renamed to createEmptyParquetFile, you
should probably file a separate JIRA.

Also, arguably creating a file implies that it is empty.

Alex
On Jan 2, 2015 5:11 PM, "Andrew Ash" <notificati...@github.com> wrote:

 In sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
 https://github.com/apache/spark/pull/3882#discussion-diff-22428199:

  +   *   val schema = StructType(List(StructField("name", 
StringType), StructField("age", IntegerType)))
  +   *   createParquetFile(schema, 
"path/to/file.parquet").registerTempTable("people")
  +   *   sql("INSERT INTO people SELECT 'michael', 29")
  +   * }}}
  +   *
  +   * @param schema StructType describing the records to be stored in 
the Parquet file.
  +   * @param path The path where the directory containing parquet 
metadata should be created.
  +   * Data inserted into this table will also be stored at 
this location.
  +   * @param allowExisting When false, an exception will be thrown if 
this directory already exists.
  +   * @param conf A Hadoop configuration object that can be used to 
specify options to the parquet
  +   * output format.
  +   *
  +   * @group userf
  +   */
  +  @Experimental
  +  def createParquetFile(

 I kind of think createEmptyParquetFile would be a better name for this
 method, since most Parquet files have data I'd think

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/3882/files#r22428199.






[GitHub] spark pull request: [SPARK-5062][Graphx] replace mapReduceTriplets...

2015-01-02 Thread shijinkui
GitHub user shijinkui opened a pull request:

https://github.com/apache/spark/pull/3883

[SPARK-5062][Graphx] replace mapReduceTriplets with aggregateMessage in 
Pregel Api

Since Spark 1.2 introduced aggregateMessages to replace mapReduceTriplets, with 
a real performance improvement, it's time to replace mapReduceTriplets with 
aggregateMessages in Pregel.
I provide a deprecated method for backwards compatibility.

--
I have drawn a diagram of aggregateMessages to show why it can improve the 
performance.


![graphx_aggreate_msg](https://cloud.githubusercontent.com/assets/648508/5601161/0444efdc-932b-11e4-8944-8e132339be9b.jpg)
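
A rough sketch of the shape of the change, with a toy vertex attribute and sum 
combiner (the function name, the graph `g`, and the attribute types below are 
placeholders, not code from this PR):
```scala
import org.apache.spark.graphx._

def sumOfSourceAttrs(g: Graph[Int, Int]): VertexRDD[Int] = {
  // Old style: g.mapReduceTriplets[Int](t => Iterator((t.dstId, t.srcAttr)), _ + _)
  // New style below: TripletFields.Src declares that sendMsg reads only
  // source-vertex fields, so GraphX ships less vertex data during the join.
  g.aggregateMessages[Int](
    ctx => ctx.sendToDst(ctx.srcAttr),  // sendMsg
    _ + _,                              // mergeMsg
    TripletFields.Src)
}
```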


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shijinkui/spark pregel_agg

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3883.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3883


commit 93ae74bc5c9011719775e9862f257c2e81a9
Author: 玄畅 <jinkui@alibaba-inc.com>
Date:   2015-01-01T02:43:27Z

change  mapReduceTriplets to aggregateMessages of Pregel API

commit d2519e235c53c8ee53c5f127cf680585f139eb0c
Author: 玄畅 <jinkui@alibaba-inc.com>
Date:   2015-01-01T03:21:30Z

change  mapReduceTriplets to aggregateMessages of Pregel API







[GitHub] spark pull request: [SPARK-5058] Updated broken links

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3877#issuecomment-68577996
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24999/
Test FAILed.





[GitHub] spark pull request: [SPARK-5058] Updated broken links

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3877#issuecomment-68577995
  
  [Test build #24999 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24999/consoleFull)
 for   PR 3877 at commit 
[`3e19b31`](https://github.com/apache/spark/commit/3e19b31890f8317550c28b60edc3f5ea3137776c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5062][Graphx] replace mapReduceTriplets...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3883#issuecomment-68578615
  
Can one of the admins verify this patch?





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68578626
  
  [Test build #25001 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25001/consoleFull)
 for   PR 3871 at commit 
[`d448137`](https://github.com/apache/spark/commit/d448137b739691c152dd981f136cef62b65d4e50).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3882#issuecomment-68577765
  
  [Test build #25000 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25000/consoleFull)
 for   PR 3882 at commit 
[`f6e40b5`](https://github.com/apache/spark/commit/f6e40b50c4aca9372c51d1337d559fc9cf50108d).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3861#issuecomment-68580156
  
  [Test build #25002 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25002/consoleFull)
 for   PR 3861 at commit 
[`a8d036c`](https://github.com/apache/spark/commit/a8d036cf6ec4b8b1fa621a4da955f3274517e41f).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3882#issuecomment-68577767
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25000/
Test FAILed.





[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...

2015-01-02 Thread tnachen
Github user tnachen commented on the pull request:

https://github.com/apache/spark/pull/3861#issuecomment-68580094
  
@ash211 Thanks for the review, updated the PR.





[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3741#issuecomment-68514411
  
  [Test build #24992 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24992/consoleFull)
 for   PR 3741 at commit 
[`46ad71e`](https://github.com/apache/spark/commit/46ad71ed44df4f1dbea7614ae2057ab1d6207ab4).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3741#issuecomment-68514220
  
  [Test build #24991 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24991/consoleFull)
 for   PR 3741 at commit 
[`1b047e6`](https://github.com/apache/spark/commit/1b047e6cefb652e8ce4d2cf0cbd57bcc84654370).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22423972
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,62 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, 
det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, 
eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution
+ * 
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
 val mu: DBV[Double], 
 val sigma: DBM[Double]) extends Serializable {
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * 
math.pow(det(sigma), -0.5)
-
+
+  private val (sigmaInv2, u) = calculateCovarianceConstants
+  
   /** Returns density of this multivariate Gaussian at given point, x */
   def pdf(x: DBV[Double]): Double = {
 val delta = x - mu
-val deltaTranspose = new Transpose(delta)
-U * math.exp(deltaTranspose * sigmaInv2 * delta)
+u * math.exp(delta.t * sigmaInv2 * delta)
+  }
+  
+  /*
+   * Calculate distribution dependent components used for the density 
function:
+   *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * 
(x-mu).t * inv(sigma) * (x-mu) )
+   * where k is length of the mean vector.
+   * 
+   * We here compute distribution-fixed parts 
+   *  (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   * and
+   *  (-1/2) * inv(sigma)
+   *  
+   * Both the determinant and the inverse can be computed from the 
singular value decomposition
+   * of sigma.  Noting that covariance matrices are always symmetric and 
positive semi-definite,
+   * we can use the eigendecomposition (breeze provides one specifically 
for symmetric matrices,
+   * so I am making an assumption here that there is some efficiency gain).
+   * 
+   * To guard against singular covariance matrices, this method computes 
both the 
+   * pseudo-determinant and the pseudo-inverse (Moore-Penrose).  Singular 
values are considered
+   * to be non-zero only if they exceed a tolerance based on machine 
precision, matrix size, and
+   * relation to the maximum singular value (same tolerance used by, ie, 
Octave).
+   */
+  private def calculateCovarianceConstants: (DBM[Double], Double) = {
+val eigSym.EigSym(d, u) = eigSym(sigma) // sigma = u * diag(d) * u.t
+
+// For numerical stability, values are considered to be non-zero only 
if they exceed tol.
+// This prevents any inverted value from exceeding (eps * n * 
max(d))^-1
+val tol = MLUtils.EPSILON * max(d) * d.length
+
+// pseudo-determinant is product of all non-zero eigenvalues
+val pdetSigma = (0 until d.length).map(i => if (d(i) > tol) d(i) else 
1.0).reduce(_ * _)
+
+// calculate pseudo-inverse by inverting all non-zero eigenvalues
+val pinvS = new DBV((0 until d.length).map(i => if (d(i) > tol) (1.0 / 
d(i)) else 0.0).toArray)
--- End diff --

This too can be more concise.
You generally do not need to use the ```(0 until length).map``` pattern 
unless you need the indices; it is easier to map the values of an array like d 
directly.
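
For instance, a hedged one-liner in that spirit (assuming `d` is a Breeze 
`DenseVector[Double]`, with `tol` and the `DBV` alias as in the patch):
```scala
// Map over the eigenvalues themselves; no index bookkeeping needed.
val pinvS = new DBV(d.toArray.map(v => if (v > tol) 1.0 / v else 0.0))
```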





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22423967
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,62 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, 
det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, 
eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution
+ * 
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
 val mu: DBV[Double], 
 val sigma: DBM[Double]) extends Serializable {
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * 
math.pow(det(sigma), -0.5)
-
+
+  private val (sigmaInv2, u) = calculateCovarianceConstants
+  
   /** Returns density of this multivariate Gaussian at given point, x */
   def pdf(x: DBV[Double]): Double = {
 val delta = x - mu
-val deltaTranspose = new Transpose(delta)
-U * math.exp(deltaTranspose * sigmaInv2 * delta)
+u * math.exp(delta.t * sigmaInv2 * delta)
+  }
+  
+  /*
+   * Calculate distribution dependent components used for the density 
function:
+   *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * 
(x-mu).t * inv(sigma) * (x-mu) )
+   * where k is length of the mean vector.
+   * 
+   * We here compute distribution-fixed parts 
+   *  (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   * and
+   *  (-1/2) * inv(sigma)
+   *  
+   * Both the determinant and the inverse can be computed from the 
singular value decomposition
+   * of sigma.  Noting that covariance matrices are always symmetric and 
positive semi-definite,
+   * we can use the eigendecomposition (breeze provides one specifically 
for symmetric matrices,
+   * so I am making an assumption here that there is some efficiency gain).
+   * 
+   * To guard against singular covariance matrices, this method computes 
both the 
+   * pseudo-determinant and the pseudo-inverse (Moore-Penrose).  Singular 
values are considered
+   * to be non-zero only if they exceed a tolerance based on machine 
precision, matrix size, and
+   * relation to the maximum singular value (same tolerance used by, ie, 
Octave).
--- End diff --

ie --> e.g.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22423970
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,62 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, 
det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, 
eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution
+ * 
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
 val mu: DBV[Double], 
 val sigma: DBM[Double]) extends Serializable {
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * 
math.pow(det(sigma), -0.5)
-
+
+  private val (sigmaInv2, u) = calculateCovarianceConstants
+  
   /** Returns density of this multivariate Gaussian at given point, x */
   def pdf(x: DBV[Double]): Double = {
 val delta = x - mu
-val deltaTranspose = new Transpose(delta)
-U * math.exp(deltaTranspose * sigmaInv2 * delta)
+u * math.exp(delta.t * sigmaInv2 * delta)
+  }
+  
+  /*
+   * Calculate distribution dependent components used for the density 
function:
+   *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * 
(x-mu).t * inv(sigma) * (x-mu) )
+   * where k is length of the mean vector.
+   * 
+   * We here compute distribution-fixed parts 
+   *  (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   * and
+   *  (-1/2) * inv(sigma)
+   *  
+   * Both the determinant and the inverse can be computed from the 
singular value decomposition
+   * of sigma.  Noting that covariance matrices are always symmetric and 
positive semi-definite,
+   * we can use the eigendecomposition (breeze provides one specifically 
for symmetric matrices,
+   * so I am making an assumption here that there is some efficiency gain).
+   * 
+   * To guard against singular covariance matrices, this method computes 
both the 
+   * pseudo-determinant and the pseudo-inverse (Moore-Penrose).  Singular 
values are considered
+   * to be non-zero only if they exceed a tolerance based on machine 
precision, matrix size, and
+   * relation to the maximum singular value (same tolerance used by, ie, 
Octave).
+   */
+  private def calculateCovarianceConstants: (DBM[Double], Double) = {
+val eigSym.EigSym(d, u) = eigSym(sigma) // sigma = u * diag(d) * u.t
+
+// For numerical stability, values are considered to be non-zero only 
if they exceed tol.
+// This prevents any inverted value from exceeding (eps * n * 
max(d))^-1
+val tol = MLUtils.EPSILON * max(d) * d.length
+
+// pseudo-determinant is product of all non-zero eigenvalues
+val pdetSigma = (0 until d.length).map(i => if (d(i) > tol) d(i) else 
1.0).reduce(_ * _)
--- End diff --

More concise:
```
val pdetSigma = d.activeValuesIterator.filter(_ > tol).foldLeft(1.0)(_ * _)
```






[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22423959
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,62 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, 
det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, 
eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution
--- End diff --

Perhaps you could add a note here about how this behaves when sigma is 
singular, plus a reference like 
[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]

The doc could be a short version of what you have below for 
```calculateCovarianceConstants```





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22423964
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,62 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, 
det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, 
eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution
+ * 
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
 val mu: DBV[Double], 
 val sigma: DBM[Double]) extends Serializable {
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * 
math.pow(det(sigma), -0.5)
-
+
+  private val (sigmaInv2, u) = calculateCovarianceConstants
+  
   /** Returns density of this multivariate Gaussian at given point, x */
   def pdf(x: DBV[Double]): Double = {
 val delta = x - mu
-val deltaTranspose = new Transpose(delta)
-U * math.exp(deltaTranspose * sigmaInv2 * delta)
+u * math.exp(delta.t * sigmaInv2 * delta)
+  }
+  
+  /*
+   * Calculate distribution dependent components used for the density 
function:
+   *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * 
(x-mu).t * inv(sigma) * (x-mu) )
+   * where k is length of the mean vector.
+   * 
+   * We here compute distribution-fixed parts 
+   *  (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   * and
+   *  (-1/2) * inv(sigma)
+   *  
+   * Both the determinant and the inverse can be computed from the 
singular value decomposition
+   * of sigma.  Noting that covariance matrices are always symmetric and 
positive semi-definite,
+   * we can use the eigendecomposition (breeze provides one specifically 
for symmetric matrices,
--- End diff --

No need to comment on Breeze here; you can mention it in the PR description if 
it's in question.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22423961
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,62 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, 
det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, 
eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution
+ * 
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
 val mu: DBV[Double], 
 val sigma: DBM[Double]) extends Serializable {
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * 
math.pow(det(sigma), -0.5)
-
+
+  private val (sigmaInv2, u) = calculateCovarianceConstants
--- End diff --

Could you please add explicit types + documentation here for clarity?
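
For example, something along these lines (the doc wording is mine):
```scala
/**
 * sigmaInv2 is (-1/2) * pinv(sigma); u is the normalizing constant
 * (2*pi)^(-k/2) * pdet(sigma)^(-1/2).  See calculateCovarianceConstants.
 */
private val (sigmaInv2: DBM[Double], u: Double) = calculateCovarianceConstants
```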





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3871#discussion_r22423962
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala
 ---
@@ -17,23 +17,62 @@
 
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, 
det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, 
eigSym}
 
-/** 
-   * Utility class to implement the density function for multivariate 
Gaussian distribution.
-   * Breeze provides this functionality, but it requires the Apache 
Commons Math library,
-   * so this class is here so-as to not introduce a new dependency in 
Spark.
-   */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian 
(Normal) Distribution
+ * 
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
 val mu: DBV[Double], 
 val sigma: DBM[Double]) extends Serializable {
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * 
math.pow(det(sigma), -0.5)
-
+
+  private val (sigmaInv2, u) = calculateCovarianceConstants
+  
   /** Returns density of this multivariate Gaussian at given point, x */
   def pdf(x: DBV[Double]): Double = {
 val delta = x - mu
-val deltaTranspose = new Transpose(delta)
-U * math.exp(deltaTranspose * sigmaInv2 * delta)
+u * math.exp(delta.t * sigmaInv2 * delta)
+  }
+  
+  /*
--- End diff --

Use ```/**``` for class/method documentation.





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22423736
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+val h1 = if (x == null) 0 else x.hashCode()
+val h2 = if (y == null) 0 else y.hashCode()
+if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
+
+  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
+val c1 = ord1.compare(x._1, y._1)
+if (c1 != 0) c1 else ord2.compare(x._2, y._2)
--- End diff --

Good point, that doesn't look right. It could lead to keys being interleaved 
in the output.





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22423573
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+val h1 = if (x == null) 0 else x.hashCode()
+val h2 = if (y == null) 0 else y.hashCode()
+if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
--- End diff --

Yeah, that's right. I copied it from another pull request of mine that needed a 
more general version. I can simplify it.





[GitHub] spark pull request: Allow spark-daemon.sh to support foreground op...

2015-01-02 Thread hellertime
GitHub user hellertime opened a pull request:

https://github.com/apache/spark/pull/3881

Allow spark-daemon.sh to support foreground operation

Add a `--foreground` option to spark-daemon.sh to prevent the process from 
daemonizing itself. Useful when running under a watchdog that waits on its 
child process.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hellertime/spark 
feature/no-daemon-spark-daemon

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3881.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3881


commit 358400fb4f87f2e6de791a116bfd64c5a31f9d39
Author: Chris Heller <hellert...@gmail.com>
Date:   2014-12-29T19:28:53Z

Allow spark-daemon.sh to support foreground operation







[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68565593
  
@tgaloppo @mengxr  What are your thoughts about doing the computation in 
log space as much as possible, and then exponentiating at the end?  I'm mainly 
thinking about numerical stability, but I could imagine wanting to provide 
pdf() and logpdf() methods eventually.
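
A hedged sketch of that shape, reusing names from this patch plus an assumed 
precomputed `logU = math.log(u)`:
```scala
/** Log-density of this multivariate Gaussian at x; exponentiate only at the end. */
def logpdf(x: DBV[Double]): Double = {
  val delta = x - mu
  // sigmaInv2 already carries the -1/2 factor, so this quadratic term is <= 0.
  logU + delta.t * (sigmaInv2 * delta)
}

def pdf(x: DBV[Double]): Double = math.exp(logpdf(x))
```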





[GitHub] spark pull request: Allow spark-daemon.sh to support foreground op...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3881#issuecomment-68565114
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3861#issuecomment-68582212
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25002/
Test PASSed.





[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3861#issuecomment-68582211
  
  [Test build #25002 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25002/consoleFull)
 for   PR 3861 at commit 
[`a8d036c`](https://github.com/apache/spark/commit/a8d036cf6ec4b8b1fa621a4da955f3274517e41f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `"  cd %s*; %s ./bin/spark-class ".format(basename, prefixEnv)`






[GitHub] spark pull request: [SPARK-4014] Add TaskContext.attemptNumber and...

2015-01-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3849#issuecomment-68581193
  
Maybe that test was flaky; let's see if it passes again (it'll retest since 
I pushed a commit to fix a merge conflict).

I've updated this patch to not modify `attemptId` but to introduce 
`attemptNumber` and deprecate `attemptId`.  I think it will be confusing to 
have `attemptId` have different behavior in different branches, especially 
since it seems like functionality that might be nice to rely on when writing 
certain types of regression tests.  Since this patch doesn't change any 
behavior, I'd like to backport it to maintenance branches so that we can rely 
on it in test code.  If we decide to do that, the committer should update the 
MiMa exclusions on cherry-pick.
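
A rough sketch of the deprecation shape being described (the version string and exact
signatures below are assumptions, not the actual Spark source):

```scala
// Sketch only: attemptNumber carries the new, well-defined meaning, while
// attemptId is kept (and deprecated) so existing callers keep working and
// behavior stays identical across maintenance branches.
abstract class TaskContextSketch {
  /** How many times this task has been attempted; 0 for the first attempt. */
  def attemptNumber(): Int

  /** Old accessor, retained for source compatibility. Version string assumed. */
  @deprecated("use attemptNumber()", "1.3.0")
  def attemptId(): Long
}
```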





[GitHub] spark pull request: [SPARK-794][Core] Remove sleep() in ClusterSch...

2015-01-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3851#issuecomment-68581430
  
@tdas, what do you think about merging this?  It looks like you were the 
last one to touch this line in 27311b13321ba60ee1324b86234f0aaf63df9f67.  This 
fixed `sleep()` seems race-prone enough that, if it were actually necessary for 
anything, I suppose we would have noticed by now, because it would have caused test 
flakiness.
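
For reference, the usual alternative to a fixed sleep in this kind of test code is to
poll the condition with a bounded timeout, e.g. via ScalaTest's Eventually (the
predicate name below is hypothetical):

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Instead of Thread.sleep(...) and hoping the scheduler has caught up,
// retry the assertion until it holds or a deadline passes.
// `schedulerIsReady` is a hypothetical predicate for illustration.
def awaitScheduler(schedulerIsReady: () => Boolean): Unit = {
  eventually(timeout(10.seconds), interval(100.millis)) {
    assert(schedulerIsReady())
  }
}
```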





[GitHub] spark pull request: [SPARK-4835] Disable validateOutputSpecs for S...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3832#issuecomment-68582506
  
  [Test build #25003 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25003/consoleFull)
 for   PR 3832 at commit 
[`6485cf8`](https://github.com/apache/spark/commit/6485cf880465cf7bd8e501dc861869be58029995).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4014] Add TaskContext.attemptNumber and...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3849#issuecomment-68581165
  
  [Test build #25004 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25004/consoleFull)
 for   PR 3849 at commit 
[`eee6a45`](https://github.com/apache/spark/commit/eee6a4569d926d3ada2ca259ddb04906392688ae).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread tgaloppo
Github user tgaloppo commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68581355
  
@jkbradley I used Octave's mvnpdf from the statistics package for the 
non-singular cases; it cannot handle singular covariance matrices, so for those I 
had to recreate the function using Octave's pinv() function.  
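
For readers following along, the pinv-based density generalizes the usual formula by
replacing the inverse with the Moore-Penrose pseudo-inverse and the determinant with
the pseudo-determinant. A hedged Breeze sketch (the tolerance choice is an assumption,
and this is not the PR's code):

```scala
import breeze.linalg.{pinv, svd, DenseMatrix => BDM, DenseVector => BDV}

// Log-density of N(mu, sigma) that tolerates a singular covariance:
// pinv replaces inv, and the pseudo-determinant (product of singular
// values above a tolerance) replaces det. Sketch only.
def logpdfSingular(x: BDV[Double], mu: BDV[Double], sigma: BDM[Double]): Double = {
  val svd.SVD(_, s, _) = svd(sigma)
  val tol = 1e-9 * s.toArray.max           // tolerance is an assumption
  val nonZero = s.toArray.filter(_ > tol)
  val logPseudoDet = nonZero.map(math.log).sum
  val rank = nonZero.length                // effective dimension
  val delta = x - mu
  val maha = delta dot (pinv(sigma) * delta)
  -0.5 * (rank * math.log(2.0 * math.Pi) + logPseudoDet + maha)
}
```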





[GitHub] spark pull request: [SPARK-4835] Disable validateOutputSpecs for S...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3832#issuecomment-68582508
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25003/
Test PASSed.





[GitHub] spark pull request: [SPARK-794][Core] Remove sleep() in ClusterSch...

2015-01-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3851#issuecomment-68581466
  
I'm inclined to merge this into `master` now and not perform any backports 
right away (maybe it's still serving some purpose in older branches?).





[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...

2015-01-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3875#issuecomment-68581252
  
I suppose it'd be nice to use string interpolation here, but the old code didn't 
use it either, so this matches the surrounding style.
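
For anyone unfamiliar with the suggestion, the two styles look like this (values are
illustrative):

```scala
val appId = "app-20150102"

// Old style: string concatenation, as in the surrounding code.
val msg1 = "Registered application " + appId + " with the scheduler"

// Suggested style: Scala's s-interpolator.
val msg2 = s"Registered application $appId with the scheduler"
```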





[GitHub] spark pull request: [SPARK-4631] unit test for MQTT

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3844#issuecomment-68525514
  
  [Test build #24994 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24994/consoleFull)
 for   PR 3844 at commit 
[`04503cf`](https://github.com/apache/spark/commit/04503cfa7f8168038c17198b6e45b16b89591e74).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class MQTTStreamSuite extends FunSuite with Eventually with 
BeforeAndAfter `






[GitHub] spark pull request: [SPARK-4631] unit test for MQTT

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3844#issuecomment-68525516
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24994/
Test PASSed.





[GitHub] spark pull request: Updated broken links

2015-01-02 Thread sigmoidanalytics
GitHub user sigmoidanalytics opened a pull request:

https://github.com/apache/spark/pull/3877

Updated broken links

Updated the broken link pointing to the KafkaWordCount example to the 
correct one.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sigmoidanalytics/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3877.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3877


commit 3e19b31890f8317550c28b60edc3f5ea3137776c
Author: sigmoidanalytics ma...@sigmoidanalytics.com
Date:   2015-01-02T10:44:34Z

Updated broken links

Updated the broken link pointing to the KafkaWordCount example to the 
correct one.







[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-01-02 Thread hxfeng
GitHub user hxfeng opened a pull request:

https://github.com/apache/spark/pull/3879

Merge pull request #1 from apache/master

update

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hxfeng/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3879.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3879


commit b3ee640ffa59ed14fdeba61d5bf53b9b8e6cc520
Author: hxfeng 980548...@qq.com
Date:   2014-12-28T03:51:35Z

Merge pull request #1 from apache/master

update







[GitHub] spark pull request: [SPARK-2165][YARN]add support for setting maxA...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3878#issuecomment-68525645
  
  [Test build #24995 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24995/consoleFull)
 for   PR 3878 at commit 
[`afdfc99`](https://github.com/apache/spark/commit/afdfc99e2722ac3a910de91dbf0c80972e7f7eb9).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4631] unit test for MQTT

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3844#issuecomment-68525682
  
  [Test build #24996 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24996/consoleFull)
 for   PR 3844 at commit 
[`4b34ee7`](https://github.com/apache/spark/commit/4b34ee784e7c9c489cf0c22d73311c160bc67c47).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2165][YARN]add support for setting maxA...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3878#issuecomment-68525649
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24995/
Test PASSed.





[GitHub] spark pull request: [SPARK-4631] unit test for MQTT

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3844#issuecomment-68525686
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24996/
Test PASSed.





[GitHub] spark pull request: Updated broken links

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3877#issuecomment-68519009
  
Can one of the admins verify this patch?





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-01-02 Thread hxfeng
Github user hxfeng commented on the pull request:

https://github.com/apache/spark/pull/3879#issuecomment-68526458
  
update





[GitHub] spark pull request: [SPARK-5050] Add unit test for sqdist

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3869#discussion_r22424322
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala ---
@@ -175,6 +175,42 @@ class VectorsSuite extends FunSuite {
     assert(v.size === x.rows)
   }
 
+  test("sqdist") {
+    val a = (30 to 0 by -1).map(math.pow(2.0, _)).toArray
+    val n = a.length
+    val v1 = Vectors.dense(a)
+    for (m <- 0 until n) {
+      val indices = (0 to m).toArray
+      val values = indices.map(i => a(i))
+      val v2 = Vectors.sparse(n, indices, values)
+      val v3 = Vectors.sparse(n, indices, indices.map(i => a(i) + 0.5))
+
+      // DenseVector vs. SparseVector
+      val squaredDist = breezeSquaredDistance(v1.toBreeze, v2.toBreeze)
+      val fastSquaredDist1 = Vectors.sqdist(v1, v2)
+      assert(fastSquaredDist1 == squaredDist)
+
+      // DenseVector vs. DenseVector
+      val fastSquaredDist2 = Vectors.sqdist(v1, Vectors.dense(v2.toArray))
+      assert(fastSquaredDist2 === squaredDist)
+
+      // SparseVector vs. SparseVector
+      val squaredDist2 = breezeSquaredDistance(v2.toBreeze, v3.toBreeze)
+      val fastSquaredDist3 = Vectors.sqdist(v2, v3)
+      assert(fastSquaredDist3 === squaredDist2)
+
+      // SparseVector vs. SparseVector: with values at different indices
+      if (m > 10) {
+        val v4 = Vectors.sparse(n, indices.slice(0, m - 10),
+          indices.map(i => a(i) + 0.5).slice(0, m - 10))
+        val squaredDist = breezeSquaredDistance(v2.toBreeze, v4.toBreeze)
+        val fastSquaredDist =
--- End diff --

Can fit on 1 line instead of 2





[GitHub] spark pull request: [SPARK-5050] Add unit test for sqdist

2015-01-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3869#discussion_r22424333
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala ---
@@ -175,6 +175,42 @@ class VectorsSuite extends FunSuite {
     assert(v.size === x.rows)
   }
 
+  test("sqdist") {
+    val a = (30 to 0 by -1).map(math.pow(2.0, _)).toArray
+    val n = a.length
+    val v1 = Vectors.dense(a)
+    for (m <- 0 until n) {
+      val indices = (0 to m).toArray
+      val values = indices.map(i => a(i))
+      val v2 = Vectors.sparse(n, indices, values)
+      val v3 = Vectors.sparse(n, indices, indices.map(i => a(i) + 0.5))
+
+      // DenseVector vs. SparseVector
+      val squaredDist = breezeSquaredDistance(v1.toBreeze, v2.toBreeze)
+      val fastSquaredDist1 = Vectors.sqdist(v1, v2)
+      assert(fastSquaredDist1 == squaredDist)
--- End diff --

```==``` can be ```===```





[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...

2015-01-02 Thread alexbaretta
GitHub user alexbaretta opened a pull request:

https://github.com/apache/spark/pull/3882

[SPARK-5061][Alex Baretta] SQLContext: overload createParquetFile

Overload of createParquetFile taking a StructType instead of a TypeTag

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/alexbaretta/spark createParquetFile

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3882.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3882


commit f6e40b50c4aca9372c51d1337d559fc9cf50108d
Author: Alex Baretta a...@planalechmy.com
Date:   2014-12-27T02:29:29Z

[Alex Baretta] SQLContext: overload createParquetFile

Overload taking a StructType instead of TypeTag
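
A sketch of what such an overload might look like (parameter names, defaults, and
package paths are assumptions based on the existing TypeTag-based method, not
necessarily the PR's exact signature):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.{SQLContext, SchemaRDD}
import org.apache.spark.sql.catalyst.types.StructType

// Hypothetical overload: create an empty Parquet file from an explicit
// runtime schema, so callers are not forced to have a compile-time TypeTag.
def createParquetFile(
    sqlContext: SQLContext,
    schema: StructType,
    path: String,
    allowExisting: Boolean = true,
    conf: Configuration = new Configuration()): SchemaRDD = {
  // Sketch only; a real implementation would mirror the TypeTag variant,
  // writing Parquet metadata for `schema` at `path`.
  ???
}
```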







[GitHub] spark pull request: [SPARK-5050] Add unit test for sqdist

2015-01-02 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3869#issuecomment-68566521
  
@viirya Looks OK to me, except for the tiny comments.  Thanks!

At some point, it might be nice to replace these tests with ones using 
random dense & sparse vectors (with random sparsity patterns).  If you are 
interested in doing that, I can send you a method for generating random sparse 
vectors which I used for the timing tests.
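
One possible shape for such a generator (a sketch under assumed conventions, not the
method @jkbradley is offering):

```scala
import scala.util.Random
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Random sparse vector of length n: each index is active with
// probability `density`, and active values are standard Gaussians.
def randomSparseVector(n: Int, density: Double, rng: Random): Vector = {
  val indices = (0 until n).filter(_ => rng.nextDouble() < density).toArray
  val values = indices.map(_ => rng.nextGaussian())
  Vectors.sparse(n, indices, values)
}
```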





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68566334
  
  [Test build #24998 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24998/consoleFull)
 for   PR 3871 at commit 
[`b4415ea`](https://github.com/apache/spark/commit/b4415ea70055e8ca2c0444cf964b696f0e1e410d).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68565659
  
@tgaloppo The logic looks good; my comments are basically about clarity 
(except for the log space question).  Thanks for the PR!





[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...

2015-01-02 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3871#issuecomment-68565800
  
One more request: could you please add a unit test with a singular matrix, perhaps 
in a new suite for MultivariateGaussian?  Thank you!
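
A minimal sketch of the kind of test being asked for (the suite name and the checked
property are assumptions, not the PR's code):

```scala
import breeze.linalg.{pinv, DenseMatrix => BDM, DenseVector => BDV}
import org.scalatest.FunSuite

class MultivariateGaussianSuite extends FunSuite {
  test("density is well-defined for a singular covariance matrix") {
    val mu = BDV(0.0, 0.0)
    val sigma = BDM((1.0, 1.0), (1.0, 1.0))  // rank 1: det == 0, inv would fail
    val delta = BDV(0.5, 0.5) - mu
    // A pinv-based Mahalanobis term stays finite where inv(sigma) would not.
    val maha = delta dot (pinv(sigma) * delta)
    assert(!maha.isNaN && !maha.isInfinite)
  }
}
```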





[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...

2015-01-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3882#issuecomment-68567852
  
Can one of the admins verify this patch?




