[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/3861#discussion_r22417818

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -998,7 +998,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   */
  @DeveloperApi
  override def requestExecutors(numAdditionalExecutors: Int): Boolean = {
-    assert(master.contains("yarn") || dynamicAllocationTesting,
+    assert(master.contains("mesos") || master.contains("yarn") || dynamicAllocationTesting,
      "Requesting executors is currently only supported in YARN mode")
--- End diff --

Change this message to be "... only supported in YARN or Mesos modes", and the message below.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...
Github user elmer-garduno commented on the pull request: https://github.com/apache/spark/pull/3874#issuecomment-68535926 I tried that before using `spark.files.userClassPathFirst`, but it resulted in a `java.lang.NoClassDefFoundError: org/apache/spark/Partition` ([full stack trace](https://gist.github.com/elmer-garduno/e65e3d992357253c6111)), which seemed bad enough to not go that way, but maybe someone else here knows the correct way to achieve it.
[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3876#issuecomment-68549947 [Test build #24997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24997/consoleFull) for PR 3876 at commit [`e0cf9ef`](https://github.com/apache/spark/commit/e0cf9ef44a7c5b324158325d59acbea7236f9203).
* This patch merges cleanly.
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/3879#issuecomment-68549708 Hi @hxfeng did you mean to send this in? I don't see any code change, just an empty merge commit. Would you mind closing this pull request if it was sent accidentally? Thanks!
[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/3875#issuecomment-68549854 Matches error message from 20 lines up, so LGTM
[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/3876#issuecomment-68549893 Jenkins this is ok to test
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3879#issuecomment-68526639 Can one of the admins verify this patch?
[GitHub] spark pull request: Branch 1.2
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3880#issuecomment-68527633 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68572495 [Test build #24998 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24998/consoleFull) for PR 3871 at commit [`b4415ea`](https://github.com/apache/spark/commit/b4415ea70055e8ca2c0444cf964b696f0e1e410d).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [Minor] make-distribution.sh using build/mvn
Github user brennonyork commented on the pull request: https://github.com/apache/spark/pull/3867#issuecomment-68568947 Looks good to me. As an aside, I remember @pwendell mentioning on the dev mailing list that all PRs *should* have an associated JIRA ticket. Is there one for this? If not, it might be something you should add and link to. Not sure if they'll be closing future PRs without associated JIRAs.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68572497 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24998/ Test PASSed.
[GitHub] spark pull request: [SPARK-3325][Streaming] Add a parameter to the...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/3237#issuecomment-68571707 The other PR #3865 has been merged. Mind closing this PR? Thanks for all the effort!
[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/3876#issuecomment-68571771 Good catch. Merging this. Thanks!
[GitHub] spark pull request: [SPARK-3325][Streaming] Add a parameter to the...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/3865#issuecomment-68571673 I have merged this. Thanks all!
[GitHub] spark pull request: [SPARK-5058] Updated broken links
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/3877#issuecomment-68575090 Jenkins, this is ok to test.
[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3876
[GitHub] spark pull request: [SPARK-3325][Streaming] Add a parameter to the...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3865
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68580983 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25001/ Test PASSed.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22429027

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala ---
@@ -17,23 +17,69 @@
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym}
 
-/**
- * Utility class to implement the density function for multivariate Gaussian distribution.
- * Breeze provides this functionality, but it requires the Apache Commons Math library,
- * so this class is here so-as to not introduce a new dependency in Spark.
- */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution. In
+ * the event that the covariance matrix is singular, the density will be computed in a
+ * reduced dimensional subspace under which the distribution is supported.
+ * (see http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case)
+ *
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
     val mu: DBV[Double],
     val sigma: DBM[Double]) extends Serializable {
 
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5)
+
+  /**
+   * Compute distribution dependent constants:
+   *    sigmaInv2 = (-1/2) * inv(sigma)
+   *    u = (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   */
+  private val (sigmaInv2: DBM[Double], u: Double) = calculateCovarianceConstants
 
   /** Returns density of this multivariate Gaussian at given point, x */
   def pdf(x: DBV[Double]): Double = {
     val delta = x - mu
-    val deltaTranspose = new Transpose(delta)
-    U * math.exp(deltaTranspose * sigmaInv2 * delta)
+    u * math.exp(delta.t * sigmaInv2 * delta)
   }
+
+  /**
+   * Calculate distribution dependent components used for the density function:
+   *    pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * (x-mu).t * inv(sigma) * (x-mu) )
+   * where k is length of the mean vector.
+   *
+   * We here compute distribution-fixed parts
+   *    (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   * and
+   *    (-1/2) * inv(sigma)
+   *
+   * Both the determinant and the inverse can be computed from the singular value decomposition
+   * of sigma. Noting that covariance matrices are always symmetric and positive semi-definite,
+   * we can use the eigendecomposition.
+   *
+   * To guard against singular covariance matrices, this method computes both the
+   * pseudo-determinant and the pseudo-inverse (Moore-Penrose). Singular values are considered
+   * to be non-zero only if they exceed a tolerance based on machine precision, matrix size, and
+   * relation to the maximum singular value (same tolerance used by, e.g., Octave).
+   */
+  private def calculateCovarianceConstants: (DBM[Double], Double) = {
+    val eigSym.EigSym(d, u) = eigSym(sigma) // sigma = u * diag(d) * u.t
+
+    // For numerical stability, values are considered to be non-zero only if they exceed tol.
+    // This prevents any inverted value from exceeding (eps * n * max(d))^-1
+    val tol = MLUtils.EPSILON * max(d) * d.length
+
+    // pseudo-determinant is product of all non-zero eigenvalues
+    val pdetSigma = d.activeValuesIterator.filter(_ > tol).reduce(_ * _)
--- End diff --

If all singular values are <= tol, then this will throw an UnsupportedOperationException. Could you perhaps catch it and throw a more meaningful error if that happens?
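The tolerance and pseudo-determinant logic under review can be exercised standalone. Below is a minimal sketch (plain Scala for a symmetric 2x2 covariance matrix, not the PR's Breeze-based code), using `foldLeft` so the all-values-below-tolerance case yields an empty product rather than the `UnsupportedOperationException` that `reduce` throws on an empty iterator:

```scala
// Hedged sketch: pseudo-determinant of a symmetric 2x2 covariance matrix,
// using the tolerance rule described in the diff (eps * n * max(eigenvalue)).
// Object and method names are illustrative, not from MLlib.
object PseudoDetSketch {
  val eps = 2.220446049250313e-16 // double-precision machine epsilon

  // Closed-form eigenvalues of [[a, b], [b, c]] for a symmetric 2x2 matrix.
  def eigenvalues2x2(a: Double, b: Double, c: Double): Seq[Double] = {
    val mean = (a + c) / 2.0
    val disc = math.sqrt(((a - c) / 2.0) * ((a - c) / 2.0) + b * b)
    Seq(mean + disc, mean - disc)
  }

  // Pseudo-determinant: product of eigenvalues exceeding the tolerance.
  // foldLeft returns 1.0 (empty product) instead of throwing when all are below tol.
  def pseudoDet(a: Double, b: Double, c: Double): Double = {
    val d = eigenvalues2x2(a, b, c)
    val tol = eps * d.max * d.length
    d.filter(_ > tol).foldLeft(1.0)(_ * _)
  }
}
```

For `[[4, -1], [-1, 2]]` the eigenvalues are 3 ± √2, so the pseudo-determinant equals the ordinary determinant, 7; for the rank-1 matrix `[[1, 1], [1, 1]]` the eigenvalues are 2 and 0, and only the 2 survives the tolerance.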
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22429029

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussianSuite.scala ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.impl
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{Vectors, Matrices}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class MultivariateGaussianSuite extends FunSuite with MLlibTestSparkContext {
+  test("univariate") {
+    val x = Vectors.dense(0.0).toBreeze.toDenseVector
+
+    val mu = Vectors.dense(0.0).toBreeze.toDenseVector
+    var sigma = Matrices.dense(1, 1, Array(1.0)).toBreeze.toDenseMatrix
+    var dist = new MultivariateGaussian(mu, sigma)
+    assert(dist.pdf(x) ~== 0.39894 absTol 1E-5)
+
+    sigma = Matrices.dense(1, 1, Array(4.0)).toBreeze.toDenseMatrix
+    dist = new MultivariateGaussian(mu, sigma)
+    assert(dist.pdf(x) ~== 0.19947 absTol 1E-5)
+  }
+
+  test("multivariate") {
+    val x = Vectors.dense(0.0, 0.0).toBreeze.toDenseVector
+
+    val mu = Vectors.dense(0.0, 0.0).toBreeze. toDenseVector
--- End diff --

typo: space between . and toDenseVector
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22429025

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala ---
@@ -17,23 +17,69 @@
 package org.apache.spark.mllib.stat.impl
 
-import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv}
+import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym}
 
-/**
- * Utility class to implement the density function for multivariate Gaussian distribution.
- * Breeze provides this functionality, but it requires the Apache Commons Math library,
- * so this class is here so-as to not introduce a new dependency in Spark.
- */
+import org.apache.spark.mllib.util.MLUtils
+
+/*
+ * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution. In
+ * the event that the covariance matrix is singular, the density will be computed in a
+ * reduced dimensional subspace under which the distribution is supported.
+ * (see http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case)
+ *
+ * @param mu The mean vector of the distribution
+ * @param sigma The covariance matrix of the distribution
+ */
 private[mllib] class MultivariateGaussian(
     val mu: DBV[Double],
     val sigma: DBM[Double]) extends Serializable {
 
-  private val sigmaInv2 = pinv(sigma) * -0.5
-  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5)
+
+  /**
+   * Compute distribution dependent constants:
+   *    sigmaInv2 = (-1/2) * inv(sigma)
+   *    u = (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   */
+  private val (sigmaInv2: DBM[Double], u: Double) = calculateCovarianceConstants
 
   /** Returns density of this multivariate Gaussian at given point, x */
  def pdf(x: DBV[Double]): Double = {
    val delta = x - mu
-    val deltaTranspose = new Transpose(delta)
-    U * math.exp(deltaTranspose * sigmaInv2 * delta)
+    u * math.exp(delta.t * sigmaInv2 * delta)
+  }
+
+  /**
+   * Calculate distribution dependent components used for the density function:
+   *    pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * (x-mu).t * inv(sigma) * (x-mu) )
+   * where k is length of the mean vector.
+   *
+   * We here compute distribution-fixed parts
+   *    (2*pi)^(-k/2) * det(sigma)^(-1/2)
+   * and
+   *    (-1/2) * inv(sigma)
+   *
+   * Both the determinant and the inverse can be computed from the singular value decomposition
+   * of sigma. Noting that covariance matrices are always symmetric and positive semi-definite,
+   * we can use the eigendecomposition.
+   *
+   * To guard against singular covariance matrices, this method computes both the
+   * pseudo-determinant and the pseudo-inverse (Moore-Penrose). Singular values are considered
+   * to be non-zero only if they exceed a tolerance based on machine precision, matrix size, and
+   * relation to the maximum singular value (same tolerance used by, e.g., Octave).
+   */
+  private def calculateCovarianceConstants: (DBM[Double], Double) = {
+    val eigSym.EigSym(d, u) = eigSym(sigma) // sigma = u * diag(d) * u.t
+
+    // For numerical stability, values are considered to be non-zero only if they exceed tol.
+    // This prevents any inverted value from exceeding (eps * n * max(d))^-1
+    val tol = MLUtils.EPSILON * max(d) * d.length
+
+    // pseudo-determinant is product of all non-zero eigenvalues
--- End diff --

"eigenvalues" -> "singular values" (here and in the next comment on line 79)
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22429030

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussianSuite.scala ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.impl
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{Vectors, Matrices}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class MultivariateGaussianSuite extends FunSuite with MLlibTestSparkContext {
+  test("univariate") {
+    val x = Vectors.dense(0.0).toBreeze.toDenseVector
+
+    val mu = Vectors.dense(0.0).toBreeze.toDenseVector
+    var sigma = Matrices.dense(1, 1, Array(1.0)).toBreeze.toDenseMatrix
+    var dist = new MultivariateGaussian(mu, sigma)
+    assert(dist.pdf(x) ~== 0.39894 absTol 1E-5)
+
+    sigma = Matrices.dense(1, 1, Array(4.0)).toBreeze.toDenseMatrix
+    dist = new MultivariateGaussian(mu, sigma)
+    assert(dist.pdf(x) ~== 0.19947 absTol 1E-5)
+  }
+
+  test("multivariate") {
+    val x = Vectors.dense(0.0, 0.0).toBreeze.toDenseVector
+
+    val mu = Vectors.dense(0.0, 0.0).toBreeze. toDenseVector
+    var sigma = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0)).toBreeze.toDenseMatrix
+    var dist = new MultivariateGaussian(mu, sigma)
+    assert(dist.pdf(x) ~== 0.15915 absTol 1E-5)
+
+    sigma = Matrices.dense(2, 2, Array(4.0, -1.0, -1.0, 2.0)).toBreeze.toDenseMatrix
+    dist = new MultivariateGaussian(mu, sigma)
+    assert(dist.pdf(x) ~== 0.060155 absTol 1E-5)
+  }
+
+  test("multivariate degenerate") {
+    val x = Vectors.dense(0.0, 0.0).toBreeze.toDenseVector
+
+    val mu = Vectors.dense(0.0, 0.0).toBreeze. toDenseVector
--- End diff --

typo: space between . and toDenseVector
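As a sanity check on the constants asserted in this suite: at x = 0 the Gaussian density has a simple closed form, and a few lines of plain Scala (no Breeze or MLlib; names here are illustrative) reproduce the expected values:

```scala
// Closed-form Gaussian densities at x = 0, matching the suite's expected constants.
object GaussianPdfCheck {
  // Univariate N(0, variance) density at 0: 1 / sqrt(2*pi*variance)
  def pdf1dAtZero(variance: Double): Double =
    1.0 / math.sqrt(2.0 * math.Pi * variance)

  // Bivariate N(0, Sigma) density at 0: 1 / (2*pi*sqrt(det(Sigma)))
  def pdf2dAtZero(det: Double): Double =
    1.0 / (2.0 * math.Pi * math.sqrt(det))

  def main(args: Array[String]): Unit = {
    println(pdf1dAtZero(1.0)) // ~0.39894
    println(pdf1dAtZero(4.0)) // ~0.19947
    println(pdf2dAtZero(1.0)) // identity covariance: ~0.15915
    // det([[4, -1], [-1, 2]]) = 4*2 - (-1)*(-1) = 7
    println(pdf2dAtZero(7.0)) // ~0.060155
  }
}
```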
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3844#discussion_r22429050

--- Diff: external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala ---
@@ -17,31 +17,114 @@
 package org.apache.spark.streaming.mqtt
 
-import org.scalatest.FunSuite
+import java.net.{URI, ServerSocket}
 
-import org.apache.spark.streaming.{Seconds, StreamingContext}
+import org.apache.activemq.broker.{TransportConnector, BrokerService}
+import org.apache.spark.util.Utils
+import org.scalatest.{BeforeAndAfter, FunSuite}
+import org.scalatest.concurrent.Eventually
+import scala.concurrent.duration._
+import org.apache.spark.streaming.{Milliseconds, StreamingContext}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.dstream.ReceiverInputDStream
+import org.eclipse.paho.client.mqttv3._
+import org.eclipse.paho.client.mqttv3.persist.MqttDefaultFilePersistence
 
-class MQTTStreamSuite extends FunSuite {
-
-  val batchDuration = Seconds(1)
+class MQTTStreamSuite extends FunSuite with Eventually with BeforeAndAfter {
+  private val batchDuration = Milliseconds(500)
   private val master: String = "local[2]"
-  private val framework: String = this.getClass.getSimpleName
+  private val freePort = findFreePort()
+  private val brokerUri = "//localhost:" + freePort
+  private val topic = "def"
+  private var ssc: StreamingContext = _
+  private val persistenceDir = Utils.createTempDir()
+  private var broker: BrokerService = _
+  private var connector: TransportConnector = _
 
-  test("mqtt input stream") {
-    val ssc = new StreamingContext(master, framework, batchDuration)
-    val brokerUrl = "abc"
-    val topic = "def"
+  before {
+    ssc = new StreamingContext(master, framework, batchDuration)
+    setupMQTT
+  }
 
-    // tests the API, does not actually test data receiving
-    val test1: ReceiverInputDStream[String] = MQTTUtils.createStream(ssc, brokerUrl, topic)
-    val test2: ReceiverInputDStream[String] =
-      MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_AND_DISK_SER_2)
+  after {
+    if (ssc != null) {
+      ssc.stop()
+      ssc = null
+    }
+    Utils.deleteRecursively(persistenceDir)
+    tearDownMQTT
+  }
 
-    // TODO: Actually test receiving data
+  test("mqtt input stream") {
+    val sendMessage = "MQTT demo for spark streaming"
+    val receiveStream: ReceiverInputDStream[String] =
+      MQTTUtils.createStream(ssc, "tcp:" + brokerUri, topic, StorageLevel.MEMORY_ONLY)
+    var receiveMessage: List[String] = List()
+    receiveStream.foreachRDD { rdd =>
+      if (rdd.collect.length > 0) {
+        receiveMessage = receiveMessage ::: List(rdd.first)
+        receiveMessage
+      }
+    }
+    ssc.start()
+    publishData(sendMessage)
+    eventually(timeout(1 milliseconds), interval(100 milliseconds)) {
+      assert(sendMessage.equals(receiveMessage(0)))
+    }
    ssc.stop()
  }
+
+  private def setupMQTT() {
+    broker = new BrokerService()
+    connector = new TransportConnector()
+    connector.setName("mqtt")
+    connector.setUri(new URI("mqtt:" + brokerUri))
+    broker.addConnector(connector)
+    broker.start()
+  }
+
+  private def tearDownMQTT() {
+    if (broker != null) {
+      broker.stop()
+      broker = null
+    }
+    if (connector != null) {
+      connector.stop()
+      connector = null
+    }
+  }
+
+  private def findFreePort(): Int = {
+    Utils.startServiceOnPort(23456, (trialPort: Int) => {
+      val socket = new ServerSocket(trialPort)
+      socket.close()
+      (null, trialPort)
+    })._2
+  }
+
+  def publishData(data: String): Unit = {
+    var client: MqttClient = null
+    try {
+      val persistence: MqttClientPersistence = new MqttDefaultFilePersistence(persistenceDir.getAbsolutePath)
+      client = new MqttClient("tcp:" + brokerUri, MqttClient.generateClientId(), persistence)
+      client.connect()
+      if (client.isConnected) {
+        val msgTopic: MqttTopic = client.getTopic(topic)
+        val message: MqttMessage = new MqttMessage(data.getBytes("utf-8"))
+        message.setQos(1)
+        message.setRetained(true)
+        for (i <- 0 to 10)
+          msgTopic.publish(message)
+      }
+    } catch {
+      case e: MqttException => println("Exception Caught: " + e)
--- End diff --

Why can there be an exception? And if there is an exception, why is it being ignored? Printing and not doing anything is essentially ignoring it in the unit test.
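The review point above can be shown in a minimal, self-contained form (hypothetical helper names, not the PR's code): swallowing an exception with `println` hides the failure from the test, while wrapping and rethrowing surfaces the real cause immediately:

```scala
// Hedged sketch of swallow-vs-propagate; `send` stands in for the MQTT publish call.
object ExceptionHandlingSketch {
  // Swallowing: the caller only learns of failure via a log line and a false return,
  // so a test that depends on the publish will fail later by timing out instead.
  def publishSwallow(send: () => Unit): Boolean =
    try { send(); true }
    catch { case e: Exception => println("Exception Caught: " + e); false }

  // Propagating: the test fails immediately with the underlying cause attached.
  def publishPropagate(send: () => Unit): Unit =
    try send()
    catch { case e: Exception => throw new RuntimeException("publish failed", e) }

  def main(args: Array[String]): Unit = {
    val failing = () => throw new IllegalStateException("broker down")
    println(publishSwallow(failing)) // false: the failure is hidden from the caller
    try publishPropagate(failing)
    catch { case e: RuntimeException => println(e.getMessage) } // "publish failed"
  }
}
```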
[GitHub] spark pull request: [SPARK-4835] Disable validateOutputSpecs for S...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3832#issuecomment-68580794 @tdas I've updated this PR and added a test case. My test case uses calls inside of a `transform()` call to emulate what Streaming's `saveAsHadoopFiles` operation does. Is this a valid use of `transform()` or am I breaking rules by having actions in my transform function? My gut says that we shouldn't endorse / recommend this for the same reason that we advise against using accumulators inside of map() tasks: the transform call might get evaluated multiple times if caching isn't used, which makes it possible to write programs whose behavior changes depending on whether caching is enabled. I wasn't able to get the existing "recovery with saveAsNewAPIHadoopFiles operation" test to fail, though, even though I discovered this bug while refactoring that test in my other PR. I think that the issue is that the failed `saveAsNewAPIHadoopFiles` jobs failed but did not trigger a failure of the other actions / transformations in that batch, so we still got the correct output even though the batch completion event wasn't posted to the listener bus. The current tests rely on wall-clock time to detect when batches have been processed and hence didn't detect that the batch completion event was missing. I noticed that the StreamingListener API doesn't really have any events for job / batch failures, but that's a topic for a separate PR. I was about to write that this bug might not actually affect users who don't use `transform`, but it still impacts users in the partial-failure case where they've used PairDStreamFunctions.saveAsHadoopFiles() but a batch fails with partially-written output: an individual output _partition_ might be atomically committed to the output directory (e.g. if the file exists, then it has the right contents), but I think we can still wind up in a scenario where only a subset of the partitions are written and the non-empty output directory prevents the recovery from recomputing the missing partitions.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22429041 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,69 @@ package org.apache.spark.mllib.stat.impl -import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv} +import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym} -/** - * Utility class to implement the density function for multivariate Gaussian distribution. - * Breeze provides this functionality, but it requires the Apache Commons Math library, - * so this class is here so-as to not introduce a new dependency in Spark. - */ +import org.apache.spark.mllib.util.MLUtils + +/* --- End diff -- Use ```/**```
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68580981 [Test build #25001 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25001/consoleFull) for PR 3871 at commit [`d448137`](https://github.com/apache/spark/commit/d448137b739691c152dd981f136cef62b65d4e50). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4835] Disable validateOutputSpecs for S...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3832#issuecomment-68580537 [Test build #25003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25003/consoleFull) for PR 3832 at commit [`6485cf8`](https://github.com/apache/spark/commit/6485cf880465cf7bd8e501dc861869be58029995). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68581031 @tgaloppo Thanks for the updates. Sure, the log-space computation could be in another PR. Just to make sure: Did you compute the PDF values in the tests using other software?
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/3844#issuecomment-68581068 This is almost looking good. A few more comments and we are ready. :)
[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...
GitHub user WangTaoTheTonic opened a pull request: https://github.com/apache/spark/pull/3875 [SPARK-5057]Add more details in log when using actor to get infos https://issues.apache.org/jira/browse/SPARK-5057 You can merge this pull request into a Git repository by running: $ git pull https://github.com/WangTaoTheTonic/spark SPARK-5057 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3875.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3875 commit 706c8a7d02a07bfc6b096221777f44eabc36467b Author: WangTaoTheTonic barneystin...@aliyun.com Date: 2015-01-02T10:20:41Z log more messages
[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md
GitHub user akhld opened a pull request: https://github.com/apache/spark/pull/3876 Fixed typos in streaming-kafka-integration.md Changed projrect to project :) You can merge this pull request into a Git repository by running: $ git pull https://github.com/akhld/spark patch-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3876.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3876 commit e0cf9ef44a7c5b324158325d59acbea7236f9203 Author: Akhil Das ak...@darktech.ca Date: 2015-01-02T10:32:12Z Fixed typos in streaming-kafka-integration.md Changed projrect to project :)
[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3874#issuecomment-68514968 You're right that I think this is too broad. I think I misspoke earlier. Isn't the theory here that you can bring a later version of Optional with you in your app? Spark barely uses its API. If your copy of Optional hides the one in Spark, which is only there to keep the signature the same, is that OK?
[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3741#issuecomment-68517965 [Test build #24992 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24992/consoleFull) for PR 3741 at commit [`46ad71e`](https://github.com/apache/spark/commit/46ad71ed44df4f1dbea7614ae2057ab1d6207ab4). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3741#issuecomment-68517724 [Test build #24991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24991/consoleFull) for PR 3741 at commit [`1b047e6`](https://github.com/apache/spark/commit/1b047e6cefb652e8ce4d2cf0cbd57bcc84654370). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3875#issuecomment-68518118 [Test build #24993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24993/consoleFull) for PR 3875 at commit [`706c8a7`](https://github.com/apache/spark/commit/706c8a7d02a07bfc6b096221777f44eabc36467b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3741#issuecomment-68517726 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24991/ Test PASSed.
[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3741#issuecomment-68517967 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24992/ Test PASSed.
[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3876#issuecomment-68518565 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/3861#discussion_r22418308 --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedMesosExecutorBackend.scala --- @@ -0,0 +1,212 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.executor + +import org.apache.spark.{SparkConf, Logging, SecurityManager} +import org.apache.mesos.{Executor => MesosExecutor, ExecutorDriver, MesosExecutorDriver, MesosNativeLibrary} +import org.apache.spark.util.{Utils, SignalLogger} +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.mesos.Protos._ +import org.apache.spark.deploy.worker.StandaloneWorkerShuffleService +import scala.collection.JavaConversions._ +import scala.io.Source +import java.io.{File, PrintWriter} + +/** + * The Coarse grained Mesos executor backend is responsible for launching the shuffle service + * and the CoarseGrainedExecutorBackend actor. + * This is assuming the scheduler detected that the shuffle service is enabled and launches + * this class instead of CoarseGrainedExecutorBackend directly.
 + */ +private[spark] class CoarseGrainedMesosExecutorBackend(val sparkConf: SparkConf) + extends MesosExecutor + with Logging { + + private var shuffleService: StandaloneWorkerShuffleService = null + private var driver: ExecutorDriver = null + private var executorProc: Process = null + private var taskId: TaskID = null + @volatile var killed = false + + override def registered( + driver: ExecutorDriver, + executorInfo: ExecutorInfo, + frameworkInfo: FrameworkInfo, + slaveInfo: SlaveInfo) { + this.driver = driver + logInfo("Coarse Grain Mesos Executor '" + executorInfo.getExecutorId.getValue + --- End diff -- Grained
[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22420650 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +private[spark] class HashOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = { + val h1 = if (x == null) 0 else x.hashCode() + val h2 = if (y == null) 0 else y.hashCode() + if (h1 < h2) -1 else if (h1 == h2) 0 else 1 + } +} + +private[spark] class NoOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = 0 +} + +private[spark] class KeyValueOrdering[A, B]( + ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]] +) extends Ordering[Product2[A, B]] { + private val ord1 = ordering1.getOrElse(new HashOrdering[A]) + private val ord2 = ordering2.getOrElse(new NoOrdering[B]) --- End diff -- What is the expected scenario in which a `KeyValueOrdering` is called for with `B` unordered? You're setting up `KeyValueOrdering` to be more general than your needs for its only current usage in `OrderedValueRDDFunctions`, but I'm not quite grasping how and where else you are expecting `KeyValueOrdering` to be used. It's seeming to me that `KeyValueOrdering` should have two ctors: ```scala KeyValueOrdering[A, B](keyOrdering: Ordering[A], valueOrdering: Ordering[B]) ... this(valueOrdering: Ordering[B]) = this(new HashOrdering[A], valueOrdering) ```
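A compiling sketch of the two-constructor shape suggested above (the class names come from the diff; the demo object and tuple values are illustrative, not part of the PR):

```scala
// HashOrdering as in the diff under review.
class HashOrdering[A] extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
}

// Two-ctor variant: the auxiliary constructor defaults the key side to
// HashOrdering instead of threading Option[Ordering[_]] parameters through.
class KeyValueOrdering[A, B](keyOrdering: Ordering[A], valueOrdering: Ordering[B])
    extends Ordering[Product2[A, B]] {
  def this(valueOrdering: Ordering[B]) = this(new HashOrdering[A], valueOrdering)

  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
    val c1 = keyOrdering.compare(x._1, y._1)
    if (c1 != 0) c1 else valueOrdering.compare(x._2, y._2)
  }
}

object KeyValueOrderingDemo extends App {
  val kv = new KeyValueOrdering[String, Int](Ordering.String, Ordering.Int)
  assert(kv.compare(("a", 1), ("b", 1)) < 0) // key decides first
  assert(kv.compare(("a", 2), ("a", 1)) > 0) // then value breaks the tie
}
```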
[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22422723 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +private[spark] class HashOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = { + val h1 = if (x == null) 0 else x.hashCode() + val h2 = if (y == null) 0 else y.hashCode() + if (h1 < h2) -1 else if (h1 == h2) 0 else 1 + } +} + +private[spark] class NoOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = 0 +} + +private[spark] class KeyValueOrdering[A, B]( + ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]] +) extends Ordering[Product2[A, B]] { + private val ord1 = ordering1.getOrElse(new HashOrdering[A]) + private val ord2 = ordering2.getOrElse(new NoOrdering[B]) + + override def compare(x: Product2[A, B], y: Product2[A, B]): Int = { + val c1 = ord1.compare(x._1, y._1) + if (c1 != 0) c1 else ord2.compare(x._2, y._2) --- End diff -- What happens when `ord1` is `HashOrdering` and `c1 == 0` but `x._1 != y._1`? More generally, what happens when `ord1` isn't actually a full ordering?
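The collision case being asked about can be reproduced in a few lines of plain Scala (`HashOrdering` copied from the diff; the demo object and the `"Aa"`/`"BB"` values are illustrative): two unequal keys whose hash codes collide compare as equal, which is exactly the `c1 == 0` with `x._1 != y._1` gap.

```scala
// HashOrdering as in the diff under review.
class HashOrdering[A] extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
}

object HashCollisionDemo extends App {
  // "Aa" and "BB" are distinct Strings with the same hashCode (2112).
  assert("Aa" != "BB" && "Aa".hashCode == "BB".hashCode)
  // compare returns 0 even though the keys differ, so this is only
  // a partial ordering: unequal keys can be ordered as if equal.
  assert(new HashOrdering[String].compare("Aa", "BB") == 0)
}
```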
[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22421802 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +private[spark] class HashOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = { + val h1 = if (x == null) 0 else x.hashCode() + val h2 = if (y == null) 0 else y.hashCode() + if (h1 < h2) -1 else if (h1 == h2) 0 else 1 + } +} --- End diff -- `ExternalSorter#keyComparator` should be refactored to use `spark.util.HashOrdering`.
[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22422719 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +private[spark] class HashOrdering[A] extends Ordering[A] { --- End diff -- This isn't actually true. The `compare` method only produces a partial ordering. `ExternalSorter#keyComparator` gets away with the `Ordering[K]` falsehood only because later passes resolve hash collisions.
[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3876#issuecomment-68557428 [Test build #24997 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24997/consoleFull) for PR 3876 at commit [`e0cf9ef`](https://github.com/apache/spark/commit/e0cf9ef44a7c5b324158325d59acbea7236f9203). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: Fixed typos in streaming-kafka-integration.md
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3876#issuecomment-68557433 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24997/ Test PASSed.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68562386 @tgaloppo Could you please add a description? It can be based off of the JIRA, just enough to cover the main points of the PR. Thanks!
[GitHub] spark pull request: Branch 1.2
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/3880#issuecomment-68550117 Hi @hxfeng I think this might be an accidental pull request -- merging 1.2 back into master would be a huge change! Would you mind closing this PR? Thanks!
[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22428452 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
 + */ + +package org.apache.spark.util + +private[spark] class HashOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = { + val h1 = if (x == null) 0 else x.hashCode() + val h2 = if (y == null) 0 else y.hashCode() + if (h1 < h2) -1 else if (h1 == h2) 0 else 1 + } +} + +private[spark] class NoOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = 0 +} + +private[spark] class KeyValueOrdering[A, B]( + ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]] +) extends Ordering[Product2[A, B]] { + private val ord1 = ordering1.getOrElse(new HashOrdering[A]) + private val ord2 = ordering2.getOrElse(new NoOrdering[B]) + + override def compare(x: Product2[A, B], y: Product2[A, B]): Int = { + val c1 = ord1.compare(x._1, y._1) + if (c1 != 0) c1 else ord2.compare(x._2, y._2) --- End diff -- i see 2 options: 1) do something similar to what happens in ExternalSorter.mergeWithAggregation, where in groupByKeyAndSortValues i am aware of the fact that i might be processing multiple keys (with the same hashCode) at once and check for key equality. this increases memory requirements (all values for all keys with the same hashCode have to fit in memory, as opposed to all values for a single key). 2) require an ordering for K which can be used as a tie breaker when the hashCodes of the keys are the same, so that i have a total ordering for K. thoughts? i will add a unit test where i have multiple keys with the same hashCode.
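Option (2) above can be sketched as a comparator that falls back to a caller-supplied total ordering on K whenever the hash codes tie (`HashThenKeyOrdering` is a hypothetical helper for illustration, not part of the PR):

```scala
// Sketch of the tie-breaker idea: order by hashCode first, and when two
// keys collide on hashCode, fall back to a total ordering on the key type,
// so distinct keys are never collapsed into compare == 0.
class HashThenKeyOrdering[A](tieBreaker: Ordering[A]) extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1
    else if (h1 > h2) 1
    else tieBreaker.compare(x, y) // same hash: resolve with the total order
  }
}

object TieBreakerDemo extends App {
  val ord = new HashThenKeyOrdering[String](Ordering.String)
  // "Aa" and "BB" collide on hashCode, but the tie breaker separates them,
  // giving a total ordering on keys.
  assert(ord.compare("Aa", "BB") != 0)
  assert(ord.compare("Aa", "Aa") == 0)
}
```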
[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...
Github user alexbaretta commented on a diff in the pull request: https://github.com/apache/spark/pull/3882#discussion_r22428318 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -269,6 +269,43 @@ class SQLContext(@transient val sparkContext: SparkContext) path, ScalaReflection.attributesFor[A], allowExisting, conf, this)) } + + /** + * :: Experimental :: + * Creates an empty parquet file with the provided schema. The parquet file thus created + * can be registered as a table, which can then be used as the target of future + * `insertInto` operations. + * + * {{{ + * val sqlContext = new SQLContext(...) + * import sqlContext._ + * + * val schema = StructType(List(StructField("name", StringType), StructField("age", IntegerType))) + * createParquetFile(schema, "path/to/file.parquet").registerTempTable("people") + * sql("INSERT INTO people SELECT 'michael', 29") + * }}} + * + * @param schema StructType describing the records to be stored in the Parquet file. + * @param path The path where the directory containing parquet metadata should be created. + * Data inserted into this table will also be stored at this location. + * @param allowExisting When false, an exception will be thrown if this directory already exists. + * @param conf A Hadoop configuration object that can be used to specify options to the parquet + * output format. + * + * @group userf + */ + @Experimental + def createParquetFile( --- End diff -- Andrew, OK, but keep in mind that my patch overloads an existing method. If you think createParquetFile should be renamed to createEmptyParquetFile you should probably file a separate JIRA. Also, arguably creating a file implies that it is empty.
Alex On Jan 2, 2015 5:11 PM, Andrew Ash notificati...@github.com wrote: I kind of think createEmptyParquetFile would be a better name for this method, since most Parquet files have data I'd think
[GitHub] spark pull request: [SPARK-5062][Graphx] replace mapReduceTriplets...
GitHub user shijinkui opened a pull request: https://github.com/apache/spark/pull/3883 [SPARK-5062][Graphx] replace mapReduceTriplets with aggregateMessages in Pregel API. Since Spark 1.2 introduced aggregateMessages to replace mapReduceTriplets, and it indeed improves performance, it's time to replace mapReduceTriplets with aggregateMessages in Pregel. I provide a deprecated method for compatibility. I have drawn a diagram of aggregateMessages to show why it improves performance. ![graphx_aggreate_msg](https://cloud.githubusercontent.com/assets/648508/5601161/0444efdc-932b-11e4-8944-8e132339be9b.jpg) You can merge this pull request into a Git repository by running: $ git pull https://github.com/shijinkui/spark pregel_agg Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3883.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3883 commit 93ae74bc5c9011719775e9862f257c2e81a9 Author: 玄畅 jinkui@alibaba-inc.com Date: 2015-01-01T02:43:27Z change mapReduceTriplets to aggregateMessages of Pregel API commit d2519e235c53c8ee53c5f127cf680585f139eb0c Author: 玄畅 jinkui@alibaba-inc.com Date: 2015-01-01T03:21:30Z change mapReduceTriplets to aggregateMessages of Pregel API
[GitHub] spark pull request: [SPARK-5058] Updated broken links
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3877#issuecomment-68577996 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24999/ Test FAILed.
[GitHub] spark pull request: [SPARK-5058] Updated broken links
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3877#issuecomment-68577995 [Test build #24999 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24999/consoleFull) for PR 3877 at commit [`3e19b31`](https://github.com/apache/spark/commit/3e19b31890f8317550c28b60edc3f5ea3137776c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5062][Graphx] replace mapReduceTriplets...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3883#issuecomment-68578615 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68578626 [Test build #25001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25001/consoleFull) for PR 3871 at commit [`d448137`](https://github.com/apache/spark/commit/d448137b739691c152dd981f136cef62b65d4e50). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3882#issuecomment-68577765 [Test build #25000 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25000/consoleFull) for PR 3882 at commit [`f6e40b5`](https://github.com/apache/spark/commit/f6e40b50c4aca9372c51d1337d559fc9cf50108d). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3861#issuecomment-68580156 [Test build #25002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25002/consoleFull) for PR 3861 at commit [`a8d036c`](https://github.com/apache/spark/commit/a8d036cf6ec4b8b1fa621a4da955f3274517e41f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3882#issuecomment-68577767 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25000/ Test FAILed.
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user tnachen commented on the pull request: https://github.com/apache/spark/pull/3861#issuecomment-68580094 @ash211 Thanks for the review, updated the PR.
[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3741#issuecomment-68514411 [Test build #24992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24992/consoleFull) for PR 3741 at commit [`46ad71e`](https://github.com/apache/spark/commit/46ad71ed44df4f1dbea7614ae2057ab1d6207ab4). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4465] runAsSparkUser doesn't affect Tas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3741#issuecomment-68514220 [Test build #24991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24991/consoleFull) for PR 3741 at commit [`1b047e6`](https://github.com/apache/spark/commit/1b047e6cefb652e8ce4d2cf0cbd57bcc84654370). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22423972 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,62 @@ package org.apache.spark.mllib.stat.impl -import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv} +import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym} -/** - * Utility class to implement the density function for multivariate Gaussian distribution. - * Breeze provides this functionality, but it requires the Apache Commons Math library, - * so this class is here so-as to not introduce a new dependency in Spark. - */ +import org.apache.spark.mllib.util.MLUtils + +/* + * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution + * + * @param mu The mean vector of the distribution + * @param sigma The covariance matrix of the distribution + */ private[mllib] class MultivariateGaussian( val mu: DBV[Double], val sigma: DBM[Double]) extends Serializable { - private val sigmaInv2 = pinv(sigma) * -0.5 - private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5) - + + private val (sigmaInv2, u) = calculateCovarianceConstants + /** Returns density of this multivariate Gaussian at given point, x */ def pdf(x: DBV[Double]): Double = { val delta = x - mu -val deltaTranspose = new Transpose(delta) -U * math.exp(deltaTranspose * sigmaInv2 * delta) +u * math.exp(delta.t * sigmaInv2 * delta) + } + + /* + * Calculate distribution dependent components used for the density function: + *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * (x-mu).t * inv(sigma) * (x-mu) ) + * where k is length of the mean vector.
+ * + * We here compute distribution-fixed parts + * (2*pi)^(-k/2) * det(sigma)^(-1/2) + * and + * (-1/2) * inv(sigma) + * + * Both the determinant and the inverse can be computed from the singular value decomposition + * of sigma. Noting that covariance matrices are always symmetric and positive semi-definite, + * we can use the eigendecomposition (breeze provides one specifically for symmetric matrices, + * so I am making an assumption here that there is some efficiency gain). + * + * To guard against singular covariance matrices, this method computes both the + * pseudo-determinant and the pseudo-inverse (Moore-Penrose). Singular values are considered + * to be non-zero only if they exceed a tolerance based on machine precision, matrix size, and + * relation to the maximum singular value (same tolerance used by, ie, Octave). + */ + private def calculateCovarianceConstants: (DBM[Double], Double) = { +val eigSym.EigSym(d, u) = eigSym(sigma) // sigma = u * diag(d) * u.t + +// For numerical stability, values are considered to be non-zero only if they exceed tol. +// This prevents any inverted value from exceeding (eps * n * max(d))^-1 +val tol = MLUtils.EPSILON * max(d) * d.length + +// pseudo-determinant is product of all non-zero eigenvalues +val pdetSigma = (0 until d.length).map(i => if (d(i) > tol) d(i) else 1.0).reduce(_ * _) + +// calculate pseudo-inverse by inverting all non-zero eigenvalues +val pinvS = new DBV((0 until d.length).map(i => if (d(i) > tol) (1.0 / d(i)) else 0.0).toArray) --- End diff -- This too can be more concise. You generally do not need to use the ```(0 until length).map``` pattern unless you need the indices; it is easier to map the values of an array like d directly.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22423967 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,62 @@ package org.apache.spark.mllib.stat.impl -import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv} +import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym} -/** - * Utility class to implement the density function for multivariate Gaussian distribution. - * Breeze provides this functionality, but it requires the Apache Commons Math library, - * so this class is here so-as to not introduce a new dependency in Spark. - */ +import org.apache.spark.mllib.util.MLUtils + +/* + * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution + * + * @param mu The mean vector of the distribution + * @param sigma The covariance matrix of the distribution + */ private[mllib] class MultivariateGaussian( val mu: DBV[Double], val sigma: DBM[Double]) extends Serializable { - private val sigmaInv2 = pinv(sigma) * -0.5 - private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5) - + + private val (sigmaInv2, u) = calculateCovarianceConstants + /** Returns density of this multivariate Gaussian at given point, x */ def pdf(x: DBV[Double]): Double = { val delta = x - mu -val deltaTranspose = new Transpose(delta) -U * math.exp(deltaTranspose * sigmaInv2 * delta) +u * math.exp(delta.t * sigmaInv2 * delta) + } + + /* + * Calculate distribution dependent components used for the density function: + *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * (x-mu).t * inv(sigma) * (x-mu) ) + * where k is length of the mean vector.
+ * + * We here compute distribution-fixed parts + * (2*pi)^(-k/2) * det(sigma)^(-1/2) + * and + * (-1/2) * inv(sigma) + * + * Both the determinant and the inverse can be computed from the singular value decomposition + * of sigma. Noting that covariance matrices are always symmetric and positive semi-definite, + * we can use the eigendecomposition (breeze provides one specifically for symmetric matrices, + * so I am making an assumption here that there is some efficiency gain). + * + * To guard against singular covariance matrices, this method computes both the + * pseudo-determinant and the pseudo-inverse (Moore-Penrose). Singular values are considered + * to be non-zero only if they exceed a tolerance based on machine precision, matrix size, and + * relation to the maximum singular value (same tolerance used by, ie, Octave). --- End diff -- ie -> e.g.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22423970 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,62 @@ package org.apache.spark.mllib.stat.impl -import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv} +import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym} -/** - * Utility class to implement the density function for multivariate Gaussian distribution. - * Breeze provides this functionality, but it requires the Apache Commons Math library, - * so this class is here so-as to not introduce a new dependency in Spark. - */ +import org.apache.spark.mllib.util.MLUtils + +/* + * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution + * + * @param mu The mean vector of the distribution + * @param sigma The covariance matrix of the distribution + */ private[mllib] class MultivariateGaussian( val mu: DBV[Double], val sigma: DBM[Double]) extends Serializable { - private val sigmaInv2 = pinv(sigma) * -0.5 - private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5) - + + private val (sigmaInv2, u) = calculateCovarianceConstants + /** Returns density of this multivariate Gaussian at given point, x */ def pdf(x: DBV[Double]): Double = { val delta = x - mu -val deltaTranspose = new Transpose(delta) -U * math.exp(deltaTranspose * sigmaInv2 * delta) +u * math.exp(delta.t * sigmaInv2 * delta) + } + + /* + * Calculate distribution dependent components used for the density function: + *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * (x-mu).t * inv(sigma) * (x-mu) ) + * where k is length of the mean vector.
+ * + * We here compute distribution-fixed parts + * (2*pi)^(-k/2) * det(sigma)^(-1/2) + * and + * (-1/2) * inv(sigma) + * + * Both the determinant and the inverse can be computed from the singular value decomposition + * of sigma. Noting that covariance matrices are always symmetric and positive semi-definite, + * we can use the eigendecomposition (breeze provides one specifically for symmetric matrices, + * so I am making an assumption here that there is some efficiency gain). + * + * To guard against singular covariance matrices, this method computes both the + * pseudo-determinant and the pseudo-inverse (Moore-Penrose). Singular values are considered + * to be non-zero only if they exceed a tolerance based on machine precision, matrix size, and + * relation to the maximum singular value (same tolerance used by, ie, Octave). + */ + private def calculateCovarianceConstants: (DBM[Double], Double) = { +val eigSym.EigSym(d, u) = eigSym(sigma) // sigma = u * diag(d) * u.t + +// For numerical stability, values are considered to be non-zero only if they exceed tol. +// This prevents any inverted value from exceeding (eps * n * max(d))^-1 +val tol = MLUtils.EPSILON * max(d) * d.length + +// pseudo-determinant is product of all non-zero eigenvalues +val pdetSigma = (0 until d.length).map(i => if (d(i) > tol) d(i) else 1.0).reduce(_ * _) --- End diff -- More concise: ``` val pdetSigma = d.activeValuesIterator.filter(_ > tol).foldLeft(1.0)(_ * _) ```
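The reviewer's suggestion (operate on the eigenvalues directly instead of indexing with `(0 until d.length).map`) can be shown with plain Scala collections. The eigenvalues and tolerance below are made-up numbers for the sketch, not values from MLlib:

```scala
// Minimal sketch of tolerance-based eigenvalue filtering, using a plain Array
// in place of a Breeze vector. One eigenvalue sits below the tolerance and is
// treated as zero, as in the pseudo-determinant / pseudo-inverse discussion.
object EigenvalueFilterDemo {
  def main(args: Array[String]): Unit = {
    val d = Array(4.0, 1e-18, 2.0) // hypothetical eigenvalues, one below tolerance
    val tol = 1e-12                // hypothetical tolerance

    // pseudo-determinant: product of the eigenvalues above the tolerance
    val pdet = d.filter(_ > tol).foldLeft(1.0)(_ * _)
    assert(pdet == 8.0) // 4.0 * 2.0

    // pseudo-inverse diagonal: invert only the eigenvalues above the tolerance
    val pinvDiag = d.map(v => if (v > tol) 1.0 / v else 0.0)
    assert(pinvDiag.sameElements(Array(0.25, 0.0, 0.5)))
  }
}
```

Mapping over the values directly also avoids the off-by-one risks of manual index arithmetic and reads closer to the math: filter, then fold or invert.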
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22423959 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,62 @@ package org.apache.spark.mllib.stat.impl -import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv} +import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym} -/** - * Utility class to implement the density function for multivariate Gaussian distribution. - * Breeze provides this functionality, but it requires the Apache Commons Math library, - * so this class is here so-as to not introduce a new dependency in Spark. - */ +import org.apache.spark.mllib.util.MLUtils + +/* + * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution --- End diff -- Perhaps you could add a note here about how this behaves when sigma is singular, plus a reference like [http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case] The doc could be a short version of what you have below for ```calculateCovarianceConstants```
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22423964 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,62 @@ package org.apache.spark.mllib.stat.impl -import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv} +import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym} -/** - * Utility class to implement the density function for multivariate Gaussian distribution. - * Breeze provides this functionality, but it requires the Apache Commons Math library, - * so this class is here so-as to not introduce a new dependency in Spark. - */ +import org.apache.spark.mllib.util.MLUtils + +/* + * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution + * + * @param mu The mean vector of the distribution + * @param sigma The covariance matrix of the distribution + */ private[mllib] class MultivariateGaussian( val mu: DBV[Double], val sigma: DBM[Double]) extends Serializable { - private val sigmaInv2 = pinv(sigma) * -0.5 - private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5) - + + private val (sigmaInv2, u) = calculateCovarianceConstants + /** Returns density of this multivariate Gaussian at given point, x */ def pdf(x: DBV[Double]): Double = { val delta = x - mu -val deltaTranspose = new Transpose(delta) -U * math.exp(deltaTranspose * sigmaInv2 * delta) +u * math.exp(delta.t * sigmaInv2 * delta) + } + + /* + * Calculate distribution dependent components used for the density function: + *pdf(x) = (2*pi)^(-k/2) * det(sigma)^(-1/2) * exp( (-1/2) * (x-mu).t * inv(sigma) * (x-mu) ) + * where k is length of the mean vector.
+ * + * We here compute distribution-fixed parts + * (2*pi)^(-k/2) * det(sigma)^(-1/2) + * and + * (-1/2) * inv(sigma) + * + * Both the determinant and the inverse can be computed from the singular value decomposition + * of sigma. Noting that covariance matrices are always symmetric and positive semi-definite, + * we can use the eigendecomposition (breeze provides one specifically for symmetric matrices, --- End diff -- No need to comment on Breeze here; you can in the PR description if it's in question.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22423961 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,62 @@ package org.apache.spark.mllib.stat.impl -import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv} +import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym} -/** - * Utility class to implement the density function for multivariate Gaussian distribution. - * Breeze provides this functionality, but it requires the Apache Commons Math library, - * so this class is here so-as to not introduce a new dependency in Spark. - */ +import org.apache.spark.mllib.util.MLUtils + +/* + * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution + * + * @param mu The mean vector of the distribution + * @param sigma The covariance matrix of the distribution + */ private[mllib] class MultivariateGaussian( val mu: DBV[Double], val sigma: DBM[Double]) extends Serializable { - private val sigmaInv2 = pinv(sigma) * -0.5 - private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5) - + + private val (sigmaInv2, u) = calculateCovarianceConstants --- End diff -- Could you please add explicit types + documentation here for clarity?
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3871#discussion_r22423962 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala --- @@ -17,23 +17,62 @@ package org.apache.spark.mllib.stat.impl -import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, Transpose, det, pinv} +import breeze.linalg.{DenseVector => DBV, DenseMatrix => DBM, max, diag, eigSym} -/** - * Utility class to implement the density function for multivariate Gaussian distribution. - * Breeze provides this functionality, but it requires the Apache Commons Math library, - * so this class is here so-as to not introduce a new dependency in Spark. - */ +import org.apache.spark.mllib.util.MLUtils + +/* + * This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution + * + * @param mu The mean vector of the distribution + * @param sigma The covariance matrix of the distribution + */ private[mllib] class MultivariateGaussian( val mu: DBV[Double], val sigma: DBM[Double]) extends Serializable { - private val sigmaInv2 = pinv(sigma) * -0.5 - private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5) - + + private val (sigmaInv2, u) = calculateCovarianceConstants + /** Returns density of this multivariate Gaussian at given point, x */ def pdf(x: DBV[Double]): Double = { val delta = x - mu -val deltaTranspose = new Transpose(delta) -U * math.exp(deltaTranspose * sigmaInv2 * delta) +u * math.exp(delta.t * sigmaInv2 * delta) + } + + /* --- End diff -- Use ```/**``` for class/method documentation.
[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22423736 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +private[spark] class HashOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = { +val h1 = if (x == null) 0 else x.hashCode() +val h2 = if (y == null) 0 else y.hashCode() +if (h1 < h2) -1 else if (h1 == h2) 0 else 1 + } +} + +private[spark] class NoOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = 0 +} + +private[spark] class KeyValueOrdering[A, B]( + ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]] +) extends Ordering[Product2[A, B]] { + private val ord1 = ordering1.getOrElse(new HashOrdering[A]) + private val ord2 = ordering2.getOrElse(new NoOrdering[B]) + + override def compare(x: Product2[A, B], y: Product2[A, B]): Int = { +val c1 = ord1.compare(x._1, y._1) +if (c1 != 0) c1 else ord2.compare(x._2, y._2) --- End diff -- good point that doesn't look right. it could lead to keys being interleaved in the output. 
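The composition pattern under discussion can be sketched independently of Spark. Below is a minimal, hypothetical re-implementation (the names are mine, not the PR's): compare by the primary ordering first, and fall through to the secondary ordering only on a tie. The reviewer's concern applies when the primary ordering is hash-based, as with `HashOrdering`: two distinct keys with equal hash codes compare as ties, so their pairs are then ordered by value and the keys interleave in the sorted output.

```scala
object OrderingSketch {
  // Hypothetical composed ordering over (key, value) pairs: primary by
  // key, secondary by value. Only a primary tie consults the secondary.
  def pairOrdering[A, B](ordA: Ordering[A], ordB: Ordering[B]): Ordering[(A, B)] =
    new Ordering[(A, B)] {
      override def compare(x: (A, B), y: (A, B)): Int = {
        val c1 = ordA.compare(x._1, y._1)
        if (c1 != 0) c1 else ordB.compare(x._2, y._2)
      }
    }
}
```

With a total primary ordering this behaves as expected; the interleaving problem only appears when the primary ordering collapses distinct keys into the same equivalence class.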
[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22423573 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +private[spark] class HashOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = { +val h1 = if (x == null) 0 else x.hashCode() +val h2 = if (y == null) 0 else y.hashCode() +if (h1 < h2) -1 else if (h1 == h2) 0 else 1 + } +} + +private[spark] class NoOrdering[A] extends Ordering[A] { + override def compare(x: A, y: A): Int = 0 +} + +private[spark] class KeyValueOrdering[A, B]( + ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]] +) extends Ordering[Product2[A, B]] { + private val ord1 = ordering1.getOrElse(new HashOrdering[A]) + private val ord2 = ordering2.getOrElse(new NoOrdering[B]) --- End diff -- yeah thats right i copied it from another pullreq by me that needed a more general version. i can simplify it.
[GitHub] spark pull request: Allow spark-daemon.sh to support foreground op...
GitHub user hellertime opened a pull request: https://github.com/apache/spark/pull/3881 Allow spark-daemon.sh to support foreground operation Add `--foreground` option to spark-daemon.sh to prevent the process from daemonizing itself. Useful if running under a watchdog which waits on its child process. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hellertime/spark feature/no-daemon-spark-daemon Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3881.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3881 commit 358400fb4f87f2e6de791a116bfd64c5a31f9d39 Author: Chris Heller hellert...@gmail.com Date: 2014-12-29T19:28:53Z Allow spark-daemon.sh to support foreground operation
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68565593 @tgaloppo @mengxr What are your thoughts about doing the computation in log space as much as possible, and then exponentiating at the end? I'm mainly thinking about numerical stability, but I could imagine wanting to provide pdf() and logpdf() methods eventually.
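The log-space idea can be illustrated with a dependency-free sketch, under the simplifying assumption of a diagonal covariance (the PR itself handles full covariance matrices via Breeze's eigendecomposition): accumulate the log-determinant and the quadratic form in log space, and exponentiate only at the end, which also yields a natural logpdf() alongside pdf().

```scala
object GaussianSketch {
  // Hypothetical diagonal-covariance sketch (not the PR's code);
  // sigma2(i) is the variance along dimension i.
  def logpdf(x: Array[Double], mu: Array[Double], sigma2: Array[Double]): Double = {
    val k = x.length
    var logDet = 0.0 // log-determinant of the diagonal covariance
    var quad = 0.0   // quadratic form (x - mu)^T Sigma^-1 (x - mu)
    var i = 0
    while (i < k) {
      logDet += math.log(sigma2(i))
      val d = x(i) - mu(i)
      quad += d * d / sigma2(i)
      i += 1
    }
    -0.5 * (k * math.log(2.0 * math.Pi) + logDet + quad)
  }

  // Exponentiating only at the end avoids underflow far out in the tails,
  // where the density itself would round to 0.0 in double precision.
  def pdf(x: Array[Double], mu: Array[Double], sigma2: Array[Double]): Double =
    math.exp(logpdf(x, mu, sigma2))
}
```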
[GitHub] spark pull request: Allow spark-daemon.sh to support foreground op...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3881#issuecomment-68565114 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3861#issuecomment-68582212 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25002/ Test PASSed.
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3861#issuecomment-68582211 [Test build #25002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25002/consoleFull) for PR 3861 at commit [`a8d036c`](https://github.com/apache/spark/commit/a8d036cf6ec4b8b1fa621a4da955f3274517e41f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` cd %s*; %s ./bin/spark-class .format(basename, prefixEnv)`
[GitHub] spark pull request: [SPARK-4014] Add TaskContext.attemptNumber and...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3849#issuecomment-68581193 Maybe that test was flaky; let's see if it passes again (it'll retest since I pushed a commit to fix a merge conflict). I've updated this patch to not modify `attemptId` but to introduce `attemptNumber` and deprecate `attemptId`. I think it will be confusing to have `attemptId` have different behavior in different branches, especially since it seems like functionality that might be nice to rely on when writing certain types of regression tests. Since this patch doesn't change any behavior, I'd like to backport it to maintenance branches so that we can rely on it in test code. If we decide to do that, the committer should update the MiMa exclusions on cherry-pick.
[GitHub] spark pull request: [SPARK-794][Core] Remove sleep() in ClusterSch...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3851#issuecomment-68581430 @tdas, what do you think about merging this? It looks like you were the last one to touch this line in 27311b13321ba60ee1324b86234f0aaf63df9f67. This fixed `sleep()` seems race-prone enough that I suppose we would have noticed if it was necessary for anything because it would have caused test flakiness.
[GitHub] spark pull request: [SPARK-4835] Disable validateOutputSpecs for S...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3832#issuecomment-68582506 [Test build #25003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25003/consoleFull) for PR 3832 at commit [`6485cf8`](https://github.com/apache/spark/commit/6485cf880465cf7bd8e501dc861869be58029995). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4014] Add TaskContext.attemptNumber and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3849#issuecomment-68581165 [Test build #25004 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25004/consoleFull) for PR 3849 at commit [`eee6a45`](https://github.com/apache/spark/commit/eee6a4569d926d3ada2ca259ddb04906392688ae). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68581355 @jkbradley I used Octave's mvnpdf from the statistics package for the non-singular cases; it cannot handle singular covariance matrices, so I was only able to recreate the function using Octave's pinv() function.
[GitHub] spark pull request: [SPARK-4835] Disable validateOutputSpecs for S...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3832#issuecomment-68582508 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25003/ Test PASSed.
[GitHub] spark pull request: [SPARK-794][Core] Remove sleep() in ClusterSch...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3851#issuecomment-68581466 I'm inclined to merge this into `master` now and not perform any backports right away (maybe it's still serving some purpose in older branches?).
[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3875#issuecomment-68581252 I suppose it'd be nice to use string interpolation here, but I guess the old code didn't use it either, so this matches the surrounding style.
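For readers unfamiliar with the suggestion, the difference is purely syntactic. A toy comparison (the message text here is made up, not the PR's actual log line):

```scala
val numExecutors = 3
// Concatenation, matching the file's existing style:
val concatenated = "Registered " + numExecutors + " executor(s)"
// The s-interpolator splices the value directly into the literal:
val interpolated = s"Registered $numExecutors executor(s)"
```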
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3844#issuecomment-68525514 [Test build #24994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24994/consoleFull) for PR 3844 at commit [`04503cf`](https://github.com/apache/spark/commit/04503cfa7f8168038c17198b6e45b16b89591e74). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class MQTTStreamSuite extends FunSuite with Eventually with BeforeAndAfter `
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3844#issuecomment-68525516 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24994/ Test PASSed.
[GitHub] spark pull request: Updated broken links
GitHub user sigmoidanalytics opened a pull request: https://github.com/apache/spark/pull/3877 Updated broken links Updated the broken link pointing to the KafkaWordCount example to the correct one. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sigmoidanalytics/spark patch-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3877.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3877 commit 3e19b31890f8317550c28b60edc3f5ea3137776c Author: sigmoidanalytics ma...@sigmoidanalytics.com Date: 2015-01-02T10:44:34Z Updated broken links Updated the broken link pointing to the KafkaWordCount example to the correct one.
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user hxfeng opened a pull request: https://github.com/apache/spark/pull/3879 Merge pull request #1 from apache/master update You can merge this pull request into a Git repository by running: $ git pull https://github.com/hxfeng/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3879.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3879 commit b3ee640ffa59ed14fdeba61d5bf53b9b8e6cc520 Author: hxfeng 980548...@qq.com Date: 2014-12-28T03:51:35Z Merge pull request #1 from apache/master update
[GitHub] spark pull request: [SPARK-2165][YARN]add support for setting maxA...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3878#issuecomment-68525645 [Test build #24995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24995/consoleFull) for PR 3878 at commit [`afdfc99`](https://github.com/apache/spark/commit/afdfc99e2722ac3a910de91dbf0c80972e7f7eb9). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3844#issuecomment-68525682 [Test build #24996 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24996/consoleFull) for PR 3844 at commit [`4b34ee7`](https://github.com/apache/spark/commit/4b34ee784e7c9c489cf0c22d73311c160bc67c47). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2165][YARN]add support for setting maxA...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3878#issuecomment-68525649 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24995/ Test PASSed.
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3844#issuecomment-68525686 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24996/ Test PASSed.
[GitHub] spark pull request: Updated broken links
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3877#issuecomment-68519009 Can one of the admins verify this patch?
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user hxfeng commented on the pull request: https://github.com/apache/spark/pull/3879#issuecomment-68526458 update
[GitHub] spark pull request: [SPARK-5050] Add unit test for sqdist
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3869#discussion_r22424322 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala --- @@ -175,6 +175,42 @@ class VectorsSuite extends FunSuite { assert(v.size === x.rows) } + test("sqdist") { +val a = (30 to 0 by -1).map(math.pow(2.0, _)).toArray +val n = a.length +val v1 = Vectors.dense(a) +for (m <- 0 until n) { + val indices = (0 to m).toArray + val values = indices.map(i => a(i)) + val v2 = Vectors.sparse(n, indices, values) + val v3 = Vectors.sparse(n, indices, indices.map(i => a(i) + 0.5)) + + // DenseVector vs. SparseVector + val squaredDist = breezeSquaredDistance(v1.toBreeze, v2.toBreeze) + val fastSquaredDist1 = Vectors.sqdist(v1, v2) + assert(fastSquaredDist1 == squaredDist) + + // DenseVector vs. DenseVector + val fastSquaredDist2 = Vectors.sqdist(v1, Vectors.dense(v2.toArray)) + assert(fastSquaredDist2 === squaredDist) + + // SparseVector vs. SparseVector + val squaredDist2 = breezeSquaredDistance(v2.toBreeze, v3.toBreeze) + val fastSquaredDist3 = Vectors.sqdist(v2, v3) + assert(fastSquaredDist3 === squaredDist2) + + // SparseVector vs. SparseVector: with values at different indices + if (m > 10) { +val v4 = Vectors.sparse(n, indices.slice(0, m - 10), + indices.map(i => a(i) + 0.5).slice(0, m - 10)) +val squaredDist = breezeSquaredDistance(v2.toBreeze, v4.toBreeze) +val fastSquaredDist = --- End diff -- Can fit on 1 line instead of 2
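For readers following along, the quantity both `Vectors.sqdist` and `breezeSquaredDistance` compute in the test above is the squared Euclidean distance. A dependency-free sketch for the dense-dense case (a hypothetical helper, not MLlib's implementation, which also has optimized sparse paths):

```scala
object SqDistSketch {
  // Sum of squared coordinate differences; requires equal dimensions.
  def sqdist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }
}
```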
[GitHub] spark pull request: [SPARK-5050] Add unit test for sqdist
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3869#discussion_r22424333 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala --- @@ -175,6 +175,42 @@ class VectorsSuite extends FunSuite { assert(v.size === x.rows) } + test("sqdist") { +val a = (30 to 0 by -1).map(math.pow(2.0, _)).toArray +val n = a.length +val v1 = Vectors.dense(a) +for (m <- 0 until n) { + val indices = (0 to m).toArray + val values = indices.map(i => a(i)) + val v2 = Vectors.sparse(n, indices, values) + val v3 = Vectors.sparse(n, indices, indices.map(i => a(i) + 0.5)) + + // DenseVector vs. SparseVector + val squaredDist = breezeSquaredDistance(v1.toBreeze, v2.toBreeze) + val fastSquaredDist1 = Vectors.sqdist(v1, v2) + assert(fastSquaredDist1 == squaredDist) --- End diff -- ```==``` can be ```===```
[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...
GitHub user alexbaretta opened a pull request: https://github.com/apache/spark/pull/3882 [SPARK-5061][Alex Baretta] SQLContext: overload createParquetFile Overload of createParquetFile taking a StructType instead of a TypeTag You can merge this pull request into a Git repository by running: $ git pull https://github.com/alexbaretta/spark createParquetFile Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3882.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3882 commit f6e40b50c4aca9372c51d1337d559fc9cf50108d Author: Alex Baretta a...@planalechmy.com Date: 2014-12-27T02:29:29Z [Alex Baretta] SQLContext: overload createParquetFile Overload taking a StructType instead of TypeTag
[GitHub] spark pull request: [SPARK-5050] Add unit test for sqdist
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3869#issuecomment-68566521 @viirya Looks OK to me, except for the tiny comments. Thanks! At some point, it might be nice to replace these tests with ones using random dense and sparse vectors (with random sparsity patterns). If you are interested in doing that, I can send you a method for generating random sparse vectors which I used for the timing tests.
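One possible shape for such a generator (a hypothetical sketch, not the method jkbradley offers to share): keep each index independently with a given density and pair it with a Gaussian value, producing the `(indices, values)` arrays that `Vectors.sparse(n, indices, values)` expects.

```scala
import scala.util.Random

object SparseVecSketch {
  // Hypothetical random sparse vector of length n: each index is kept
  // with probability `density`; values are drawn from a standard normal.
  // Indices come out sorted because filtering preserves order.
  def randomSparse(n: Int, density: Double, rng: Random): (Array[Int], Array[Double]) = {
    val kept = (0 until n).filter(_ => rng.nextDouble() < density)
    (kept.toArray, kept.map(_ => rng.nextGaussian()).toArray)
  }
}
```

Seeding the `Random` instance keeps such randomized tests reproducible across runs.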
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68566334 [Test build #24998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24998/consoleFull) for PR 3871 at commit [`b4415ea`](https://github.com/apache/spark/commit/b4415ea70055e8ca2c0444cf964b696f0e1e410d). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68565659 @tgaloppo The logic looks good; my comments are basically about clarity (except for the log space question). Thanks for the PR!
[GitHub] spark pull request: SPARK-5017 [MLlib] - Use SVD to compute determ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3871#issuecomment-68565800 One more request: Could you please add a unit test with a singular matrix? Thank you! Perhaps in a new suite for MultivariateGaussian.
[GitHub] spark pull request: [SPARK-5061][Alex Baretta] SQLContext: overloa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3882#issuecomment-68567852 Can one of the admins verify this patch?