[GitHub] spark pull request: [SPARK-12604] [CORE] Addendum - use casting vs...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10641#issuecomment-169831711 **[Test build #2351 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2351/consoleFull)** for PR 10641 at commit [`377fb49`](https://github.com/apache/spark/commit/377fb49a677f7f81699a7a9c05195cec9503af2b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10472#issuecomment-169836993 **[Test build #48977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48977/consoleFull)** for PR 10472 at commit [`860861c`](https://github.com/apache/spark/commit/860861cb613a2d00a70e4eb699c25b2375c86eda). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12507][Streaming][Document]Expose close...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10453#issuecomment-169837010 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48980/ Test PASSed.
[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10652#issuecomment-169837014 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10472#issuecomment-169837133 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12507][Streaming][Document]Expose close...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10453#issuecomment-169837009 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10652#issuecomment-169837015 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48981/ Test FAILed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49140391 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. 
+""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- Ah yes it should be, I'll add a docstring test for this method.
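The quoted line `x.map(self.predict(x))` calls `predict` on the whole collection again instead of handing `map` a function. The following is a minimal sketch of the dispatch pattern only, not the code that was eventually merged. `FakeRDD` and `NearestCenterModel` are hypothetical stand-ins invented for illustration, so the example runs without Spark.

```python
# Hypothetical stand-ins: FakeRDD mimics only the `map`/`collect` methods of
# an RDD, and NearestCenterModel is not the PySpark BisectingKMeansModel.
class FakeRDD:
    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        # Apply fn to every element (eagerly here; a real RDD is lazy).
        return FakeRDD(fn(item) for item in self.items)

    def collect(self):
        return list(self.items)


class NearestCenterModel:
    def __init__(self, centers):
        self.centers = centers

    def predict(self, x):
        if isinstance(x, FakeRDD):
            # Pass the bound method itself. Writing self.predict(x) here
            # would call predict on the whole collection again and recurse.
            return x.map(self.predict)
        # Single point: index of the closest center by squared distance.
        distances = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in self.centers]
        return distances.index(min(distances))


model = NearestCenterModel([(0.0, 0.0), (9.0, 9.0)])
points = FakeRDD([(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)])
labels = model.predict(points).collect()
```

A docstring test of the kind mentioned in the reply would exercise the collection branch, which the doctests in the quoted diff do not.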
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10653#issuecomment-169841436 **[Test build #48986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48986/consoleFull)** for PR 10653 at commit [`ade6f5d`](https://github.com/apache/spark/commit/ade6f5d354985f3778e0c8c2da80679c76495f0a).
[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10652#issuecomment-169842106 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48985/ Test FAILed.
[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10635#issuecomment-169843491 **[Test build #48988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48988/consoleFull)** for PR 10635 at commit [`8bdd481`](https://github.com/apache/spark/commit/8bdd48189f96a45db54bc8d11e16107b0d15318f).
[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10609#issuecomment-169833544 **[Test build #48979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48979/consoleFull)** for PR 10609 at commit [`4e4e9a1`](https://github.com/apache/spark/commit/4e4e9a136ffae30665979df7307a6175188690f7).
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/10653 [SPARK-12700] [SQL] embed condition into SMJ and BroadcastHashJoin Currently SortMergeJoin and BroadcastHashJoin do not support a condition; they need a following Filter for that. The result projection that generates UnsafeRow can be very expensive if the join produces many rows, most of which are then filtered out by the condition. This PR adds condition support to SortMergeJoin and BroadcastHashJoin, just as the outer joins have. This could improve the performance of Q72 by 7x (from 120s to 16.5s). You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark filter_join Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10653.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10653 commit a38d623d7d57709f2f26b1189ff699c02bd0ca57 Author: Davies Liu Date: 2016-01-07T23:05:50Z embed condition into SMJ and BroadcastHashJoin
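To make the cost argument above concrete, here is a minimal sketch in plain Python, assuming a toy in-memory hash join. It is not Spark's implementation: `hash_join`, the dictionary rows, and the `built` counter are all hypothetical, and the counter merely stands in for the number of output rows that have to be materialized.

```python
def hash_join(left, right, key, condition=None):
    """Toy inner hash join; returns (rows, number_of_output_rows_built)."""
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    out, built = [], 0
    for l in left:
        for r in table.get(l[key], []):
            # With an embedded condition, reject the pair before the
            # (potentially expensive) output row is materialized.
            if condition is not None and not condition(l, r):
                continue
            built += 1
            out.append({**l, **{"r_" + k: v for k, v in r.items()}})
    return out, built


left = [{"k": i % 3, "a": i} for i in range(6)]
right = [{"k": j % 3, "b": j} for j in range(6)]

# Join first, filter afterwards: every matching pair becomes an output row.
rows_all, built_all = hash_join(left, right, "k")
filtered = [row for row in rows_all if row["a"] < row["r_b"]]

# Condition embedded in the join: only surviving pairs become output rows.
rows_embedded, built_embedded = hash_join(
    left, right, "k", condition=lambda l, r: l["a"] < r["b"]
)
```

Both variants return the same rows; the difference is how many rows are built along the way, which is the saving the description attributes to skipping the projection for pairs the condition rejects.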
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10653#issuecomment-169838565 **[Test build #48984 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48984/consoleFull)** for PR 10653 at commit [`a38d623`](https://github.com/apache/spark/commit/a38d623d7d57709f2f26b1189ff699c02bd0ca57).
[GitHub] spark pull request: [SPARK-12510][Streaming]Refactor ActorReceiver...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/10457#issuecomment-169838489 @tdas forgot to merge? I'm merging it now.
[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9936#discussion_r49143685 --- Diff: python/pyspark/ml/tests.py --- @@ -371,6 +378,103 @@ def test_fit_maximize_metric(self): self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1") +class RegressorTest(PySparkTestCase): + +def setupData(self): +try: +self.df +except AttributeError: +from pyspark.mllib.linalg import Vectors +sqlContext = SQLContext(self.sc) +self.df = sqlContext.createDataFrame([ +(1.0, Vectors.dense(1.0)), +(0.0, Vectors.sparse(1, [], []))], ["label", "features"]) + +def test_linear_regression(self): +self.setupData() +lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal") +model = lr.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_decision_tree_regressor(self): +self.setupData() +dt = DecisionTreeRegressor(maxDepth=2) +model = dt.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_random_forest_regressor(self): +self.setupData() +rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42) +model = rf.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_gbt_regressor(self): +self.setupData() +gbt = GBTRegressor(maxIter=5, maxDepth=2) +model = gbt.fit(self.df) +self.assertEquals(1, model.numFeatures) + + +class ClassificationTest(PySparkTestCase): + +def setupData(self): +try: +self.df +except AttributeError: +from pyspark.mllib.linalg import Vectors --- End diff -- is there any reason for putting the import in the code?
[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9936#discussion_r49143733 --- Diff: python/pyspark/ml/tests.py --- @@ -371,6 +378,103 @@ def test_fit_maximize_metric(self): self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1") +class RegressorTest(PySparkTestCase): + +def setupData(self): +try: +self.df +except AttributeError: +from pyspark.mllib.linalg import Vectors +sqlContext = SQLContext(self.sc) +self.df = sqlContext.createDataFrame([ +(1.0, Vectors.dense(1.0)), +(0.0, Vectors.sparse(1, [], []))], ["label", "features"]) + +def test_linear_regression(self): +self.setupData() +lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal") +model = lr.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_decision_tree_regressor(self): +self.setupData() +dt = DecisionTreeRegressor(maxDepth=2) +model = dt.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_random_forest_regressor(self): +self.setupData() +rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42) +model = rf.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_gbt_regressor(self): +self.setupData() +gbt = GBTRegressor(maxIter=5, maxDepth=2) +model = gbt.fit(self.df) +self.assertEquals(1, model.numFeatures) + + +class ClassificationTest(PySparkTestCase): + +def setupData(self): +try: +self.df +except AttributeError: +from pyspark.mllib.linalg import Vectors +sqlContext = SQLContext(self.sc) +self.df = sqlContext.createDataFrame([ +(1.0, Vectors.dense(1.0, 0.0)), +(0.0, Vectors.sparse(2, [1], [1.0]))], ["label", "features"]) + +def test_logistic_regression(self): +self.setupData() +lr = LogisticRegression(maxIter=5, regParam=0.01) +model = lr.fit(self.df) +self.assertEqual(2, model.numFeatures) + +def test_decision_tree_classifier(self): +from pyspark.ml.feature import StringIndexer --- End diff -- same thing here
[GitHub] spark pull request: [SPARK-9835] [ML] IterativelyReweightedLeastSq...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/10639#discussion_r49144625 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/GLMFamilies.scala --- @@ -0,0 +1,123 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.optim + +import org.apache.spark.rdd.RDD + +/** + * A description of the error distribution and link function to be used in the model. + * @param link a link function instance + */ +private[ml] abstract class Family(val link: Link) extends Serializable { + + /** + * Starting value for mu in the IRLS algorithm. + */ + def startingMu(y: Double, yMean: Double): Double = (y + yMean) / 2.0 + + /** + * Deviance of (y, mu) pair. + * Deviance is usually defined as twice the loglikelihood ratio. + */ + def deviance(y: RDD[Double], mu: RDD[Double]): Double + + /** Weights for IRLS steps. */ + def weights(mu: Double): Double + + /** The working dependent variable. */ + def z(y: Double, mu: Double, eta: Double): Double +} + +/** + * Binomial exponential family distribution. + * The default link for the Binomial family is the logit link. 
+ * @param link a link function instance + */ +private[ml] class Binomial(link: Link = new Logit) extends Family(link) { + + override def startingMu(y: Double, yMean: Double): Double = (y + 0.5) / 2.0 + + override def deviance(y: RDD[Double], mu: RDD[Double]): Double = { +mu.zip(y).map { case (mu, y) => + val my = 1.0 - y + y * math.log(math.max(y, 1.0) / mu) + +my * math.log(math.max(my, 1.0) / (1.0 - mu)) +}.sum() * 2 + } + + override def weights(mu: Double): Double = { +mu * (1 - mu) + } + + override def z(y: Double, mu: Double, eta: Double): Double = { +eta + (y - mu) * link.deriv(mu) + } +} + +/** + * Poisson exponential family. + * The default link for the Poisson family is the log link. + * @param link a link function instance + */ +private[ml] class Poisson(link: Link = new Logit) extends Family(link) { --- End diff -- I believe the link function here should default to `Log` not `Logit`
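The difference between the two defaults can be seen in a small Python sketch, shown here under stated assumptions rather than as the PR's Scala code. Only `deriv` and the working-response formula `eta + (y - mu) * link.deriv(mu)` come from the quoted diff; the method names `link` and `unlink` and the function `working_response` are hypothetical.

```python
import math


class Log:
    """Log link: eta = log(mu), defined for any mu > 0."""

    def link(self, mu):
        return math.log(mu)

    def deriv(self, mu):
        # d(eta)/d(mu) for the log link.
        return 1.0 / mu

    def unlink(self, eta):
        return math.exp(eta)


class Logit:
    """Logit link: eta = log(mu / (1 - mu)), defined only for 0 < mu < 1."""

    def link(self, mu):
        return math.log(mu / (1.0 - mu))

    def deriv(self, mu):
        # d(eta)/d(mu) for the logit link.
        return 1.0 / (mu * (1.0 - mu))

    def unlink(self, eta):
        return 1.0 / (1.0 + math.exp(-eta))


def working_response(link, y, mu, eta):
    # Same formula as `z` in the quoted diff.
    return eta + (y - mu) * link.deriv(mu)


poisson_mean = 3.0  # A Poisson mean can be any positive number.
eta_log = Log().link(poisson_mean)
z_log = working_response(Log(), y=4.0, mu=poisson_mean, eta=eta_log)

try:
    Logit().link(poisson_mean)
    logit_defined = True
except ValueError:
    # log of a negative number: the logit link has no value for mu > 1.
    logit_defined = False
```

Because a Poisson mean is not confined to the interval (0, 1), the logit link is undefined for typical values of `mu`, which is consistent with the class comment stating that the default for the Poisson family is the log link.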
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10650#issuecomment-169847124 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48989/ Test FAILed.
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10650#issuecomment-169847116 **[Test build #48989 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48989/consoleFull)** for PR 10650 at commit [`87fc0ff`](https://github.com/apache/spark/commit/87fc0ffb67e6538b2b850e0fd36ba6e2c63fc549). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10650#issuecomment-169847122 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10615#issuecomment-169850361 **[Test build #48975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48975/consoleFull)** for PR 10615 at commit [`319e0ed`](https://github.com/apache/spark/commit/319e0edb17d02eb994bc1cd104a29df8c47a9c59). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-169851002 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48964/ Test FAILed.
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-169851001 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/10615#discussion_r49147677 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -0,0 +1,341 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.nio.charset.UnsupportedCharsetException +import java.io.File +import java.sql.Timestamp + +import org.apache.spark.SparkException +import org.apache.spark.sql.{DataFrame, QueryTest, Row} +import org.apache.spark.sql.test.{SQLTestUtils, SharedSQLContext} +import org.apache.spark.sql.types._ + +class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { + private val carsFile = "cars.csv" + private val carsFile8859 = "cars_iso-8859-1.csv" + private val carsTsvFile = "cars.tsv" + private val carsAltFile = "cars-alternative.csv" + private val carsUnbalancedQuotesFile = "cars-unbalanced-quotes.csv" + private val carsNullFile = "cars-null.csv" + private val emptyFile = "empty.csv" + private val commentsFile = "comments.csv" + private val disableCommentsFile = "disable_comments.csv" + + private def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + /** Verifies data and schema. 
*/ + private def verifyCars( + df: DataFrame, + withHeader: Boolean, + numCars: Int = 3, + numFields: Int = 5, + checkHeader: Boolean = true, + checkValues: Boolean = true, + checkTypes: Boolean = false): Unit = { + +val numColumns = numFields +val numRows = if (withHeader) numCars else numCars + 1 +// schema +assert(df.schema.fieldNames.length === numColumns) +assert(df.collect().length === numRows) + +if (checkHeader) { + if (withHeader) { +assert(df.schema.fieldNames === Array("year", "make", "model", "comment", "blank")) + } else { +assert(df.schema.fieldNames === Array("C0", "C1", "C2", "C3", "C4")) + } +} + +if (checkValues) { + val yearValues = List("2012", "1997", "2015") + val actualYears = if (!withHeader) "year" :: yearValues else yearValues + val years = if (withHeader) df.select("year").collect() else df.select("C0").collect() + + years.zipWithIndex.foreach { case (year, index) => +if (checkTypes) { + assert(year === Row(actualYears(index).toInt)) +} else { + assert(year === Row(actualYears(index))) +} + } +} + } + + test("simple csv test") { +val cars = sqlContext + .read + .format("csv") + .option("header", "false") + .load(testFile(carsFile)) + +verifyCars(cars, withHeader = false, checkTypes = false) + } + + test("simple csv test with type inference") { +val cars = sqlContext + .read + .format("csv") + .option("header", "true") + .option("inferSchema", "true") + .load(testFile(carsFile)) + +verifyCars(cars, withHeader = true, checkTypes = true) + } + + test("test with alternative delimiter and quote") { +val cars = sqlContext.read + .format("csv") + .options(Map("quote" -> "\'", "delimiter" -> "|", "header" -> "true")) + .load(testFile(carsAltFile)) + +verifyCars(cars, withHeader = true) + } + + test("bad encoding name") { +val exception = intercept[UnsupportedCharsetException] { + sqlContext +.read +.format("csv") +.option("charset", "1-9588-osi") +.load(testFile(carsFile8859)) +} + +assert(exception.getMessage.contains("1-9588-osi")) + } + + 
ignore("test different encoding") { +
[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/10615#discussion_r49147704 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -0,0 +1,341 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.nio.charset.UnsupportedCharsetException +import java.io.File +import java.sql.Timestamp + +import org.apache.spark.SparkException +import org.apache.spark.sql.{DataFrame, QueryTest, Row} +import org.apache.spark.sql.test.{SQLTestUtils, SharedSQLContext} +import org.apache.spark.sql.types._ + +class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { + private val carsFile = "cars.csv" + private val carsFile8859 = "cars_iso-8859-1.csv" + private val carsTsvFile = "cars.tsv" + private val carsAltFile = "cars-alternative.csv" + private val carsUnbalancedQuotesFile = "cars-unbalanced-quotes.csv" + private val carsNullFile = "cars-null.csv" + private val emptyFile = "empty.csv" + private val commentsFile = "comments.csv" + private val disableCommentsFile = "disable_comments.csv" + + private def testFile(fileName: String): String = { + Thread.currentThread().getContextClassLoader.getResource(fileName).toString + } + + /** Verifies data and schema. 
*/ + private def verifyCars( + df: DataFrame, + withHeader: Boolean, + numCars: Int = 3, + numFields: Int = 5, + checkHeader: Boolean = true, + checkValues: Boolean = true, + checkTypes: Boolean = false): Unit = { + +val numColumns = numFields +val numRows = if (withHeader) numCars else numCars + 1 +// schema +assert(df.schema.fieldNames.length === numColumns) +assert(df.collect().length === numRows) + +if (checkHeader) { + if (withHeader) { +assert(df.schema.fieldNames === Array("year", "make", "model", "comment", "blank")) + } else { +assert(df.schema.fieldNames === Array("C0", "C1", "C2", "C3", "C4")) + } +} + +if (checkValues) { + val yearValues = List("2012", "1997", "2015") + val actualYears = if (!withHeader) "year" :: yearValues else yearValues + val years = if (withHeader) df.select("year").collect() else df.select("C0").collect() + + years.zipWithIndex.foreach { case (year, index) => +if (checkTypes) { + assert(year === Row(actualYears(index).toInt)) +} else { + assert(year === Row(actualYears(index))) +} + } +} + } + + test("simple csv test") { +val cars = sqlContext + .read + .format("csv") + .option("header", "false") + .load(testFile(carsFile)) + +verifyCars(cars, withHeader = false, checkTypes = false) + } + + test("simple csv test with type inference") { +val cars = sqlContext + .read + .format("csv") + .option("header", "true") + .option("inferSchema", "true") + .load(testFile(carsFile)) + +verifyCars(cars, withHeader = true, checkTypes = true) + } + + test("test with alternative delimiter and quote") { +val cars = sqlContext.read + .format("csv") + .options(Map("quote" -> "\'", "delimiter" -> "|", "header" -> "true")) + .load(testFile(carsAltFile)) + +verifyCars(cars, withHeader = true) + } + + test("bad encoding name") { +val exception = intercept[UnsupportedCharsetException] { + sqlContext +.read +.format("csv") +.option("charset", "1-9588-osi") +.load(testFile(carsFile8859)) +} + +assert(exception.getMessage.contains("1-9588-osi")) + } + + 
ignore("test different encoding") { +
[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/10615#discussion_r49147610 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala --- @@ -0,0 +1,243 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.csv + +import java.io.{OutputStreamWriter, ByteArrayOutputStream, StringReader} + +import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, CsvParser, CsvWriter} + +import org.apache.spark.Logging + +/** + * Read and parse CSV-like input + * + * @param params Parameters object + * @param headers headers for the columns + */ +private[sql] abstract class CsvReader(params: CSVParameters, headers: Seq[String]) { + + protected lazy val parser: CsvParser = { +val settings = new CsvParserSettings() +val format = settings.getFormat +format.setDelimiter(params.delimiter) +format.setLineSeparator(params.rowSeparator) +format.setQuote(params.quote) +format.setQuoteEscape(params.escape) +format.setComment(params.comment) + settings.setIgnoreLeadingWhitespaces(params.ignoreLeadingWhiteSpaceFlag) + settings.setIgnoreTrailingWhitespaces(params.ignoreTrailingWhiteSpaceFlag) +settings.setReadInputOnSeparateThread(false) +settings.setInputBufferSize(params.inputBufferSize) +settings.setMaxColumns(params.maxColumns) +settings.setNullValue(params.nullValue) +settings.setMaxCharsPerColumn(params.maxCharsPerColumn) +if (headers != null) settings.setHeaders(headers: _*) + +new CsvParser(settings) + } +} + +/** + * Converts a sequence of string to CSV string + * + * @param params Parameters object for configuration + * @param headers headers for columns + */ +private[sql] class LineCsvWriter(params: CSVParameters, headers: Seq[String]) extends Logging { + private val writerSettings = new CsvWriterSettings + private val format = writerSettings.getFormat + + format.setDelimiter(params.delimiter) + format.setLineSeparator(params.rowSeparator) + format.setQuote(params.quote) + format.setQuoteEscape(params.escape) + format.setComment(params.comment) + + writerSettings.setNullValue(params.nullValue) + writerSettings.setEmptyValue(params.nullValue) + writerSettings.setSkipEmptyLines(true) + 
writerSettings.setQuoteAllFields(false) + writerSettings.setHeaders(headers: _*) + + def writeRow(row: Seq[String], includeHeader: Boolean): String = { +val buffer = new ByteArrayOutputStream() +val outputWriter = new OutputStreamWriter(buffer) +val writer = new CsvWriter(outputWriter, writerSettings) + +if (includeHeader) { + writer.writeHeaders() +} +writer.writeRow(row.toArray: _*) +writer.close() +buffer.toString.stripLineEnd + } +} + +/** + * Parser for parsing a line at a time. Not efficient for bulk data. + * + * @param params Parameters object + */ +private[sql] class LineCsvReader(params: CSVParameters) + extends CsvReader(params, null) { + /** +* parse a line +* +* @param line a String with no newline at the end +* @return array of strings where each string is a field in the CSV record +*/ + def parseLine(line: String): Array[String] = { +parser.beginParsing(new StringReader(line)) +val parsed = parser.parseNext() +parser.stopParsing() +parsed + } +} + +/** + * Parser for parsing lines in bulk. Use this when efficiency is desired. + * + * @param iter iterator over lines in the file + * @param params Parameters object + * @param headers headers for the columns + */ +private[sql] class BulkCsvReader( +iter: Iterator[String], +params: CSVParameters,
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49149092 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. 
+""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- Ah, it seems that the JavaModelWrapper call method being used won't work on the workers. I'll have to port the predict method over.
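The point about the wrapper not working on the workers can be illustrated with a small analogy in plain Python, with no Spark involved. Everything here is hypothetical: `DriverOnlyModel`, `nearest_center`, and the lock standing in for a driver-side handle are illustrative names, and the nearest-center rule is an assumed stand-in, not the actual bisecting k-means prediction logic. The sketch only shows why an object that holds a non-serializable handle cannot be shipped as-is, while plain data plus a pure-Python function can.

```python
import pickle
import threading


def nearest_center(centers, x):
    """Pure-Python prediction that needs only plain data (assumed rule)."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))


class DriverOnlyModel:
    """Hypothetical model wrapping a handle that cannot be pickled."""

    def __init__(self, centers):
        self.centers = centers
        # Stands in for a driver-side handle; locks are not picklable.
        self._handle = threading.Lock()

    def predict(self, x):
        return nearest_center(self.centers, x)


model = DriverOnlyModel([0.0, 10.0])

# Serializing the whole model fails because of the handle it carries.
try:
    pickle.dumps(model)
    model_is_picklable = True
except TypeError:
    model_is_picklable = False

# Serializing only the plain data works, so the prediction can be
# recomputed wherever that data is deserialized.
centers = pickle.loads(pickle.dumps(model.centers))
labels = [nearest_center(centers, x) for x in [0.5, 9.0]]
```

An alternative design, also only a suggestion, is to hand the whole collection to the wrapped model in a single driver-side call instead of mapping a Python closure over it.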
[GitHub] spark pull request: [SPARK-12688][SQL] Fix spill size metric in un...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10634#issuecomment-169857794 **[Test build #48992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48992/consoleFull)** for PR 10634 at commit [`416d73d`](https://github.com/apache/spark/commit/416d73d954155ebff8f5f75c99cbfc61a24ad818).
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10650#issuecomment-169857832 **[Test build #48991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48991/consoleFull)** for PR 10650 at commit [`87fc0ff`](https://github.com/apache/spark/commit/87fc0ffb67e6538b2b850e0fd36ba6e2c63fc549).
[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/10652#issuecomment-169861623 jenkins, retest this please
[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10635#issuecomment-169861548 Does this actually let you use one source to compile against both versions of Spark?
[GitHub] spark pull request: [SPARK-12507][Streaming][Document]Expose close...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10453#issuecomment-16984 **[Test build #48980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48980/consoleFull)** for PR 10453 at commit [`28a750d`](https://github.com/apache/spark/commit/28a750d61c058e537a8ca44babb3ff0f4b54f3b3).
[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...
Github user iyounus commented on a diff in the pull request: https://github.com/apache/spark/pull/10274#discussion_r49140607 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -94,8 +110,7 @@ private[ml] class WeightedLeastSquares( if (standardizeFeatures) { lambda *= aVar(j - 2) } - if (standardizeLabel) { -// TODO: handle the case when bStd = 0 + if (standardizeLabel && bStd != 0) { --- End diff --

@dbtsai The problem here is that for regularized regression in R, I need to use `glmnet`. But for this specific case (constant label, no intercept and no regularization), the results from `glmnet` do not match those from `lm`. So I see a discrepancy within R itself. Have a look at the following R code:

```
A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b <- c(17, 17, 17, 17)
w <- c(1, 2, 3, 4)
df <- as.data.frame(cbind(A, b))
lm.model <- lm(b ~ . -1, data=df, weights=w)
print(as.vector(coef(lm.model)))
[1] -9.221298 3.394343
glm.model <- glmnet(A, b, weights=w, intercept=FALSE, lambda=0, standardize=FALSE, alpha=0, thresh=1E-14)
print(as.vector(coef(glm.model)))
[1] 0 0 0
```

Note that in this example, I expect the same results from both `lm` and `glmnet` because I've set `lambda=0` in `glmnet`. (BTW, `standardize` has no effect here.) It seems to me that `glmnet` just sets all coefficients to zero if the label is constant and the intercept is not included. This is true even if I include regularization. Right now `WeightedLeastSquares` (without regularization) matches `lm`, and I think this is the correct behaviour given my understanding of the normal equation. With regularization, it should still give some non-zero coefficients, which it does. I don't know why `glmnet` behaves differently, but I don't think we should try to match that in this particular case.
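As a cross-check of the `lm` figures quoted above, the weighted normal equation (A^T W A) x = A^T W b can be solved by hand for this two-column case. The following is a minimal sketch in plain Python using the same A, b and w as the R snippet; the helper name `weighted_normal_equation_2d` is made up for illustration and is not part of Spark or R.

```python
# Data from the R snippet; R's matrix() fills column by column,
# so the columns of A are (0, 1, 2, 3) and (5, 7, 11, 13).
rows = [(0.0, 5.0), (1.0, 7.0), (2.0, 11.0), (3.0, 13.0)]
b = [17.0, 17.0, 17.0, 17.0]
w = [1.0, 2.0, 3.0, 4.0]


def weighted_normal_equation_2d(rows, b, w):
    """Solve (A^T W A) x = A^T W b for a two-column A via Cramer's rule."""
    # Entries of the symmetric 2x2 matrix A^T W A.
    s11 = sum(wi * r[0] * r[0] for r, wi in zip(rows, w))
    s12 = sum(wi * r[0] * r[1] for r, wi in zip(rows, w))
    s22 = sum(wi * r[1] * r[1] for r, wi in zip(rows, w))
    # Entries of the vector A^T W b.
    t1 = sum(wi * r[0] * bi for r, bi, wi in zip(rows, b, w))
    t2 = sum(wi * r[1] * bi for r, bi, wi in zip(rows, b, w))
    det = s11 * s22 - s12 * s12
    return ((t1 * s22 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


coef = weighted_normal_equation_2d(rows, b, w)
print(coef)
```

The solution works out to -22168/2404 and 8160/2404, which agrees with the `lm` output to the printed precision and is clearly non-zero even though the label is constant.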
[GitHub] spark pull request: [SPARK-11923][ML] Python API for ml.feature.Ch...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10186#issuecomment-169840549 LGTM cc @jkbradley
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10653#issuecomment-169842587 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48984/ Test FAILed.
[GitHub] spark pull request: [SPARK-10873] Support column sort and search f...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10648#issuecomment-169842729 **[Test build #48976 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48976/consoleFull)** for PR 10648 at commit [`4322851`](https://github.com/apache/spark/commit/4322851fa7a253e7422c8f910d96a0f99a3728cd). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...
Github user maropu commented on the pull request: https://github.com/apache/spark/pull/10635#issuecomment-169842632 retest this please
[GitHub] spark pull request: [SPARK-12576][SQL] Enable expression parsing i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10649#issuecomment-169846366 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48983/ Test FAILed.
[GitHub] spark pull request: [SPARK-12576][SQL] Enable expression parsing i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10649#issuecomment-169846245 **[Test build #48983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48983/consoleFull)** for PR 10649 at commit [`c2b35b7`](https://github.com/apache/spark/commit/c2b35b7efdd80ab4930b46a437bb9289c87b5206). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12576][SQL] Enable expression parsing i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10649#issuecomment-169846362 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/10615#issuecomment-169855191 Cool!
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10653#issuecomment-169859848 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12604] [CORE] Addendum - use casting vs...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10641#issuecomment-169859912 Alright merging this. Thanks.
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10653#issuecomment-169859850 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48986/ Test PASSed.
[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10651#issuecomment-169836663 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48982/ Test FAILed.
[GitHub] spark pull request: [SPARK-12507][Streaming][Document]Expose close...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10453#issuecomment-169836865 **[Test build #48980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48980/consoleFull)** for PR 10453 at commit [`28a750d`](https://github.com/apache/spark/commit/28a750d61c058e537a8ca44babb3ff0f4b54f3b3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10651#issuecomment-169836657 **[Test build #48982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48982/consoleFull)** for PR 10651 at commit [`9582e49`](https://github.com/apache/spark/commit/9582e49a5a5a5de2aed3c56adbd6ec54651115b4). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49140159 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. 
+""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- Also, maybe you can add a test for this case in the docstring.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49140117
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.train(sc.parallelize(data), k=4)
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0]))
+    True
+    >>> model.k
+    4
+    >>> model.computeCost(array([0.0, 0.0]))
+    0.0
+    >>> model.k == len(model.clusterCenters)
+    True
+    >>> model = bskm.train(sc.parallelize(data), k=2)
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.k
+    2
+
+    .. versionadded:: 2.0.0
+    """
+
+    @property
+    @since('2.0.0')
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    @since('2.0.0')
+    def k(self):
+        """Get the number of clusters"""
+        return self.call("k")
+
+    @since('2.0.0')
+    def predict(self, x):
+        """
+        Find the cluster to which x belongs in this model.
+
+        :param x: Either the point to determine the cluster for or an RDD of points to determine
+                  the clusters for.
+        """
+        if isinstance(x, RDD):
+            return x.map(self.predict(x))
--- End diff --
I am not sure I understand this line; shouldn't it be `x.map(self.predict)`?
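The difference the reviewer points at can be shown without Spark. The following is a minimal sketch, assuming a hypothetical `TinyModel` class in which a plain Python list stands in for the RDD and a threshold stands in for the real cluster lookup; none of these names come from the pull request. Passing the method itself lets `map` call it once per element, whereas calling it first re-enters the same branch with the same collection and never terminates normally.

```python
class TinyModel:
    """Hypothetical stand-in for the model; a list plays the role of the RDD."""

    def predict(self, x):
        if isinstance(x, list):
            # Pass the bound method itself: map invokes it once per element.
            return list(map(self.predict, x))
        # Single-point case (made-up rule, for illustration only).
        return 0 if x < 5.0 else 1

    def predict_buggy(self, x):
        if isinstance(x, list):
            # self.predict_buggy(x) is evaluated before map ever runs, with the
            # same list as argument, so it re-enters this branch without end.
            return list(map(self.predict_buggy(x), x))
        return 0 if x < 5.0 else 1


model = TinyModel()
labels = model.predict([0.0, 1.0, 9.0, 8.0])
```

In this sketch the buggy variant fails with a `RecursionError` rather than returning predictions, which is consistent with the suggestion to write `x.map(self.predict)`.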
[GitHub] spark pull request: [SPARK-12638] [API DOC] Parameter explanation ...
Github user Wenpei commented on the pull request: https://github.com/apache/spark/pull/10587#issuecomment-169839655 @srowen It passes the tests now and is ready for merge. Thanks for the review. Wenpei
[GitHub] spark pull request: [SPARK-12510][Streaming]Refactor ActorReceiver...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10457
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user ehsanmok commented on the pull request: https://github.com/apache/spark/pull/9916#issuecomment-169842052 @mengxr @srowen @jkbradley Why is reviewing this simple change, which is important for my application, taking so long?
[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10652#issuecomment-169842103 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10650#issuecomment-169844423 test this please
[GitHub] spark pull request: [SPARK-9835] [ML] IterativelyReweightedLeastSq...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/10639#discussion_r49144794
--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.linalg.BLAS._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+    val coefficients: DenseVector,
+    val intercept: Double) extends Serializable
+
+/**
+ * Fits a generalized linear model (GLM) for a given family using
+ * iteratively reweighted least squares (IRLS).
+ */
+private[ml] class IterativelyReweightedLeastSquares(
+    val family: Family,
+    val fitIntercept: Boolean,
+    val regParam: Double,
+    val standardizeFeatures: Boolean,
+    val standardizeLabel: Boolean,
+    val maxIter: Int,
+    val tol: Double) extends Logging with Serializable {
+
+  def fit(instances: RDD[Instance]): IterativelyReweightedLeastSquaresModel = {
+
+    val y = instances.map(_.label).persist(StorageLevel.MEMORY_AND_DISK)
+    val yMean = y.reduce(_ + _) / y.count()
+    var mu = y.map { yi => family.startingMu(yi, yMean) }
+    var eta = mu.map { mu => family.link.link(mu) }
--- End diff --
Pre-computing `eta` here seems unnecessary since it is re-assigned within the while loop before it is used.
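To make the review point concrete, here is a rough, self-contained sketch of an IRLS loop in plain Python. It is not the code from the pull request: it assumes an intercept-only Poisson model with a log link, and the starting values, function name, and loop shape are all hypothetical. What it illustrates is only the pattern the comment describes: when `eta` is derived from the current `mu` at the top of each iteration, any value assigned to `eta` before the loop is never read.

```python
import math


def fit_intercept_only_poisson(y, max_iter=25, tol=1e-10):
    """IRLS for an intercept-only Poisson GLM with log link (illustration only)."""
    y_mean = sum(y) / len(y)
    # Hypothetical starting values; each entry is positive when y_mean > 0.
    mu = [(yi + y_mean) / 2.0 for yi in y]
    beta = None
    for _ in range(max_iter):
        # eta is recomputed from mu here, so no pre-loop eta is needed.
        eta = [math.log(m) for m in mu]
        # Working response and working weights for the log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares with a single intercept column.
        new_beta = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
        mu = [math.exp(new_beta)] * len(y)
        converged = beta is not None and abs(new_beta - beta) < tol
        beta = new_beta
        if converged:
            break
    return beta


beta_hat = fit_intercept_only_poisson([1.0, 2.0, 3.0, 6.0])
```

For this toy model the fitted mean `exp(beta_hat)` should approach the sample mean of `y`, which gives a simple sanity check on the loop.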
[GitHub] spark pull request: [SPARK-12510][Streaming]Refactor ActorReceiver...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/10457#issuecomment-169855024 I was having trouble with the "Exceeded Github API rate limit" error. Thanks for merging.
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10650#issuecomment-169854996 test this please
[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10609#issuecomment-169855045 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48979/ Test PASSed.
[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10609#issuecomment-169855043 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10609#issuecomment-169854898 **[Test build #48979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48979/consoleFull)** for PR 10609 at commit [`4e4e9a1`](https://github.com/apache/spark/commit/4e4e9a136ffae30665979df7307a6175188690f7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10653#issuecomment-169859598 **[Test build #48986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48986/consoleFull)** for PR 10653 at commit [`ade6f5d`](https://github.com/apache/spark/commit/ade6f5d354985f3778e0c8c2da80679c76495f0a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12692][BUILD] Scala style: check no whi...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/10643#issuecomment-169861163
Warnings are displayed as follows:
```
[warn] /home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberConverter.scala:125:29: Space before token ,
[warn] /home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:52:20: Space before token :
[warn] /home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:119:23: Space before token :
[warn] /home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:389:22: Space before token :
[warn] /home/sarutak/work/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala:206:39: Space before token ,
```
[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10635#issuecomment-169861160 **[Test build #48988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48988/consoleFull)** for PR 10635 at commit [`8bdd481`](https://github.com/apache/spark/commit/8bdd48189f96a45db54bc8d11e16107b0d15318f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/10651#issuecomment-169833274 Jenkins, test this please
[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10472#issuecomment-169837135 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48977/ Test PASSed.
[GitHub] spark pull request: [SPARK-12576][SQL] Enable expression parsing i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10649#issuecomment-169838962 **[Test build #48983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48983/consoleFull)** for PR 10649 at commit [`c2b35b7`](https://github.com/apache/spark/commit/c2b35b7efdd80ab4930b46a437bb9289c87b5206).
[GitHub] spark pull request: [SPARK-10873] Support column sort and search f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10648#issuecomment-169842836 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48976/ Test FAILed.
[GitHub] spark pull request: [SPARK-10873] Support column sort and search f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10648#issuecomment-169842834 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10650#issuecomment-169845842 **[Test build #48989 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48989/consoleFull)** for PR 10650 at commit [`87fc0ff`](https://github.com/apache/spark/commit/87fc0ffb67e6538b2b850e0fd36ba6e2c63fc549).
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user gracew commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r49144128
--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,53 @@ Finer-grained cache persistence levels.
 """
+import os
+import re
+import sys
+
+from os.path import isfile, join
+
+import xml.etree.ElementTree as ET
+
+if os.environ.get("SPARK_HOME") is None:
+    raise ImportError("Environment variable SPARK_HOME is undefined.")
+
+spark_home = os.environ['SPARK_HOME']
+pom_xml_file_path = join(spark_home, 'pom.xml')
+snapshot_version = None
+
+if isfile(pom_xml_file_path):
+    try:
+        tree = ET.parse(pom_xml_file_path)
+        root = tree.getroot()
+        version_tag = root[4].text
+        snapshot_version = version_tag[:5]
+    except:
+        raise ImportError("Could not read the spark version, because pom.xml file" +
+                          " could not be read.")
+else:
+    try:
+        lib_file_path = join(spark_home, "lib")
--- End diff --
@alope107, would you mind updating this PR to remove the pom_xml_file_path branch? Thanks!
[GitHub] spark pull request: [SPARK-9835] [ML] IterativelyReweightedLeastSq...
Github user sethah commented on the pull request: https://github.com/apache/spark/pull/10639#issuecomment-169848523 @yanboliang Could you post a link to a reference paper? I find documentation on IRLS scattered, so it would be nice to have something concrete to point to.
[GitHub] spark pull request: [SPARK-12701] [CORE] FileAppender should use j...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10654#issuecomment-169852147 **[Test build #48990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48990/consoleFull)** for PR 10654 at commit [`d937d09`](https://github.com/apache/spark/commit/d937d09f3f5aab96361cee93d0a376c25c72).
[GitHub] spark pull request: [SPARK-12604] [CORE] Addendum - use casting vs...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10641#issuecomment-169853728 **[Test build #2351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2351/consoleFull)** for PR 10641 at commit [`377fb49`](https://github.com/apache/spark/commit/377fb49a677f7f81699a7a9c05195cec9503af2b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10635#issuecomment-169861338 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48988/ Test PASSed.
[GitHub] spark pull request: [SPARK-11780][SQL] Add type aliases backwards ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10635#issuecomment-169861337 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9819][Streaming][Documentation] Clarify...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/8103#issuecomment-169832635 Sorry, I forgot about this PR completely. Just one more nit that I commented on.
[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10651#issuecomment-169832596 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9819][Streaming][Documentation] Clarify...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/8103#discussion_r49138016
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala ---
@@ -336,7 +336,8 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])(
    * However, it is applicable to only "invertible reduce functions".
    * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
    * @param reduceFunc associative reduce function
-   * @param invReduceFunc inverse function
+   * @param invReduceFunc inverse function; such that for all x, invertible y:
+   *                      `invReduceFunc(reduceFunc(x, y), y) = x`
--- End diff --
Why not:
reduceFunc("x", "y") = "xy" ... y is always added to the right
inverseReduceFunc("xy", "x") = "y" ... x is always removed from the left
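The string example in this comment can be turned into a small runnable sketch. The code below is plain Python and only illustrates the idea of an invertible reduce over a sliding window; it is not Spark's windowing implementation, and `sliding_reduce` and its parameters are hypothetical names. Concatenation is associative but not commutative, which is why the direction matters: new values enter on the right and the oldest value leaves from the left.

```python
def reduce_func(acc, new):
    # Concatenation: the new value is always appended on the right.
    return acc + new


def inv_reduce_func(acc, old):
    # Inverse for this setup: the oldest value is always removed from the left.
    if not acc.startswith(old):
        raise ValueError("old value is not a prefix of the accumulated value")
    return acc[len(old):]


def sliding_reduce(batches, window):
    """Return the reduced value of each full window, updated incrementally."""
    results = []
    acc = None
    for i, batch in enumerate(batches):
        # Fold the newest batch into the running value.
        acc = batch if acc is None else reduce_func(acc, batch)
        # Once the window is over-full, subtract the batch that slid out.
        if i >= window:
            acc = inv_reduce_func(acc, batches[i - window])
        if i >= window - 1:
            results.append(acc)
    return results


windows = sliding_reduce(["a", "b", "c", "d", "e"], window=3)
```

Each window here costs one `reduce_func` call and at most one `inv_reduce_func` call, instead of re-reducing all batches in the window, which is the benefit an inverse function is meant to provide.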
[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/10652

[SPARK-12699][SPARKR] R driver process should start in a clean state

Currently the R worker process is launched with the --vanilla option, which brings it up in a clean state (without init profile or workspace data, https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html). However, the R process for the Spark driver is not. We should do that because:
1. That would make the driver consistent with the worker process in R; for instance, a library would not be loaded in the driver but not in the worker
2. Since SparkR depends on .libPath and .First(), it could be broken by something in the user workspace, for example

Here are the changes proposed:
1. When starting `sparkR` shell (except: allow save/restore workspace, since the driver/shell is local)
2. When launching R driver in cluster mode
3. In cluster mode, when calling R to install shipped R package

This is discussed in PR #10171

@shivaram @sun-rui

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/felixcheung/spark rvanilla
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/10652.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #10652

commit c3488c9eda1f731c24769f20eb570d97e4aa5939
Author: felixcheung
Date: 2016-01-07T09:13:54Z
    add R command line options

commit 24fee57e42beec3315979b8db4d817474bcd4baa
Author: felixcheung
Date: 2016-01-07T22:40:50Z
    allow save/restore user workspace when running shell
[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10651#issuecomment-169832598 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48978/ Test FAILed.
[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10651#issuecomment-169836393 **[Test build #48982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48982/consoleFull)** for PR 10651 at commit [`9582e49`](https://github.com/apache/spark/commit/9582e49a5a5a5de2aed3c56adbd6ec54651115b4).
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10570#discussion_r49139498
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala ---
@@ -57,8 +57,8 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks {
     (result, expected) match {
       case (result: Array[Byte], expected: Array[Byte]) =>
         java.util.Arrays.equals(result, expected)
-      case (result: Double, expected: Spread[Double]) =>
-        expected.isWithin(result)
+      case (result: Double, expected: Spread[_]) => // Can't use Spread[Double] b/c of erasure
--- End diff --
I see. Sadly, I think this is not going to work here without extra work, and then it is not going to do what you want. This version of scalatest uses manifest to encode type information, and you would have to define it manually in this context:
```scala
implicit val x: Manifest[Int] = ???
stream shouldBe a [ReceiverInputDStream[Int @unchecked]]
```
but then the scalatest library is not aware of the `unchecked` annotation, and still throws a warning. Let's just have `_` in the suite file.
[GitHub] spark pull request: [SPARK-12654] sc.wholeTextFiles with spark.had...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10651#issuecomment-169836662 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10602#discussion_r49140273

--- Diff: python/pyspark/mllib/fpm.py ---
@@ -130,15 +133,21 @@ def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=320
 """
 Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
-:param data: The input data set, each element contains a sequnce of itemsets.
-:param minSupport: the minimal support level of the sequential pattern, any pattern appears
-more than (minSupport * size-of-the-dataset) times will be output (default: `0.1`)
-:param maxPatternLength: the maximal length of the sequential pattern, any pattern appears
-less than maxPatternLength will be output. (default: `10`)
-:param maxLocalProjDBSize: The maximum number of items (including delimiters used in
-the internal storage format) allowed in a projected database before local
-processing. If a projected database exceeds this size, another
-iteration of distributed prefix growth is run. (default: `3200`)
+:param data:
+ The input data set, each element contains a sequnce of itemsets.
+:param minSupport:
+ The minimal support level of the sequential pattern, any pattern appears more than
--- End diff --

The lines below have indentation issues.
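For readers following the docstring change in the diff above, here is a minimal sketch of the layout it moves toward: each `:param name:` on its own line, with the description indented consistently beneath it. The function `train_sketch` and its parameters are hypothetical and are not part of pyspark; the two-space description indent is an assumption about the intended convention, not something the message states.

```python
import inspect


def train_sketch(data, min_support=0.1):
    """
    Return the minimum number of occurrences implied by min_support.

    :param data:
      The input data set, a list of sequences.
    :param min_support:
      The minimal support level, as a fraction of the data set size.
      (default: 0.1)
    """
    # Hypothetical helper: scale the support fraction by the data set size.
    return min_support * len(data)


# inspect.getdoc strips the common leading indentation, so any description
# line that is indented inconsistently stands out in the cleaned-up text.
doc_lines = inspect.getdoc(train_sketch).splitlines()
```

Checking the cleaned docstring this way is one cheap method of spotting the kind of indentation drift the review comment points at.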
[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10602#discussion_r49140295

--- Diff: python/pyspark/mllib/recommendation.py ---
@@ -239,6 +239,17 @@ def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative
 product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by `blocks`.
+
+ :param iterations:
--- End diff --

indentation issues?
[GitHub] spark pull request: [SPARK-12699][SPARKR] R driver process should ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/10652#issuecomment-169839126 jenkins, retest this please
[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/10609#issuecomment-169840672 By the way, I will send another PR for branch 1.6 due to the conflicts of MimaExcludes.scala.
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10653#issuecomment-169842586 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12700] [SQL] embed condition into SMJ a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10653#issuecomment-169842552 **[Test build #48984 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48984/consoleFull)** for PR 10653 at commit [`a38d623`](https://github.com/apache/spark/commit/a38d623d7d57709f2f26b1189ff699c02bd0ca57). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10609#issuecomment-169843683 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10609#issuecomment-169843686 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48987/ Test FAILed.
[GitHub] spark pull request: [SPARK-12591][Streaming]Register OpenHashMapBa...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/10609#issuecomment-169843942 retest this please
[GitHub] spark pull request: [SPARK-12701] [CORE] FileAppender should use j...
GitHub user BryanCutler opened a pull request:

https://github.com/apache/spark/pull/10654

[SPARK-12701] [CORE] FileAppender should use join to ensure writing thread completion

Changed Logging FileAppender to use join in `awaitTermination` to ensure that thread is properly finished before returning.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/BryanCutler/spark fileAppender-join-thread-SPARK-12701

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10654.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #10654

commit d937d09f3f5aab96361cee93d0a376c25c72
Author: Bryan Cutler
Date: 2016-01-08T00:19:47Z

[SPARK-12701] Changed FileAppender to use join to sync thread completion instead of wait/notifyAll
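To illustrate the idea in the pull request above, here is a minimal Python sketch of waiting for a writer thread with `join` rather than a hand-rolled wait/notify handshake. It is not Spark's Scala `FileAppender`; the class `FileAppenderSketch` and its method names are hypothetical, chosen only to mirror the description.

```python
import io
import threading


class FileAppenderSketch:
    """Hypothetical stand-in: copies a source stream to a sink on a background thread."""

    def __init__(self, source, sink):
        self._source = source
        self._sink = sink
        self._stopped = False
        self._thread = threading.Thread(target=self._append, daemon=True)

    def start(self):
        self._thread.start()

    def _append(self):
        # Copy line by line until the source is exhausted or stop() was requested.
        for line in self._source:
            if self._stopped:
                break
            self._sink.write(line)

    def stop(self):
        self._stopped = True

    def await_termination(self):
        # join() returns only once _append has fully returned, so every write
        # issued by the thread has completed by the time the caller continues.
        self._thread.join()


source = io.StringIO("a\nb\nc\n")
sink = io.StringIO()
appender = FileAppenderSketch(source, sink)
appender.start()
appender.await_termination()
result = sink.getvalue()
```

With a wait/notify scheme the notifying code still runs inside the worker thread, so the waiter can wake up while the thread is still finishing; `join` avoids that window because it waits on the thread itself.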
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-169850939 **[Test build #48964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48964/consoleFull)** for PR 10238 at commit [`123d958`](https://github.com/apache/spark/commit/123d958ba05a36aebb2548f04418153979d243ed). * This patch **fails from timeout after a configured wait of `250m`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10615#issuecomment-169850601 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48975/ Test PASSed.
[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10615#issuecomment-169850598 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12420][SQL] Have a built-in CSV data so...
Github user mohitjaggi commented on the pull request: https://github.com/apache/spark/pull/10615#issuecomment-169852430 this is great...thanks @falaki
[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...
Github user Lewuathe commented on a diff in the pull request:

https://github.com/apache/spark/pull/9936#discussion_r49147215

--- Diff: python/pyspark/ml/tests.py ---
@@ -371,6 +378,103 @@ def test_fit_maximize_metric(self):
         self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1")
+class RegressorTest(PySparkTestCase):
+
+    def setupData(self):
+        try:
+            self.df
+        except AttributeError:
+            from pyspark.mllib.linalg import Vectors
+            sqlContext = SQLContext(self.sc)
+            self.df = sqlContext.createDataFrame([
+                (1.0, Vectors.dense(1.0)),
+                (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
+
+    def test_linear_regression(self):
+        self.setupData()
+        lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal")
+        model = lr.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_decision_tree_regressor(self):
+        self.setupData()
+        dt = DecisionTreeRegressor(maxDepth=2)
+        model = dt.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_random_forest_regressor(self):
+        self.setupData()
+        rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
+        model = rf.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+    def test_gbt_regressor(self):
+        self.setupData()
+        gbt = GBTRegressor(maxIter=5, maxDepth=2)
+        model = gbt.fit(self.df)
+        self.assertEquals(1, model.numFeatures)
+
+
+class ClassificationTest(PySparkTestCase):
+
+    def setupData(self):
+        try:
+            self.df
+        except AttributeError:
+            from pyspark.mllib.linalg import Vectors
--- End diff --

`Vectors` and `StringIndexer` are not used anywhere else. In my opinion, it is better not to expand their scope.
[GitHub] spark pull request: [SPARK-12604] [CORE] Addendum - use casting vs...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10641