[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16308 **[Test build #70396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70396/testReport)** for PR 16308 at commit [`4b6900c`](https://github.com/apache/spark/commit/4b6900cf6d182d87a545d736d320c6229fb8251d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16347 Thanks for submitting the ticket. In general I don't think the sortWithinPartitions property can carry over to writing out data, because one partition actually corresponds to more than one file. Can your use case be satisfied by adding an explicit sortBy? ``` df.write.sortBy(col).parquet(...) ```
[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16308 Merged build finished. Test PASSed.
[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16308 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70396/
[GitHub] spark pull request #16349: [Doc] bucketing is applicable to all file-based d...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/16349 [Doc] bucketing is applicable to all file-based data sources ## What changes were proposed in this pull request? Starting with Spark 2.1.0, the bucketing feature is available for all file-based data sources. This patch fixes some function docs that haven't yet been updated to reflect that. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark ds-doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16349.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16349 commit c8f1b42ec15af791de36a3e4311de424d2dd99de Author: Reynold Xin Date: 2016-12-20T08:02:48Z [Doc] bucketing is applicable to all file-based data sources
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70398/
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12775 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70397/
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12775 Merged build finished. Test FAILed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Merged build finished. Test FAILed.
[GitHub] spark issue #16349: [Doc] bucketing is applicable to all file-based data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16349 **[Test build #70399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70399/testReport)** for PR 16349 at commit [`c8f1b42`](https://github.com/apache/spark/commit/c8f1b42ec15af791de36a3e4311de424d2dd99de).
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15018 For the zero-weight values, can we do something similar to scikit-learn and remove them, as in https://github.com/amueller/scikit-learn/commit/2415100f79293bbbf52c12c36d63a6cf602cf3c4?
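The removal viirya suggests can be sketched in plain Scala; the `Point` shape and the sample data are hypothetical stand-ins for MLlib's (label, feature, weight) input, not the actual PAVA code:

```scala
// Hypothetical stand-in for MLlib's (label, feature, weight) triples.
case class Point(label: Double, feature: Double, weight: Double)

// Drop zero-weight points before running PAVA, mirroring the linked
// scikit-learn change: zero-weight points cannot affect any weighted
// mean, so removing them up front simplifies the pooling loop.
def dropZeroWeights(points: Seq[Point]): Seq[Point] =
  points.filter(_.weight > 0.0)

val kept = dropZeroWeights(Seq(
  Point(1.0, 0.0, 1.0), Point(2.0, 1.0, 0.0), Point(3.0, 2.0, 2.0)))
```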
[GitHub] spark issue #16232: [SPARK-18800][SQL] Fix UnsafeKVExternalSorter by correct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16232 **[Test build #70400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70400/testReport)** for PR 16232 at commit [`e70692d`](https://github.com/apache/spark/commit/e70692dd060ee137842e1fd16e49826967114060).
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 Thanks for the comment. I was trying to implement the following Hive QL in Spark SQL/API: ```sql set hive.exec.dynamic.partition.mode=nonstrict; set hive.mapred.mode = nonstrict; insert overwrite table target_table partition (day) select * from source_table distribute by day sort by id; ``` In Hive, `distribute by day` ensures that the records with the same "day" go to the same reducer, and `sort by id` ensures that the input to each reducer is sorted by "id". It works as expected: the number of reducers is no more than the cardinality of the "day" column, and I could confirm that the generated ORC file in each partition is sorted by "id". However, if I run the same query or its equivalent Spark code ([`repartition('day)` for `distribute by day`](https://github.com/apache/spark/blob/bfeccd80ef032cab3525037be3d3e42519619493/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2423) and [`sortWithinPartitions('id)` for `sort by id`](https://github.com/apache/spark/blob/bfeccd80ef032cab3525037be3d3e42519619493/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L990)) on Spark, we have the right number of writer tasks, one for each partition, and each task generates a single output file, but the generated ORC file is not properly sorted by "id", making the ORC index ineffective. > Can your use case be satisfied by adding an explicit sortBy? `sortBy` is for bucketed tables and requires `bucketBy`, so I'm not sure it's related to this issue regarding Hive compatibility.
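To make the ordering under discussion concrete, here is a plain-Scala sketch (not Spark code; the records are hypothetical) of what `repartition('day)` followed by `sortWithinPartitions('id)` is expected to produce: one group per day, each group sorted by id.

```scala
case class Record(day: String, id: Int)

// Each Seq in the result models the input one writer task receives:
// groupBy is the analogue of repartition('day), the per-group sortBy
// is the analogue of sortWithinPartitions('id).
def partitionAndSort(records: Seq[Record]): Map[String, Seq[Record]] =
  records.groupBy(_.day).map { case (d, rs) => d -> rs.sortBy(_.id) }

val parts = partitionAndSort(Seq(
  Record("d1", 3), Record("d2", 1), Record("d1", 1), Record("d2", 2)))
```

The reported bug is that the per-group ordering this sketch guarantees is not preserved in the files the writer tasks actually emit.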
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16232 **[Test build #70401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70401/testReport)** for PR 16232 at commit [`5a31e37`](https://github.com/apache/spark/commit/5a31e378e3de301dad768eee776dcb88a404).
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user lirui-intel commented on the issue: https://github.com/apache/spark/pull/12775 The new test passed locally and I can't find any failures in the Jenkins test report. Not sure what failed exactly.
[GitHub] spark issue #16336: [SPARK-18923][DOC][BUILD] Support skipping R/Python API ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16336 I think it's fine to make this change for consistency and convenience. It's minor. It'd be nice to document them in the README.md, briefly.
[GitHub] spark issue #16349: [Doc] bucketing is applicable to all file-based data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16349 **[Test build #70399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70399/testReport)** for PR 16349 at commit [`c8f1b42`](https://github.com/apache/spark/commit/c8f1b42ec15af791de36a3e4311de424d2dd99de). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16232 **[Test build #70400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70400/testReport)** for PR 16232 at commit [`e70692d`](https://github.com/apache/spark/commit/e70692dd060ee137842e1fd16e49826967114060). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16232 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70400/
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16232 Merged build finished. Test PASSed.
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16232 **[Test build #70401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70401/testReport)** for PR 16232 at commit [`5a31e37`](https://github.com/apache/spark/commit/5a31e378e3de301dad768eee776dcb88a404). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock ...
GitHub user xuanyuanking opened a pull request: https://github.com/apache/spark/pull/16350 [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for each table's relation in cache ## What changes were proposed in this pull request? Backport of #16135 to branch-2.0 ## How was this patch tested? Because of the diff between branch-2.0 and master/2.1, this backport adds a multi-thread table-access test in `HiveMetadataCacheSuite` and checks that the relation is loaded only once, using metrics in `HiveCatalogMetrics`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/xuanyuanking/spark SPARK-18700-2.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16350.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16350 commit 132d12ee1457c41a0bec56516ab5a41d36d8ac1f Author: xuanyuanking Date: 2016-12-20T10:50:03Z SPARK-18700: Add StripedLock for each table's relation in cache
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16232 Merged build finished. Test PASSed.
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16232 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70401/
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16350 **[Test build #70402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70402/consoleFull)** for PR 16350 at commit [`132d12e`](https://github.com/apache/spark/commit/132d12ee1457c41a0bec56516ab5a41d36d8ac1f).
[GitHub] spark issue #16135: [SPARK-18700][SQL] Add StripedLock for each table's rela...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/16135 @hvanhovell Sure, I opened a new BACKPORT-2.0 PR. There's a small diff in branch-2.0: the unit test in this patch relies on `HiveCatalogMetrics`, which is not present in 2.0, so I also added the metric the test needs. Thanks for checking.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93215941 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) + } else { +require(y > 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +y + } +} + +override def variance(mu: Double): Double = math.pow(mu, variancePower) + +private def yp(y: Double, mu: Double, p: Double): Double = { + (math.pow(y, p) - math.pow(mu, p)) / p +} + +// Force y >= 0.1 for deviance to work for (1 - variancePower). see tweedie()$dev.resid +override def deviance(y: Double, mu: Double, weight: Double): Double = { + 2.0 * weight * +(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower)) +} + +// This depends on the density of the tweedie distribution. Not yet implemented. +override def aic( +predictions: RDD[(Double, Double, Double)], +deviance: Double, +numInstances: Double, +weightSum: Double): Double = { + 0.0 +} + +override def project(mu: Double): Double = { + if (mu < epsilon) { +epsilon + } else if (mu.isInfinity) { +Double.MaxValue --- End diff -- Out of curiosity is this meaningful to "cap" at Double.MaxValue? By the time you get there a lot of stuff is going to be infinite or not meaningful. 
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93216003 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 --- End diff -- This is a global shared variable -- we really can't do this.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93215641 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) --- End diff -- If we're going to use this magic 0.1 constant in many places, factor out a constant? 0.1 seems quite large as an 'epsilon' but I guess that's what R's implementation uses for whatever reason?
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93215688 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 + +override def initialize(y: Double, weight: Double): Double = { + if (variancePower > 1.0 && variancePower < 2.0) { +require(y >= 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +math.max(y, 0.1) + } else { +require(y > 0.0, "The response variable of the specified Tweedie distribution " + + s"should be non-negative, but got $y") +y + } +} + +override def variance(mu: Double): Double = math.pow(mu, variancePower) + +private def yp(y: Double, mu: Double, p: Double): Double = { + (math.pow(y, p) - math.pow(mu, p)) / p +} + +// Force y >= 0.1 for deviance to work for (1 - variancePower). see tweedie()$dev.resid +override def deviance(y: Double, mu: Double, weight: Double): Double = { + 2.0 * weight * +(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower)) +} + +// This depends on the density of the tweedie distribution. Not yet implemented. +override def aic( +predictions: RDD[(Double, Double, Double)], +deviance: Double, +numInstances: Double, +weightSum: Double): Double = { + 0.0 --- End diff -- Throw an UnsupportedOperationException?
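As a hedged illustration of the shared-`var` concern raised above, the variance and deviance can take `variancePower` as an explicit argument; the expressions follow the diff, but the function shapes below are a sketch, not the merged API:

```scala
// variancePower threaded through as a parameter instead of living in a
// mutable object-level var, so concurrent fits cannot interfere.
def variance(mu: Double, variancePower: Double): Double =
  math.pow(mu, variancePower)

def yp(y: Double, mu: Double, p: Double): Double =
  (math.pow(y, p) - math.pow(mu, p)) / p

// Same deviance expression as the diff; y is floored at 0.1 so the
// (1 - variancePower) power stays defined at y = 0.
def deviance(y: Double, mu: Double, weight: Double, variancePower: Double): Double =
  2.0 * weight *
    (y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower))
```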
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/16350 Maybe we should just drop the UT (so we don't have to add the metrics). cc @ericl WDYT?
[GitHub] spark pull request #16351: [SPARK-18943][SQL] Avoid per-record type dispatch...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/16351

[SPARK-18943][SQL] Avoid per-record type dispatch in CSV when reading

## What changes were proposed in this pull request?

`CSVRelation.csvParser` does type dispatch for each value in each row. We can avoid this because the schema is already kept in `CSVRelation`. So, this PR proposes that converters be created first according to the schema, and then applied to each record.

## How was this patch tested?

Tests in `CSVTypeCastSuite` and `CSVRelation`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark type-dispatch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16351.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16351

commit e72d1bc419dfd6da7f6e298d5b5412dba69eb5ad
Author: hyukjinkwon
Date: 2016-12-20T11:54:05Z

    Avoid per-record type dispatch in CSV when reading
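The PR's core idea (build per-column converters once from the schema, then reuse them for every record) can be illustrated without Spark. The toy column types and object names below are assumptions for illustration, not the PR's actual code:

```scala
// Build one converter per column up front, instead of matching on the
// column type for every field of every record.
object CsvConvertSketch {
  type ValueConverter = String => Any

  // Toy schema types standing in for Spark's DataType hierarchy.
  sealed trait ColType
  case object IntCol extends ColType
  case object DoubleCol extends ColType
  case object StringCol extends ColType

  // Type dispatch happens here, exactly once per column.
  def makeConverters(schema: Seq[ColType]): Array[ValueConverter] =
    schema.map { ct =>
      val conv: ValueConverter = ct match {
        case IntCol    => s => s.toInt
        case DoubleCol => s => s.toDouble
        case StringCol => s => s
      }
      conv
    }.toArray

  // Per record: just index into the prebuilt array, no per-value dispatch.
  def convertRow(converters: Array[ValueConverter], tokens: Array[String]): Array[Any] =
    tokens.zipWithIndex.map { case (t, i) => converters(i)(t) }
}
```

The per-record hot path is reduced to array indexing plus a function call, which is the saving the PR description claims.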
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16351

**[Test build #70403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70403/testReport)** for PR 16351 at commit [`e72d1bc`](https://github.com/apache/spark/commit/e72d1bc419dfd6da7f6e298d5b5412dba69eb5ad).
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r93229282

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }

-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    // index of the end of the block and blockBounds(end) gives the index of the start of the
+    // block. Entries that are not the start or end of the block are meaningless. The idea is that
```
--- End diff --

I'm still not sure about this comment -- how can `blockBounds(x)` be both the start and end of a block? The implementation below is identical.
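One way to read the contested comment: for a block spanning indices `[s, e]`, `blockBounds(s) == e` and `blockBounds(e) == s`, so the same array answers both "where does this block end?" (queried at a start index) and "where does it start?" (queried at an end index). A minimal, hypothetical sketch of that encoding (not the PR's implementation):

```scala
// Toy illustration of the blockBounds encoding discussed in the review.
object BlockBoundsSketch {
  // Start with n singleton blocks: each index is both its own start and end.
  def init(n: Int): Array[Int] = Array.tabulate(n)(identity)

  def blockStart(b: Array[Int], end: Int): Int = b(end)
  def blockEnd(b: Array[Int], start: Int): Int = b(start)

  // Merge the block starting at s1 with the adjacent block starting at s2.
  // Only the two outer entries need updating; interior entries go stale,
  // which is fine because they are never queried.
  def merge(b: Array[Int], s1: Int, s2: Int): Unit = {
    val e = blockEnd(b, s2)
    b(s1) = e   // new end of the merged block
    b(e) = s1   // new start of the merged block
  }
}
```

Merging is O(1) regardless of block size, which is what makes a pool-adjacent-violators pass over the array linear.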
[GitHub] spark pull request #16351: [SPARK-18943][SQL] Avoid per-record type dispatch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93230247

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---

```diff
@@ -215,84 +215,133 @@ private[csv] object CSVInferSchema {
 }

 private[csv] object CSVTypeCast {
+  // A `ValueConverter` is responsible for converting the given value to a desired type.
+  private type ValueConverter = String => Any

   /**
-   * Casts given string datum to specified type.
-   * Currently we do not support complex types (ArrayType, MapType, StructType).
+   * Create converters which cast each given string datum to each specified type in given schema.
+   * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`).
    *
-   * For string types, this is simply the datum. For other types.
+   * For string types, this is simply the datum.
+   * For other types, this is converted into the value according to the type.
    * For other nullable types, returns null if it is null or equals to the value specified
    * in `nullValue` option.
    *
-   * @param datum string value
-   * @param name field name in schema.
-   * @param castType data type to cast `datum` into.
-   * @param nullable nullability for the field.
+   * @param schema schema that contains data types to cast the given value into.
    * @param options CSV options.
    */
-  def castTo(
+  private[sql] def makeConverters(
```
--- End diff --

Oops, I can remove this access modifier. Will remove it soon, and the one below too.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16350

**[Test build #70402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70402/consoleFull)** for PR 16350 at commit [`132d12e`](https://github.com/apache/spark/commit/132d12ee1457c41a0bec56516ab5a41d36d8ac1f).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16350

Merged build finished. Test FAILed.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16350

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70402/

Test FAILed.
[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16233

We need a way to isolate the analysis of view text with a different context. Using a wrapper is one solution. My proposal doesn't introduce a wrapper; instead it applies the context in place, i.e., when we parse the view text in `SessionCatalog.lookupRelation`, we set the database of `UnresolvedRelation` right away, according to the view context (which only contains `currentDatabase` in the first version; we can add more information in the future).
[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16233

Hmm, it seems hard to apply the view context in place, considering things like CTEs. I think it's better to introduce an analysis context, which can easily limit the max depth of stacked views.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16351

**[Test build #70404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70404/testReport)** for PR 16351 at commit [`22c9a8a`](https://github.com/apache/spark/commit/22c9a8a9bb812eaa557aec09f3cf0ab25e97b3bf).
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16329

**[Test build #70405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70405/testReport)** for PR 16329 at commit [`2c1f182`](https://github.com/apache/spark/commit/2c1f1829677e74dc2dd2d7d67233b142f14007e8).
[GitHub] spark pull request #16323: [SPARK-18911] [SQL] Define CatalogStatistics to i...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16323#discussion_r93240303

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---

```diff
@@ -198,6 +200,10 @@ case class CatalogTable(
       locationUri, inputFormat, outputFormat, serde, compressed, properties))
   }

+  def withStats(cboStatsEnabled: Boolean): CatalogTable = {
```
--- End diff --

Yes, I also think that's better, but as @cloud-fan said, we can't get the config in `def statistics`; we would have to modify many places to support this. I'm about to do such modifications. Do you have any advice on minimizing the changes?
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16350

**[Test build #70406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70406/consoleFull)** for PR 16350 at commit [`80b8664`](https://github.com/apache/spark/commit/80b86646e0f1af8fb99d78aaf3f16dc7e752a99d).
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/16329

@marmbrus I have updated the pull request. The compiled docs can be found [here](https://aokolnychyi.github.io/spark-docs/sql-programming-guide.html). I did not manage to build the Java API docs. I believe the problem is in my local installation. Therefore, I checked each url manually, they should work once the API docs are compiled. I will verify everything one more time in the nightly build.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996

**[Test build #70407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70407/testReport)** for PR 15996 at commit [`28f88ef`](https://github.com/apache/spark/commit/28f88ef7b4796c6d07c80cf7fa942b27103937dd).
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16329

**[Test build #70405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70405/testReport)** for PR 16329 at commit [`2c1f182`](https://github.com/apache/spark/commit/2c1f1829677e74dc2dd2d7d67233b142f14007e8).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * ` public static class Employee implements Serializable `
  * ` public static class MyAverage extends Aggregator `
  * ` case class Employee(name: String, salary: Long)`
  * ` case class Average(var sum: Long, var count: Long)`
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16329

Merged build finished. Test PASSed.
[GitHub] spark issue #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL progra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16329

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70405/

Test PASSed.
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/12775

retest this please
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996

**[Test build #70408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70408/testReport)** for PR 15996 at commit [`97dc307`](https://github.com/apache/spark/commit/97dc3079650e24d8412f093ad4184077ddf37c26).
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12775

**[Test build #70409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70409/testReport)** for PR 12775 at commit [`9778cef`](https://github.com/apache/spark/commit/9778cefce3e152d559e53cd4e2f5a113e561f0ff).
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16296

**[Test build #70410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70410/testReport)** for PR 16296 at commit [`a553366`](https://github.com/apache/spark/commit/a553366e9828c2a68a25023181beb9acbf908aa0).
[GitHub] spark pull request #16352: [SPARK-18947][SQL] SQLContext.tableNames should n...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/16352

[SPARK-18947][SQL] SQLContext.tableNames should not call Catalog.listTables

## What changes were proposed in this pull request?

It's a huge waste to call `Catalog.listTables` in `SQLContext.tableNames`, which only needs the table names, while `Catalog.listTables` will get the table metadata for each table name.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16352.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16352

commit f12dc7924dd3e4847578992d20d36a26d3d02792
Author: Wenchen Fan
Date: 2016-12-20T14:27:37Z

    SQLContext.tableNames should not call Catalog.listTables
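The cost difference the PR describes can be illustrated with a toy catalog. The names below are hypothetical stand-ins, not Spark's actual `Catalog` API:

```scala
// Toy catalog contrasting "names only" with "full metadata per table".
object TableNamesSketch {
  final case class TableMeta(name: String, owner: String, createTime: Long)

  // Counts how many expensive per-table metadata fetches happened.
  var metadataLookups = 0

  private val tables = Map(
    "t1" -> TableMeta("t1", "a", 0L),
    "t2" -> TableMeta("t2", "b", 0L))

  // Expensive path: materializes full metadata for every table,
  // analogous to Catalog.listTables in the PR description.
  def listTables(): Seq[TableMeta] = tables.keys.toSeq.sorted.map { n =>
    metadataLookups += 1
    tables(n)
  }

  // Cheap path: names only, no per-table metadata fetch,
  // which is all SQLContext.tableNames needs.
  def tableNames(): Seq[String] = tables.keys.toSeq.sorted
}
```

With a metastore-backed catalog, each metadata fetch is a remote call, so skipping them matters far more than this in-memory toy suggests.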
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16352

cc @yhuai @gatorsmile
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16352

**[Test build #70411 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70411/testReport)** for PR 16352 at commit [`f12dc79`](https://github.com/apache/spark/commit/f12dc7924dd3e4847578992d20d36a26d3d02792).
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16351

**[Test build #70403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70403/testReport)** for PR 16351 at commit [`e72d1bc`](https://github.com/apache/spark/commit/e72d1bc419dfd6da7f6e298d5b5412dba69eb5ad).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16351

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70403/

Test PASSed.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16351

Merged build finished. Test PASSed.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16351

cc @cloud-fan, could I please ask you to take a look? I remember a similar PR was reviewed by you before.
[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...
Github user jnh5y commented on a diff in the pull request: https://github.com/apache/spark/pull/16329#discussion_r93261268

--- Diff: docs/sql-programming-guide.md ---

```diff
@@ -382,6 +382,52 @@ For example:

+## Aggregations
+
+The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) mentioned
+before provide such common aggregations as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
```
--- End diff --

As a suggestion, I'd change this to read: "The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, and `min()`."
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16352

The same issue also exists in [getTableNames](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L278). Could we also fix it there?
[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...
Github user jnh5y commented on a diff in the pull request: https://github.com/apache/spark/pull/16329#discussion_r93262242

--- Diff: docs/sql-programming-guide.md ---

```diff
@@ -382,6 +382,52 @@ For example:

+## Aggregations
+
+The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) mentioned
+before provide such common aggregations as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
+While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in
+[Scala](api/scala/index.html#org.apache.spark.sql.expressions.scalalang.typed$) and
+[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets.
+Moreover, users are not limited to the predefined aggregate functions and can create their own.
```
--- End diff --

I think it'd be worth showing a Spark SQL example using the included/pre-defined functions. Since your example implements `avg`, maybe use `min`/`max`? Alternatively, the example could be added to the SQL statements in the main driver for the `UserDefinedAggregateFunction` implementations.
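For reference, the typed aggregator the docs example implements follows a zero/reduce/merge/finish shape. A Spark-free sketch of that shape, using the `Employee` and `Average` case classes the docs patch adds (the method layout here is an assumption for illustration, not the docs' exact code):

```scala
// Plain-Scala sketch of the typed "average" aggregator idea from the docs.
object TypedAvgSketch {
  final case class Employee(name: String, salary: Long)
  final case class Average(var sum: Long, var count: Long)

  // Empty accumulator.
  def zero: Average = Average(0L, 0L)

  // Fold one record into the accumulator.
  def reduce(b: Average, e: Employee): Average = {
    b.sum += e.salary
    b.count += 1
    b
  }

  // Combine two partial accumulators (what Spark does across partitions).
  def merge(b1: Average, b2: Average): Average =
    Average(b1.sum + b2.sum, b1.count + b2.count)

  // Turn the accumulator into the final result.
  def finish(b: Average): Double = b.sum.toDouble / b.count
}
```

The merge step is what distinguishes a distributed aggregator from a simple fold: partial results from different partitions must combine associatively.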
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16352

LGTM except the above comment
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user davies commented on the issue: https://github.com/apache/spark/pull/16232 lgtm
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16351 **[Test build #70404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70404/testReport)** for PR 16351 at commit [`22c9a8a`](https://github.com/apache/spark/commit/22c9a8a9bb812eaa557aec09f3cf0ab25e97b3bf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16351 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70404/ Test PASSed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16296 **[Test build #70410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70410/testReport)** for PR 16296 at commit [`a553366`](https://github.com/apache/spark/commit/a553366e9828c2a68a25023181beb9acbf908aa0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] `
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70410/ Test FAILed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Merged build finished. Test FAILed.
[GitHub] spark issue #16351: [SPARK-18943][SQL] Avoid per-record type dispatch in CSV...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16351 Merged build finished. Test PASSed.
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r9328 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- @@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali return Array.empty } -// Pools sub array within given bounds assigning weighted average value to all elements. -def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = { - val poolSubArray = input.slice(start, end + 1) - val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum - val weight = poolSubArray.map(_._3).sum +// Keeps track of the start and end indices of the blocks. blockBounds(start) gives the +// index of the end of the block and blockBounds(end) gives the index of the start of the +// block. Entries that are not the start or end of the block are meaningless. The idea is that --- End diff -- It relies on knowing ahead of time whether `x` is a start or an end index[1]. If it's a start index, `blockBounds(x)` gives the ending index of that block. If it's an end index, `blockBounds(x)` gives the starting index of the block. So yes, the implementations of `blockStart` and `blockEnd` are identical. I just have two different functions because it makes the code more readable. Maybe the comment from scikit-learn (where I borrowed this idea from) explains it better? (their `target` = my `blockBounds`) > target describes a list of blocks. At any time, if [i..j] (inclusive) is > an active block, then blockBounds[i] := j and blockBounds[j] := i. The trick is just in maintaining the array so that the above property is always true. At initialization, it's trivially true because all blocks have only one element, and `blockBounds(x)` = `x`.
After initialization, `blockBounds` is only modified by the merge function, which is set up to modify `blockBounds` so that this property is preserved, and then return the starting index of the newly-merged block. This is admittedly a bit tricky, but it's a lot faster than the implementation where I created a doubly-linked list of `Block` case classes. I'm open to suggestions on how to explain it better. [1] You could actually figure out whether you have a start or an end index by comparing `blockBounds(x)` to `x`. The lesser value will be the start index.
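The invariant neggert describes can be sketched in a few lines outside Spark. This is an illustration only, with assumed names (`block_bounds`, `merge`), not the PR's actual MLlib code:

```python
# Invariant: for every active block [start..end],
# block_bounds[start] == end and block_bounds[end] == start.
# Entries strictly inside a block are never read.

def block_end(block_bounds, start):
    return block_bounds[start]

def block_start(block_bounds, end):
    # Same lookup as block_end; two names only for readability, as in the PR.
    return block_bounds[end]

def merge(block_bounds, start1, start2):
    """Merge the block starting at start1 with the adjacent block starting at
    start2, preserving the invariant, and return the merged block's start."""
    end = block_end(block_bounds, start2)
    block_bounds[start1] = end
    block_bounds[end] = start1
    return start1

# At initialization every element is its own block, so block_bounds[x] == x.
bounds = list(range(5))
s = merge(bounds, 1, 2)   # [1..1] + [2..2] -> [1..2]
assert s == 1 and block_end(bounds, 1) == 2 and block_start(bounds, 2) == 1
merge(bounds, 1, 3)       # [1..2] + [3..3] -> [1..3]
assert block_end(bounds, 1) == 3 and block_start(bounds, 3) == 1
```

Only `merge` ever writes to the array, which is why the property is preserved after initialization.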
[GitHub] spark issue #16240: [SPARK-16792][SQL] Dataset containing a Case Class with ...
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16240 None of them. The compilation will fail. That is why I had to provide those additional implicits. ``` scala> class Test[T] defined class Test scala> implicit def test1[T <: Seq[String]]: Test[T] = null test1: [T <: Seq[String]]=> Test[T] scala> implicit def test2[T <: Product]: Test[T] = null test2: [T <: Product]=> Test[T] scala> def test[T : Test](t: T) = null test: [T](t: T)(implicit evidence$1: Test[T])Null scala> test(List("abc")) <console>:31: error: ambiguous implicit values: both method test1 of type [T <: Seq[String]]=> Test[T] and method test2 of type [T <: Product]=> Test[T] match expected type Test[List[String]] test(List("abc")) ```
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018 @viirya Better to remove them, or throw an error? Personally, I'd rather be alerted that I'm passing invalid input, rather than have it "fixed" for me.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16350 **[Test build #70406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70406/consoleFull)** for PR 16350 at commit [`80b8664`](https://github.com/apache/spark/commit/80b86646e0f1af8fb99d78aaf3f16dc7e752a99d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16350 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70406/ Test PASSed.
[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16350 Merged build finished. Test PASSed.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996 **[Test build #70407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70407/testReport)** for PR 15996 at commit [`28f88ef`](https://github.com/apache/spark/commit/28f88ef7b4796c6d07c80cf7fa942b27103937dd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70407/ Test PASSed.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Merged build finished. Test PASSed.
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12775 **[Test build #70409 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70409/testReport)** for PR 12775 at commit [`9778cef`](https://github.com/apache/spark/commit/9778cefce3e152d559e53cd4e2f5a113e561f0ff). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12775 Merged build finished. Test FAILed.
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12775 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70409/ Test FAILed.
[GitHub] spark pull request #16353: [SPARK-18948][MLlib] Add Mean Percentile Rank met...
GitHub user daniloascione opened a pull request: https://github.com/apache/spark/pull/16353 [SPARK-18948][MLlib] Add Mean Percentile Rank metric for ranking algorithms ## What changes were proposed in this pull request? This PR adds the implementation of the Mean Percentile Rank (MPR) metric in mllib.evaluation, as described in the paper "Collaborative Filtering for Implicit Feedback Datasets" (Hu, Y., Y. Koren, and C. Volinsky, doi:10.1109/ICDM.2008.22). This metric is useful to evaluate recommendations given by the ALS with implicit feedback. ## How was this patch tested? Additional test cases have been added to test Mean Percentile Rank (MPR). You can merge this pull request into a Git repository by running: $ git pull https://github.com/daniloascione/spark SPARK-18948 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16353.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16353 commit ed66bb09eddf776e932b29a7e4889128aa775946 Author: Danilo Ascione Date: 2016-12-20T16:23:28Z [SPARK-18948][MLlib] Add Mean Percentile Rank metric for ranking algorithms
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996 **[Test build #70408 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70408/testReport)** for PR 15996 at commit [`97dc307`](https://github.com/apache/spark/commit/97dc3079650e24d8412f093ad4184077ddf37c26). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70408/ Test PASSed.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Merged build finished. Test PASSed.
[GitHub] spark issue #16353: [SPARK-18948][MLlib] Add Mean Percentile Rank metric for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16353 Can one of the admins verify this patch?
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16352 **[Test build #70411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70411/testReport)** for PR 16352 at commit [`f12dc79`](https://github.com/apache/spark/commit/f12dc7924dd3e4847578992d20d36a26d3d02792). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16352 Merged build finished. Test PASSed.
[GitHub] spark issue #16352: [SPARK-18947][SQL] SQLContext.tableNames should not call...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16352 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70411/ Test PASSed.
[GitHub] spark issue #16353: [SPARK-18948][MLlib] Add Mean Percentile Rank metric for...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16353 This is pretty specific to ALS, and relies on the r_ui strength value in the paper. I'm not sure it is that general. Without this weight, it's somewhat related to simple existing metrics like mean reciprocal rank.
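For readers unfamiliar with the metric being discussed, the following is an illustrative sketch (not the PR's implementation) of the Mean Percentile Rank from Hu, Koren & Volinsky: MPR = sum(r_ui * rank_ui) / sum(r_ui), where rank_ui is the percentile position (0.0 = top of the list, 1.0 = bottom) of item i in user u's ranked recommendations, and r_ui is the observed implicit-feedback strength srowen refers to. All names here are assumptions for illustration:

```python
def mean_percentile_rank(users):
    """users: list of (ranked_items, observed) pairs, where observed is a
    list of (item, strength) tuples with strength = r_ui from the paper."""
    num = 0.0
    den = 0.0
    for ranked, observed in users:
        n = len(ranked)
        for item, r in observed:
            if item in ranked:
                # Percentile rank in [0, 1]: 0.0 for the top recommendation.
                rank = ranked.index(item) / max(n - 1, 1)
                num += r * rank
                den += r
    return num / den if den else 0.0

# One user, items ranked a, b, c; equal-strength feedback on the top item
# (rank 0.0) and the bottom item (rank 1.0) averages to 0.5.
print(mean_percentile_rank([(["a", "b", "c"], [("a", 1.0), ("c", 1.0)])]))  # 0.5
```

Lower values are better; dropping the r_ui weight (setting all strengths equal) reduces this to a plain average of percentile positions, which is the sense in which it resembles simpler rank metrics.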
[GitHub] spark pull request #16354: [SPARK-18886][Scheduler][WIP] Adjust Delay schedu...
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/16354 [SPARK-18886][Scheduler][WIP] Adjust Delay scheduling to prevent under-utilization of cluster ## What changes were proposed in this pull request? This is a significant change to delay scheduling to avoid under-utilization of cluster resources when there are locality preferences for a subset of resources. The main change here is that the delay is no longer reset when any task is scheduled at a tighter locality constraint. Instead, each task set starts the locality timer the first time it fails to utilize a resource offer due to locality constraints. One task set *never* tightens the locality constraints, even if subsequent offers are made that utilize tighter constraints. A more complete description of the issues w/ the previous scheduling method can be found under the jira. ## How was this patch tested? Added unit test for original issue. Ran all unit tests in o.a.s.scheduler.* manually. Full tests via jenkins. TODO * [ ] add more unit tests, especially for recompute locality levels. 
* [ ] code cleanup (especially all the logging added) You can merge this pull request into a Git repository by running: $ git pull https://github.com/squito/spark delay_sched-SPARK-18886 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16354.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16354 commit 8629823dcb61f67207ee5b6a6a1789a4c38e898f Author: Imran Rashid Date: 2016-12-15T17:28:48Z failing test case commit 348a9f44a6f34e6ac15f4ece70b0178d134d0cc3 Author: Imran Rashid Date: 2016-12-20T03:35:55Z "working" version -- but this is actually a significant departure from old delay scheduling commit 8b7fd1adf510ef15a7c30aebdd4f029e71e2e50f Author: Imran Rashid Date: 2016-12-20T03:57:44Z test update commit 22086999a9644086d6787fc4db5b2367e6ba70fe Author: Imran Rashid Date: 2016-12-20T04:18:40Z fix condition commit af88dd8f8942b12edda2a466b292dd3bccdfbc4e Author: Imran Rashid Date: 2016-12-20T04:19:05Z update tests to reflect change in delay scheduling behavior commit 27983a9a6d3f7d675bbfa83eb116e8329869aed7 Author: Imran Rashid Date: 2016-12-20T04:19:19Z logging commit 647bf400a0963ff8f5381e47f895e3cc606aa854 Author: Imran Rashid Date: 2016-12-20T17:13:59Z fix other test cases, more fixes to recomputeLocality() commit 2e5307f971767c0d5a228e3058db31244351da2d Author: Imran Rashid Date: 2016-12-20T17:47:02Z Merge branch 'master' into delay_sched-SPARK-18886 commit 449ba20c9c642884f5dcc5feccfa64cb1da833f2 Author: Imran Rashid Date: 2016-12-20T17:47:11Z remove TODO
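The behavioral change the PR describes can be sketched as a tiny state machine. The class and method names below are hypothetical, not the actual TaskSetManager code: the locality-wait timer starts the first time an offer is rejected for locality reasons, and scheduling a task at the preferred level no longer resets it.

```python
class DelaySchedulingSketch:
    def __init__(self, locality_wait_ms, clock):
        self.locality_wait_ms = locality_wait_ms
        self.clock = clock            # injectable clock for testing
        self.timer_start = None

    def offer_rejected_for_locality(self):
        # Start the timer only once, on the first rejected offer.
        if self.timer_start is None:
            self.timer_start = self.clock()

    def task_scheduled_at_preferred_locality(self):
        # The old behavior would reset the timer here; the new one does not.
        pass

    def may_relax_locality(self):
        return (self.timer_start is not None
                and self.clock() - self.timer_start >= self.locality_wait_ms)

now = [0]
s = DelaySchedulingSketch(3000, lambda: now[0])
assert not s.may_relax_locality()          # timer not started yet
s.offer_rejected_for_locality()            # first rejection starts the timer
now[0] = 2000
s.task_scheduled_at_preferred_locality()   # does NOT reset the timer
assert not s.may_relax_locality()          # still within the wait
now[0] = 3000
assert s.may_relax_locality()              # wait elapsed -> relax locality
```

Under the old reset-on-schedule behavior, a steady trickle of local tasks could keep the timer from ever expiring, leaving the rest of the cluster idle; never resetting is what avoids that under-utilization.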
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/16337 I have tested a few runs on `SQLQueryTestSuite` to confirm it allows sub-directories under `sql/core/src/test/resources/sql-tests/[inputs|results]` to group test files further. By reading the code, I'm pretty sure it supports multi-level sub-directories. The only requirement is that the name of the file needs to be globally unique under inputs/. With that knowledge, I propose we have this structure under the directory `sql/core/src/test/resources/sql-tests/inputs`: ``` subquery/ subquery/in-subquery/ subquery/in-subquery/simple-in.sql subquery/in-subquery/simple-not-in.sql subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both) subquery/in-subquery/not-in-group-by.sql subquery/in-subquery/in-order-by.sql subquery/in-subquery/in-limit.sql subquery/in-subquery/in-having.sql subquery/in-subquery/in-joins.sql subquery/in-subquery/not-in-joins.sql subquery/in-subquery/in-set-operations.sql subquery/in-subquery/in-with-cte.sql subquery/in-subquery/not-in-with-cte.sql subquery/in-subquery/in-multiple-columns.sql … subquery/exists-subquery/ subquery/scalar-subquery/ ``` Each test file will contain approximately 10-20 test cases. Some of the test cases can be classified in multiple ways. In that case, the tester will use his/her best judgement to classify them, or maybe we can entertain the idea of "complex" in the file name (I personally don't like it, as such terms are subjective: to someone, a SQL statement joining a few tables is complex; others may see it as not.) We can run a single test file in `.../inputs` or any sub-directory under it with the following command: `build/sbt "~sql/test-only *SQLQueryTestSuite -- -z .sql"` The downside is that it needs to be the exact name. It does not allow wildcard characters at this point. Perhaps this is something we can enhance in the future. Lastly, we will break up the in-subquery test cases into multiple PRs.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93289668 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine } /** +* Tweedie exponential family distribution. +* The default link for the Tweedie family is the log link. +*/ + private[regression] object Tweedie extends Family("tweedie") { + +val defaultLink: Link = Log + +var variancePower: Double = 1.5 --- End diff -- Would you please suggest a better way to set the variancePower? I want to be consistent with the existing code to have the `Family` objects, but I need to also pass on the input `variancePower` to the `Tweedie` object which is used to compute the variance function. Any suggestion will be highly appreciated. @srowen @yanboliang
[GitHub] spark issue #16354: [SPARK-18886][Scheduler][WIP] Adjust Delay scheduling to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16354 **[Test build #70412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70412/testReport)** for PR 16354 at commit [`449ba20`](https://github.com/apache/spark/commit/449ba20c9c642884f5dcc5feccfa64cb1da833f2).
[GitHub] spark issue #16354: [SPARK-18886][Scheduler][WIP] Adjust Delay scheduling to...
Github user squito commented on the issue: https://github.com/apache/spark/pull/16354 @mridulm @markhamstra @kayousterhout This is *not* ready to merge -- it needs some cleanup and more tests -- but I thought that seeing an implementation might help think through the design. I think the discussion should still center on the overall approach, and that discussion should probably happen on the jira, not here. (I'll happily fix code issues if that would help.) Obviously, this is a pretty big change to the way delay scheduling works; I'd gladly consider alternative approaches that don't involve such a large change in behavior, but I don't see them. IMO this problem is serious enough that it merits the large behavior change.
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93290858
```
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
 }

 /**
+ * Tweedie exponential family distribution.
+ * The default link for the Tweedie family is the log link.
+ */
+ private[regression] object Tweedie extends Family("tweedie") {
+
+   val defaultLink: Link = Log
+
+   var variancePower: Double = 1.5
+
+   override def initialize(y: Double, weight: Double): Double = {
+     if (variancePower > 1.0 && variancePower < 2.0) {
+       require(y >= 0.0, "The response variable of the specified Tweedie distribution " +
+         s"should be non-negative, but got $y")
+       math.max(y, 0.1)
```
--- End diff -- I have not seen a formal justification for the choice of 0.1 in R. This seminal [paper](http://users.du.se/~lrn/StatMod10/HomeExercise2/Nelder_Pregibon.pdf) suggests 1/6 (about 0.17) to be the best constant. I would prefer to be consistent with R so that we can make comparisons. Using a constant is a good idea.
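[Editorial sketch, not part of the original thread.] The initialization rule being discussed can be sketched as a standalone function (hypothetical names; the `floor` parameter stands in for the constant under debate, 0.1 in R versus the 1/6 suggested by Nelder and Pregibon): for a variance power strictly between 1 and 2, the response must be non-negative, and zeros are lifted to a small positive constant so the log-link mean is well defined.

```scala
// Hypothetical standalone version of the initialization logic under review.
// `floor` is the small positive constant that replaces zero responses.
def initializeTweedie(y: Double, variancePower: Double, floor: Double = 0.1): Double = {
  if (variancePower > 1.0 && variancePower < 2.0) {
    require(y >= 0.0,
      s"The response variable of the specified Tweedie distribution " +
      s"should be non-negative, but got $y")
    math.max(y, floor)  // lift zeros so log(mu) is defined
  } else {
    y  // other variance powers are passed through unchanged in this sketch
  }
}
```

Swapping `floor = 0.1` for `1.0 / 6.0` is then a one-argument change, which makes it easy to compare against R.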
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93291042
```
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -592,6 +629,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
 }

 /**
+ * Tweedie exponential family distribution.
+ * The default link for the Tweedie family is the log link.
+ */
+ private[regression] object Tweedie extends Family("tweedie") {
+
+   val defaultLink: Link = Log
+
+   var variancePower: Double = 1.5
```
--- End diff -- I think the Tweedie implementation needs to be able to access parameters of the GLM, to read off `variancePower`. As it is, this is a global variable, and two jobs would overwrite each other's values.
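[Editorial sketch, not part of the original thread.] One way to avoid the shared mutable `var` that this comment flags is to make `Tweedie` a class whose variance power is fixed per instance. The sketch below uses simplified stand-ins for the `Family` hierarchy in `GeneralizedLinearRegression.scala` (the real traits have more members); the point is only that two jobs with different powers each hold their own value.

```scala
// Simplified stand-in for the GLM Family hierarchy (hypothetical, for
// illustration): each instance carries its own parameters.
abstract class Family(val name: String) {
  def variance(mu: Double): Double
}

// variancePower is a constructor parameter instead of a global var, so
// concurrent jobs cannot overwrite each other's values.
case class Tweedie(variancePower: Double) extends Family("tweedie") {
  // Tweedie variance function: V(mu) = mu^p
  override def variance(mu: Double): Double = math.pow(mu, variancePower)
}
```

With this shape, `Family.fromName` (or the caller) would construct `Tweedie(power)` once per model fit rather than mutating shared state.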
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16344#discussion_r93290854
```
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -242,7 +275,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
 def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value)

 override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = {
-  val familyObj = Family.fromName($(family))
+  val familyObj = Family.fromName($(family), $(variancePower))
```
--- End diff -- I don't think we can do this either. `variancePower` is specific to one family, not a property of all of them.