[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16981 **[Test build #73815 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73815/testReport)** for PR 16981 at commit [`ddc06cf`](https://github.com/apache/spark/commit/ddc06cf46b3b2730dc5ec8f49e12225c60d05b7c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16910: [SPARK-19575][SQL]Reading from or writing to a hive serd...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16910 **[Test build #73829 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73829/testReport)** for PR 16910 at commit [`15c0a77`](https://github.com/apache/spark/commit/15c0a77714eb4ed5221f47d54ed31fcc10a95303).
[GitHub] spark pull request #17081: [SPARK-18726][SQL]resolveRelation for FileFormat ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17081
[GitHub] spark issue #17081: [SPARK-18726][SQL]resolveRelation for FileFormat DataSou...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17081 thanks, merging to master!
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73822/ Test PASSed.
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17096 **[Test build #73822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73822/testReport)** for PR 17096 at commit [`cd235a7`](https://github.com/apache/spark/commit/cd235a7f641da8a350b8ace0e4c0691ccac189f2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17096 Merged build finished. Test PASSed.
[GitHub] spark issue #17147: [Minor][Doc] Fix doc for web UI https configuration
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17147 Merged build finished. Test PASSed.
[GitHub] spark issue #17147: [Minor][Doc] Fix doc for web UI https configuration
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17147 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73826/ Test PASSed.
[GitHub] spark issue #17147: [Minor][Doc] Fix doc for web UI https configuration
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17147 **[Test build #73826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73826/testReport)** for PR 17147 at commit [`22aa879`](https://github.com/apache/spark/commit/22aa879bd1ec8f51fbb2af62cc62ce71662542f3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16944 **[Test build #73828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73828/testReport)** for PR 16944 at commit [`95af481`](https://github.com/apache/spark/commit/95af4810b9c85b2b8680d7791cf298ed147e33c6).
[GitHub] spark issue #17001: [SPARK-19667][SQL]create table with hiveenabled in defau...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17001 **[Test build #73827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73827/testReport)** for PR 17001 at commit [`e3a467e`](https://github.com/apache/spark/commit/e3a467e52b73dc1f67fb2b669d551a7b9bb904b6).
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 @holdenk and @viirya, I got rid of the changes in `types.py` and left only the ones I am pretty sure about. There are two kinds of changes here, both of which look to be used only in a local scope. One is used with `getattr`, which I guess is fine, as below:

```python
>>> getattr("a", u"__str__")
>>> getattr("a", "__str__")
```

The other is used for setting a parameter on the JVM, which already seems to be done much more widely in the code base.
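The `getattr` check quoted above can be run as a small script. This is only a sketch, assuming a Python 3 interpreter, where the `u` prefix is accepted and produces an ordinary `str`; the variable names are illustrative and do not come from the pull request.

```python
# Sketch only: assumes Python 3, where u"..." is an ordinary str literal.
# The variable names are illustrative and are not taken from the PR.
target = "a"

# Look up the same attribute with a plain and a u-prefixed name.
plain = getattr(target, "__str__")
prefixed = getattr(target, u"__str__")

# Both lookups resolve the same bound method, so both calls return "a".
same_result = plain() == prefixed() == "a"
print(same_result)
```

Both spellings of the attribute name resolve to the same method here, which matches the reasoning in the comment that the `getattr` change is safe.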
[GitHub] spark issue #17136: [SPARK-19783][SQL] Treat shorter/longer lengths of token...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17136 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73816/ Test FAILed.
[GitHub] spark issue #17136: [SPARK-19783][SQL] Treat shorter/longer lengths of token...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17136 Merged build finished. Test FAILed.
[GitHub] spark issue #17136: [SPARK-19783][SQL] Treat shorter/longer lengths of token...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17136 **[Test build #73816 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73816/testReport)** for PR 17136 at commit [`5a01a9d`](https://github.com/apache/spark/commit/5a01a9dcbe1fb922a7e240fdee3bd4b7fa4e471a). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17122: [SPARK-19786][SQL] Facilitate loop optimizations in a JI...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17122 Merged build finished. Test PASSed.
[GitHub] spark issue #17122: [SPARK-19786][SQL] Facilitate loop optimizations in a JI...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17122 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73813/ Test PASSed.
[GitHub] spark issue #17147: [Minor][Doc] Fix doc for web UI https configuration
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17147 **[Test build #73826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73826/testReport)** for PR 17147 at commit [`22aa879`](https://github.com/apache/spark/commit/22aa879bd1ec8f51fbb2af62cc62ce71662542f3).
[GitHub] spark issue #17122: [SPARK-19786][SQL] Facilitate loop optimizations in a JI...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17122 **[Test build #73813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73813/testReport)** for PR 17122 at commit [`7f095c0`](https://github.com/apache/spark/commit/7f095c0bdae1ff15859bec399fdd705bff379be0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user budde commented on the issue: https://github.com/apache/spark/pull/16944 Retest this please
[GitHub] spark pull request #17001: [SPARK-19667][SQL]create table with hiveenabled i...
Github user windpiger commented on a diff in the pull request: https://github.com/apache/spark/pull/17001#discussion_r104101806

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala ---
    @@ -905,3 +934,91 @@ object SPARK_18989_DESC_TABLE {
         }
       }
     }
    +
    +object SPARK_19667_CREATE_TABLE {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    +    try {
    +      val warehousePath = s"file:${spark.sharedState.warehousePath.stripSuffix("/")}"
    +      val defaultDB = spark.sessionState.catalog.getDatabaseMetadata("default")
    +      // default database use warehouse path as its location
    +      assert(defaultDB.locationUri.stripSuffix("/") == warehousePath)
    +      spark.sql("CREATE TABLE t(a string)")
    +
    +      val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
    +      // table in default database use the location of default database which is also warehouse path
    +      assert(table.location.stripSuffix("/") == s"$warehousePath/t")
    +      spark.sql("INSERT INTO TABLE t SELECT 1")
    +      assert(spark.sql("SELECT * FROM t").count == 1)
    +
    +      spark.sql("CREATE DATABASE not_default")
    +      spark.sql("USE not_default")
    +      spark.sql("CREATE TABLE t1(b string)")
    +      val table1 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t1"))
    +      // table in not default database use the location of its own database
    +      assert(table1.location.stripSuffix("/") == s"$warehousePath/not_default.db/t1")
    +    } finally {
    +      spark.sql("USE default")
    +    }
    +  }
    +}
    +
    +object SPARK_19667_VERIFY_TABLE_PATH {
    +  def main(args: Array[String]): Unit = {
    +    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    +    try {
    +      val warehousePath = s"file:${spark.sharedState.warehousePath.stripSuffix("/")}"
    --- End diff --

I am making this change.
[GitHub] spark pull request #17147: [Minor][Doc] Fix doc for web UI https configurati...
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/17147

[Minor][Doc] Fix doc for web UI https configuration

## What changes were proposed in this pull request?

The doc about enabling web UI https is not correct: "spark.ui.https.enabled" does not exist; enabling SSL is enough for https.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark fix-doc-ssl

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17147.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17147

commit 22aa879bd1ec8f51fbb2af62cc62ce71662542f3
Author: jerryshao
Date: 2017-03-03T07:30:41Z

    Fix doc for https web ui

    Change-Id: I77e0e0806a94e50e366d199c9a9d98739ed326c7
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17145 **[Test build #73825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73825/testReport)** for PR 17145 at commit [`f5a35f6`](https://github.com/apache/spark/commit/f5a35f6bc3ec032f429137676b99b888ae326acc).
[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16696#discussion_r104101253 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala --- @@ -116,22 +116,22 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared withTempView("test") { --- End diff -- is this test duplicated with the newly added limit test?
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query type dos...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17145 unrelated failure: ` org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false`. retest this please.
[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16696#discussion_r104101031

    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.statsEstimation
    +
    +import org.apache.spark.sql.catalyst.CatalystConf
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Literal}
    +import org.apache.spark.sql.catalyst.plans.logical._
    +import org.apache.spark.sql.types.IntegerType
    +
    +
    +class StatsEstimationSuite extends StatsEstimationTestBase {
    --- End diff --

`BasicStatsEstimationSuite`?
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17094 **[Test build #73823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73823/testReport)** for PR 17094 at commit [`d7dceeb`](https://github.com/apache/spark/commit/d7dceebb5fecc22c74a4ba2a334ab8ca492a518b).
[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16696 **[Test build #73824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73824/testReport)** for PR 16696 at commit [`5692939`](https://github.com/apache/spark/commit/56929391719053e72791abe127b10a3316b51141).
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17096 **[Test build #73822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73822/testReport)** for PR 17096 at commit [`cd235a7`](https://github.com/apache/spark/commit/cd235a7f641da8a350b8ace0e4c0691ccac189f2).
[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16696#discussion_r104100931

    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala ---
    @@ -1,64 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements. See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License. You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.catalyst.statsEstimation
    -
    -import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    -import org.apache.spark.sql.types.IntegerType
    -
    -
    -class StatsConfSuite extends StatsEstimationTestBase {
    --- End diff --

why remove this test suite?
[GitHub] spark issue #17135: SPARK-19794 Release HDFS Client after read/write checkpo...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/17135 As I remember, FileSystem instances are cached internally by default. Closing it will probably introduce some performance regression.
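The caching concern raised above can be illustrated with a toy model. This is a minimal sketch and not the Hadoop API: `ToyFileSystem`, `ToyCache`, and `run` are hypothetical names, and the only behaviour assumed is that a cache hands back one shared instance per URI and has to build a new one after that instance is closed.

```python
# Toy model only: these classes are hypothetical and do not mirror Hadoop's API.
class ToyFileSystem:
    """Stand-in for a client object that is expensive to create."""

    def __init__(self, uri):
        self.uri = uri
        self.closed = False

    def close(self):
        self.closed = True


class ToyCache:
    """Returns one shared instance per URI and counts how many it builds."""

    def __init__(self):
        self._instances = {}
        self.creations = 0

    def get(self, uri):
        fs = self._instances.get(uri)
        if fs is None or fs.closed:
            # A missing or closed instance forces a fresh (costly) creation.
            fs = ToyFileSystem(uri)
            self._instances[uri] = fs
            self.creations += 1
        return fs


def run(cache, uri, operations, close_each_time):
    """Fetch the client `operations` times, optionally closing after each use."""
    for _ in range(operations):
        fs = cache.get(uri)
        if close_each_time:
            fs.close()
    return cache.creations


reuse_count = run(ToyCache(), "hdfs://example/checkpoint", 5, close_each_time=False)
close_count = run(ToyCache(), "hdfs://example/checkpoint", 5, close_each_time=True)
print(reuse_count, close_count)  # 1 5
```

In this model, reusing the cached instance builds the client once, while closing after every operation rebuilds it each time, which is the kind of overhead the comment warns about.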
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17094 Removed WIP, think it's ready now :)
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query type dos...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17145 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73817/ Test FAILed.
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query type dos...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17145 Merged build finished. Test FAILed.
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query type dos...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17145 **[Test build #73817 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73817/testReport)** for PR 17145 at commit [`f5a35f6`](https://github.com/apache/spark/commit/f5a35f6bc3ec032f429137676b99b888ae326acc). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16696 retest this please
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17094 **[Test build #73821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73821/testReport)** for PR 17094 at commit [`76eda69`](https://github.com/apache/spark/commit/76eda69de903f2d9aae4ce17a1b9555b0403588d).
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17094 **[Test build #73820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73820/testReport)** for PR 17094 at commit [`46630d1`](https://github.com/apache/spark/commit/46630d1bb928ae0dea056e78afd02d76bc0da6af).
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17094 **[Test build #73819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73819/testReport)** for PR 17094 at commit [`f7e9169`](https://github.com/apache/spark/commit/f7e91699ac2af2a0baed3bc7fe0befafa21a862f).
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/15505 The code in [SPARK-18890_20170303](https://github.com/witgo/spark/commits/SPARK-18890_20170303) is older, but its test case running time is 5.2 s.
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 Let me check if each is fine for sure.
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 @viirya, thank you so much for taking a look and for your time. So, basically, the second case compares str to unicode as below:

```python
>>> u"測試" == u"測試".encode("utf-8")
False
```

Apparently, it seems we could pass unicode as is? Let me raise another issue for this after testing and looking into it. Actually, the support in `StructType.add` does not seem to be the problem specified in the JIRA.
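The comparison in the comment above can be restated in Python 3 terms, where `str` plays the role of Python 2 `unicode` and `bytes` the role of Python 2 `str`. This is a minimal sketch: the helper `to_text` is a hypothetical name introduced only for illustration and is not part of PySpark.

```python
def to_text(name, encoding="utf-8"):
    """Return `name` as text, decoding it first if it is a byte string."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


field = "測試"
encoded = field.encode("utf-8")

# Text never compares equal to its own encoded bytes.
mismatch = (field == encoded)

# After normalizing both sides to text, the comparison succeeds.
match = (to_text(field) == to_text(encoded))
```

Normalizing to text on both sides is one possible way to make such a comparison meaningful; whether passing unicode through unchanged is the right fix for the PR is the open question raised in the comment.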
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r104098256 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo * @param condition the compound logical expression * @param update a boolean flag to specify if we need to update ColumnStat of a column * for subsequent conditions - * @return a double value to show the percentage of rows meeting a given condition. + * @return an optional double value to show the percentage of rows meeting a given condition. * It returns None if the condition is not supported. */ def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = { - condition match { case And(cond1, cond2) => -(calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update)) -match { +// For ease of debugging, we compute percent1 and percent2 in 2 statements. +val percent1 = calculateFilterSelectivity(cond1, update) +val percent2 = calculateFilterSelectivity(cond2, update) +(percent1, percent2) match { case (Some(p1), Some(p2)) => Some(p1 * p2) case (Some(p1), None) => Some(p1) --- End diff -- @cloud-fan @ron8hu I'm a little confused about this. For a `Not` expression, an over-estimation of the inner condition always becomes an under-estimation, whether or not the `Not` is nested. So should we remove support for `nested Not` or `Not`?
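The over-estimation question in the comment above can be made concrete with a small numeric sketch. It mirrors the two `And` cases visible in the diff, `(Some(p1), Some(p2))` and `(Some(p1), None)`. The remaining two cases, the `1 - p` rule for `Not`, and the example selectivities are assumptions made for illustration, with `None` standing in for an unsupported condition.

```python
from typing import Optional


def and_selectivity(p1: Optional[float], p2: Optional[float]) -> Optional[float]:
    """Combine two optional selectivities for `cond1 AND cond2`."""
    if p1 is not None and p2 is not None:
        return p1 * p2          # both supported: assume independence
    if p1 is not None:
        return p1               # cond2 unsupported: fall back to p1 (over-estimate)
    if p2 is not None:
        return p2               # assumed symmetric case, not shown in the diff
    return None                 # assumed: nothing is supported


def not_selectivity(p: Optional[float]) -> Optional[float]:
    """Assumed rule for `NOT cond`: complement of the inner selectivity."""
    return None if p is None else 1.0 - p


# Suppose cond1 truly keeps 50% of rows and cond2 truly keeps 40%,
# but cond2 is unsupported by the estimator.
true_and = 0.5 * 0.4
estimated_and = and_selectivity(0.5, None)    # falls back to 0.5

true_not = 1.0 - true_and
estimated_not = not_selectivity(estimated_and)
```

Here the estimate for the `And` is too high (0.5 against a true 0.2), so the estimate for the enclosing `Not` comes out too low (0.5 against a true 0.8), which is the inversion the comment describes.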
[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16981 **[Test build #73818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73818/testReport)** for PR 16981 at commit [`4efae36`](https://github.com/apache/spark/commit/4efae36533895d47e0ced19be23adf4579eb285d).
[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16981 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73809/ Test PASSed.
[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16981 Merged build finished. Test PASSed.
[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16981 **[Test build #73809 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73809/testReport)** for PR 16981 at commit [`0d087b0`](https://github.com/apache/spark/commit/0d087b0f66571759ae7ea802c41ac0047d154e3c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17074: [SPARK-18646][REPL] Set parent classloader as null for E...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17074 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73805/ Test FAILed.
[GitHub] spark issue #17074: [SPARK-18646][REPL] Set parent classloader as null for E...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17074 Merged build finished. Test FAILed.
[GitHub] spark pull request #14789: [SPARK-17209][YARN] Add the ability to manually u...
Github user jerryshao closed the pull request at: https://github.com/apache/spark/pull/14789
[GitHub] spark pull request #17095: [SPARK-19763][SQL]qualified external datasource t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17095#discussion_r104095925 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -1843,10 +1843,12 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach { |OPTIONS(path "$dir") """.stripMargin) val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t")) -assert(table.location == dir.getAbsolutePath) +val dirPath = new Path(dir.getAbsolutePath) +val fs = dirPath.getFileSystem(spark.sessionState.newHadoopConf()) --- End diff -- Can you create a helper function to avoid the duplicated code?
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/14731 @srowen Waiting for your final OK
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104084997 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -253,7 +255,18 @@ object KMeansModel extends MLReadable[KMeansModel] { @Since("1.5.0") class KMeans @Since("1.5.0") ( @Since("1.5.0") override val uid: String) - extends Estimator[KMeansModel] with KMeansParams with DefaultParamsWritable { + extends Estimator[KMeansModel] +with KMeansParams with HasInitialModel[KMeansModel] with MLWritable { + + /** + * A KMeansModel to use for warm start. + * Note the cluster count of initial model must be equal with [[k]], + * otherwise, throws IllegalArgumentException. + * @group param + */ + @Since("2.2.0") + final val initialModel: Param[KMeansModel] = --- End diff -- I prefer doing this the same way that ALS does it, by having separate param traits: `KMeansParams extends KMeansModelParams with HasInitialModel`. It's more explicit, since now our `KMeans` class would have extra params on top of `KMeansParams`.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104084877 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -123,7 +126,8 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) +val tmpParent: MLlibKMeansModel = parentModel --- End diff -- Can we change it to `localParent`? That's the convention we have taken elsewhere when we want to get a separate pointer to a class member.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104095197 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -182,6 +224,7 @@ object KMeansSuite { "predictionCol" -> "myPrediction", "k" -> 3, "maxIter" -> 2, -"tol" -> 0.01 +"tol" -> 0.01, +"initialModel" -> generateRandomKMeansModel(3, 3) --- End diff -- It would be nicer to change `testEstimatorAndModelReadWrite` to accept `estimatorTestParams` and `modelTestParams` separately so we don't have to hard code certain params to be filtered out inside that method. Though we wouldn't have to do that in this PR.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104091867 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -418,6 +418,8 @@ object KMeans { val RANDOM = "random" @Since("0.8.0") val K_MEANS_PARALLEL = "k-means||" + @Since("2.2.0") + val K_MEANS_INITIAL_MODEL = "initialModel" --- End diff -- It can be private I think. That, or we should update the valid options for the `setInitializationMode` doc. But I think it's best to make it private.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104092158 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -22,22 +22,28 @@ import scala.util.Random import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.ParamMap -import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} -import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans} +import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLTestingUtils} +import org.apache.spark.ml.util.TestingUtils._ +import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel => MLlibKMeansModel} +import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors} import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} private[clustering] case class TestRow(features: Vector) class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + import testImplicits._ + final val k = 5 @transient var dataset: Dataset[_] = _ + @transient var rData: Dataset[_] = _ override def beforeAll(): Unit = { super.beforeAll() dataset = KMeansSuite.generateKMeansData(spark, 50, 3, k) +rData = GaussianMixtureSuite.rData.map(GaussianMixtureSuite.FeatureData).toDF() --- End diff -- `GaussianMixtureSuite.rData.map(Tuple1.apply).toDF()` ? Mapping the dummy case class from another test suite is less clear.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104090529 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") override def transformSchema(schema: StructType): StructType = { +if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) { + if (isSet(initialModel)) { +val initialModelK = $(initialModel).parentModel.k +if (initialModelK != $(k)) { + throw new IllegalArgumentException("The initial model's cluster count = " + +s"$initialModelK, mismatched with k = $k.") +} + } else { +throw new IllegalArgumentException("Users must set param initialModel if you choose " + + "'initialModel' as the initialization algorithm.") + } +} else { + if (isSet(initialModel)) { +logWarning(s"Param initialModel will take no effect when initMode is $initMode.") + } +} validateAndTransformSchema(schema) } + + @Since("2.2.0") + override def write: MLWriter = new KMeans.KMeansWriter(this) } @Since("1.6.0") -object KMeans extends DefaultParamsReadable[KMeans] { +object KMeans extends MLReadable[KMeans] { @Since("1.6.0") override def load(path: String): KMeans = super.load(path) + + @Since("2.2.0") + override def read: MLReader[KMeans] = new KMeansReader + + /** [[MLWriter]] instance for [[KMeans]] */ + private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter { + +override protected def saveImpl(path: String): Unit = { + DefaultParamsWriter.saveInitialModel(instance, path) + DefaultParamsWriter.saveMetadata(instance, path, sc) +} + } + + private class KMeansReader extends MLReader[KMeans] { + +override def load(path: String): KMeans = { + val metadata = DefaultParamsReader.loadMetadata(path, sc, classOf[KMeans].getName) + val instance = new KMeans(metadata.uid) + + DefaultParamsReader.getAndSetParams(instance, metadata) + DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match { --- End diff -- This can be done 
as: `DefaultParamsReader.loadInitialModel[KMeansModel](path, sc).foreach(instance.setInitialModel)` I think it's nicer, but I'm not sure if there is a universal preference for side effects with options in Spark, so I'll leave it to you to decide.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104094526 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -111,12 +113,20 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => val estimator2 = testDefaultReadWrite(estimator) testParams.foreach { case (p, v) => val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) + if (param.name == "initialModel") { +// Estimator's `initialModel` has same type as the model produced by this estimator. --- End diff -- This is an assumption, and is not enforced by the compiler. There is nothing in the trait `HasInitialModel[T <: Model[T]]` that prevents us from creating an estimator with an `initialModel` type that differs from the type of the model that the estimator produces. We can discuss whether or not we'd like to enforce this assumption, but if we do not then this method should probably be changed.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104090273 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") override def transformSchema(schema: StructType): StructType = { +if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) { --- End diff -- It might be nice to factor this logic out into a method like `assertInitialModelValid` or something similar. Actually, we could add an abstract method to the `HasInitialModel` trait that each subclass can implement differently.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104092773 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR val kmeans = new KMeans() testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) } + + test("training with initial model") { +val kmeans = new KMeans().setK(2).setSeed(1) +val model1 = kmeans.fit(rData) +val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData) +model2.clusterCenters.zip(model1.clusterCenters) + .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) } + } + + test("training with initial model, error cases") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +// Sets initMode with 'initialModel', but does not specify initial model. +intercept[IllegalArgumentException] { --- End diff -- I'm not sure I agree with the behavior. We discussed it quite a bit in the other PR - maybe you can summarize the reason you went away from the previous decisions? At any rate, it seems currently we have the following behavior:

| k | initMode | initialModel | result |
| --- | --- | --- | --- |
| ? | not set | set | ignore initialModel |
| ? | set | not set | error |
| set (k != initialModelK) | set | set | error |
| set (k == initialModelK) | set | set | use initialModel |

If we keep this behavior, we should add a test for the first case.
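The four cases listed in the review comment above can be sketched as a single validation function. This is an illustrative sketch only, not the Spark implementation: `check_initial_model` and its parameters are hypothetical names, "initMode set" is read as `init_mode == "initialModel"`, and `initial_model_k=None` stands for an unset `initialModel`.

```python
import warnings

INITIAL_MODEL = "initialModel"  # assumed name of the initialization mode


def check_initial_model(k, init_mode, initial_model_k=None):
    """Return True if the initial model should be used, False if it is not used.

    Raises ValueError for the two error rows of the table.
    """
    if init_mode == INITIAL_MODEL:
        if initial_model_k is None:
            # initMode set, initialModel not set -> error
            raise ValueError("initialModel must be set when initMode is 'initialModel'")
        if initial_model_k != k:
            # k != initialModelK -> error
            raise ValueError(
                f"initial model has {initial_model_k} clusters, mismatched with k = {k}"
            )
        # k == initialModelK -> use initialModel
        return True
    if initial_model_k is not None:
        # initMode not set, initialModel set -> ignore it, but warn
        warnings.warn(f"initialModel has no effect when initMode is {init_mode!r}")
    return False
```

Writing the table as one function makes the untested first row easy to spot: it is the only branch that neither raises nor uses the model.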
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/15505 Yes, maybe multithreaded task serialization code can have better performance. Let me close the PR.
[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...
Github user witgo closed the pull request at: https://github.com/apache/spark/pull/15505
[GitHub] spark issue #17133: [SPARK-19793] Use clock.getTimeMillis when mark task as ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17133 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73807/ Test FAILed.
[GitHub] spark issue #17133: [SPARK-19793] Use clock.getTimeMillis when mark task as ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17133 Merged build finished. Test FAILed.
[GitHub] spark issue #17133: [SPARK-19793] Use clock.getTimeMillis when mark task as ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17133 **[Test build #73807 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73807/testReport)** for PR 17133 at commit [`37f26e3`](https://github.com/apache/spark/commit/37f26e3e51d77548aa285856d22834683d37e889). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified column...
Github user skambha commented on the issue: https://github.com/apache/spark/pull/17067 Thanks a lot Xiao.
[GitHub] spark issue #16883: [SPARK-17498][ML] StringIndexer enhancement for handling...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/16883 @VinceShieh I added some minor comments. This is a nice feature!
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104094424

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -163,25 +187,28 @@ class StringIndexerModel (
         }
         transformSchema(dataset.schema, logging = true)

    +    val metadata = NominalAttribute.defaultAttr
    +      .withName($(outputCol)).withValues(labels).toMetadata()
    +    // If we are skipping invalid records, filter them out.
    +    val (filteredDataset, keepInvalid) = getHandleInvalid match {
    --- End diff --

Actually, I think returning a tuple here just makes things more confusing. Maybe you can move the check outside of the match.
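One way to read the suggestion above is to derive the flag in its own statement so that the match yields a single value. The sketch below shows that shape on plain collections and is not the PR's code: `HandleInvalidSketch`, `prepare`, and the use of a `Seq[String]` in place of a Dataset are all assumptions made for illustration.

```scala
object HandleInvalidSketch {
  // The option values mirror the strings used in the PR under review.
  val Skip = "skip"
  val Error = "error"
  val Keep = "keep"

  /**
   * Hypothetical stand-in for the transform step: `rows` plays the role of the
   * dataset's label column and `known` the labels seen during fitting.
   */
  def prepare(rows: Seq[String], known: Set[String], handleInvalid: String): (Seq[String], Boolean) = {
    // The flag is computed on its own line, outside the match.
    val keepInvalid = handleInvalid == Keep
    // The match now produces one value instead of a tuple.
    val filtered = handleInvalid match {
      case Skip => rows.filter(known.contains) // drop rows with unseen labels
      case _    => rows                        // "error" and "keep" leave the rows untouched
    }
    (filtered, keepInvalid)
  }
}
```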
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104093892

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -105,7 +125,11 @@ class StringIndexer @Since("1.4.0") (
     @Since("1.6.0")
     object StringIndexer extends DefaultParamsReadable[StringIndexer] {
    -
    +  private[feature] val SKIP_UNSEEN_LABEL: String = "skip"
    +  private[feature] val ERROR_UNSEEN_LABEL: String = "error"
    +  private[feature] val KEEP_UNSEEN_LABEL: String = "keep"
    --- End diff --

It would make me even happier if these were public and could be used by the test code, but I think it's up to the committers (jkbradley).
[GitHub] spark issue #13320: [SPARK-13184][SQL] Add a datasource-specific option minP...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/13320 @gatorsmile Could you check this and give me comments, too?
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104093629

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -163,25 +187,28 @@ class StringIndexerModel (
         }
         transformSchema(dataset.schema, logging = true)

    +    val metadata = NominalAttribute.defaultAttr
    +      .withName($(outputCol)).withValues(labels).toMetadata()
    +    // If we are skipping invalid records, filter them out.
    +    val (filteredDataset, keepInvalid) = getHandleInvalid match {
    --- End diff --

Minor style comment: instead of keepInvalid, do you think that indexInvalid might be a better name?
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query type dos...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17145 **[Test build #73817 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73817/testReport)** for PR 17145 at commit [`f5a35f6`](https://github.com/apache/spark/commit/f5a35f6bc3ec032f429137676b99b888ae326acc).
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104093452

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -163,25 +190,28 @@ class StringIndexerModel (
         }
         transformSchema(dataset.schema, logging = true)

    +    val metadata = NominalAttribute.defaultAttr
    +      .withName($(outputCol)).withValues(labels).toMetadata()
    --- End diff --

I think he means that "labels" above should also include the invalid bucket. In previous ML frameworks I've worked on we've just called this "unknown".
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104093159

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -105,7 +125,11 @@ class StringIndexer @Since("1.4.0") (
     @Since("1.6.0")
     object StringIndexer extends DefaultParamsReadable[StringIndexer] {
    -
    +  private[feature] val SKIP_UNSEEN_LABEL: String = "skip"
    +  private[feature] val ERROR_UNSEEN_LABEL: String = "error"
    +  private[feature] val KEEP_UNSEEN_LABEL: String = "keep"
    --- End diff --

This is very nice, good use of constants, I really like to see this type of code :)
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104093069

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -71,18 +92,17 @@ class StringIndexer @Since("1.4.0") (
       def this() = this(Identifiable.randomUID("strIdx"))

       /** @group setParam */
    -  @Since("1.6.0")
    -  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
    -  setDefault(handleInvalid, "error")
    -
    -  /** @group setParam */
       @Since("1.4.0")
       def setInputCol(value: String): this.type = set(inputCol, value)

       /** @group setParam */
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)

    +  /** @group setParam */
    +  @Since("2.2.0")
    +  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
    --- End diff --

Can you keep the order of the params the same as before? Also, why did the version change? This method existed before, so it seems it should remain at version 1.6. Also, a minor style comment: keep the setDefault(handleInvalid) below the set method.
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104092772

    --- Diff: docs/ml-features.md ---
    @@ -576,7 +579,22 @@ will be generated:
      2  | c        | 1.0

    -Notice that the row containing "d" does not appear.
    +Notice that the rows containing "d" or "e" do not appear.
    +
    +If you call `setHandleInvalid("keep")`, the following dataset
    +will be generated:
    +
    +
    + id | category | categoryIndex
    +----|----------|---------------
    + 0  | a        | 0.0
    + 1  | b        | 2.0
    + 2  | c        | 1.0
    + 3  | d        | 3.0
    + 4  | e        | 3.0
    +
    +
    +Notice that the rows containing "d" or "e" are mapped with indices "3.0"
    --- End diff --

Doc suggestion: rows containing "d" or "e" are mapped with indices "3.0" => rows containing "d" and "e" are mapped to index "3.0"
[GitHub] spark issue #17136: [SPARK-19783][SQL] Treat shorter/longer lengths of token...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17136 **[Test build #73816 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73816/testReport)** for PR 17136 at commit [`5a01a9d`](https://github.com/apache/spark/commit/5a01a9dcbe1fb922a7e240fdee3bd4b7fa4e471a).
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104092723

    --- Diff: docs/ml-features.md ---
    @@ -576,7 +579,22 @@ will be generated:
      2  | c        | 1.0

    -Notice that the row containing "d" does not appear.
    +Notice that the rows containing "d" or "e" do not appear.
    +
    +If you call `setHandleInvalid("keep")`, the following dataset
    +will be generated:
    +
    +
    + id | category | categoryIndex
    +----|----------|---------------
    + 0  | a        | 0.0
    + 1  | b        | 2.0
    + 2  | c        | 1.0
    + 3  | d        | 3.0
    + 4  | e        | 3.0
    +
    +
    +Notice that the rows containing "d" or "e" are mapped with indices "3.0"
    --- End diff --

Doc suggestion: mapped with indices "3.0" => mapped to index "3.0"
[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16944 Merged build finished. Test FAILed.
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16883#discussion_r104092627

    --- Diff: docs/ml-features.md ---
    @@ -542,12 +543,13 @@ column, we should get the following:
     "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
     index `2`.

    -Additionally, there are two strategies regarding how `StringIndexer` will handle
    +Additionally, there are three strategies regarding how `StringIndexer` will handle
     unseen labels when you have fit a `StringIndexer` on one dataset and then use it
     to transform another:

     - throw an exception (which is the default)
     - skip the row containing the unseen label entirely
    +- map the unseen labels with indices [numLabels]
    --- End diff --

Doc suggestion: "map the unseen labels to their own index"
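The three strategies listed in the doc diff above can be sketched without Spark as a lookup over the fitted label array. This is an illustration under stated assumptions, not the `StringIndexerModel` code: `UnseenLabelSketch` and `index` are hypothetical names, `None` stands in for "the row is dropped", and `IllegalArgumentException` stands in for whatever exception the real transformer throws.

```scala
object UnseenLabelSketch {
  /**
   * @param labels        fitted labels ordered by descending frequency
   * @param value         the label found in the row being transformed
   * @param handleInvalid one of "error", "skip", "keep"
   * @return Some(index) for an indexed row, None when the row is skipped
   */
  def index(labels: Array[String], value: String, handleInvalid: String): Option[Double] = {
    val pos = labels.indexOf(value)
    if (pos >= 0) {
      Some(pos.toDouble) // seen label: its position in the fitted array
    } else handleInvalid match {
      case "keep"  => Some(labels.length.toDouble) // all unseen labels share index numLabels
      case "skip"  => None                         // the row is filtered out
      case "error" => throw new IllegalArgumentException(s"Unseen label: $value")
      case other   => throw new IllegalArgumentException(s"Unknown handleInvalid value: $other")
    }
  }
}
```

With the fitted order `Array("a", "c", "b")` from the docs example, both "d" and "e" map to the single index 3.0 under "keep", which is why the suggested wording is "mapped to index" rather than "mapped with indices".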
[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16944 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73808/ Test FAILed.
[GitHub] spark issue #15928: [SPARK-18478][SQL] Support codegen'd Hive UDFs
Github user maropu commented on the issue: https://github.com/apache/spark/pull/15928 @rxin yea, I got x1.3-1.4 performance gains in this pr.
[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16944 **[Test build #73808 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73808/testReport)** for PR 16944 at commit [`514ae06`](https://github.com/apache/spark/commit/514ae06e1dbe2640091c90d55354c3500857e6e2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15928: [SPARK-18478][SQL] Support codegen'd Hive UDFs
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15928 What do you mean? The improvement was small?
[GitHub] spark issue #17136: [SPARK-19783][SQL] Treat shorter/longer lengths of token...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17136 Jenkins, retest this please.
[GitHub] spark issue #15928: [SPARK-18478][SQL] Support codegen'd Hive UDFs
Github user maropu commented on the issue: https://github.com/apache/spark/pull/15928 I looked into this, but I got only a small gain from this fix, so I'll close it for now. Thanks!
[GitHub] spark pull request #15928: [SPARK-18478][SQL] Support codegen'd Hive UDFs
Github user maropu closed the pull request at: https://github.com/apache/spark/pull/15928
[GitHub] spark issue #17140: [SPARK-19796][CORE] Fix serialization of long property v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17140 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73802/ Test PASSed.
[GitHub] spark issue #17140: [SPARK-19796][CORE] Fix serialization of long property v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17140 Merged build finished. Test PASSed.
[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry
Github user maropu commented on the issue: https://github.com/apache/spark/pull/16981 @gatorsmile okay, I'll fix the issues you mentioned.
[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16981 **[Test build #73815 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73815/testReport)** for PR 16981 at commit [`ddc06cf`](https://github.com/apache/spark/commit/ddc06cf46b3b2730dc5ec8f49e12225c60d05b7c).
[GitHub] spark issue #17140: [SPARK-19796][CORE] Fix serialization of long property v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17140 **[Test build #73802 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73802/testReport)** for PR 17140 at commit [`99692bf`](https://github.com/apache/spark/commit/99692bf9860f375eab7f7c35d17f83d2c726ae77). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17122: [SPARK-19786][SQL] Facilitate loop optimizations ...
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17122#discussion_r104091814

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
    @@ -206,6 +206,18 @@ trait CodegenSupport extends SparkPlan {
       def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
         throw new UnsupportedOperationException
       }
    +
    +  /**
    +   * for optimization to suppress shouldStop() in a loop of WholeStageCodegen
    +   *
    +   * isShouldStopRequired: require to insert shouldStop() into the loop if true
    +   */
    +  def isShouldStopRequired: Boolean = {
    +    return shouldStopRequired && !(this.parent != null && !this.parent.isShouldStopRequired)
    --- End diff --

Thank you for your suggestion. However, it caused an assertion failure at `"SPARK-7150 range api"` in DataFrameRangeSuite. In the failure case, `isShouldStopRequired` is called through `parent` along the following chain: `RangeExec -> FilterExec -> WholeStageCodegenExec`
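The boolean expression in the diff above walks the `parent` chain recursively. By De Morgan's law, `!(parent != null && !parent.isShouldStopRequired)` is equivalent to `parent == null || parent.isShouldStopRequired`, which the sketch below uses. This is a standalone model rather than Spark code: `Node` is a hypothetical class, the operator names are borrowed only as labels, and which operators actually clear the flag is an assumption made for the example.

```scala
// Hypothetical stand-in for a plan operator that participates in codegen.
final class Node(val name: String, val shouldStopRequired: Boolean, val parent: Node) {
  // True only if this node and every ancestor up to the root require shouldStop().
  def isShouldStopRequired: Boolean =
    shouldStopRequired && (parent == null || parent.isShouldStopRequired)
}

object ShouldStopSketch {
  // Builds the chain named in the comment; the flag values are illustrative only.
  def chain(rootRequiresStop: Boolean): Node = {
    val root = new Node("WholeStageCodegenExec", rootRequiresStop, null)
    val filter = new Node("FilterExec", true, root)
    new Node("RangeExec", true, filter)
  }
}
```

In this model a single ancestor with the flag cleared is enough to suppress the check for every descendant, so the result at the leaf depends on the whole chain and not only on the immediate parent.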
[GitHub] spark pull request #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistr...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16981#discussion_r104091757

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala ---
    @@ -55,4 +60,22 @@ object JacksonUtils {

         schema.foreach(field => verifyType(field.name, field.dataType))
       }
    +
    +  def strToStructType(schemaAsJson: String): StructType = Try {
    --- End diff --

yes, I'll remove
[GitHub] spark pull request #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistr...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16981#discussion_r104091471

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala ---
    @@ -55,4 +60,22 @@ object JacksonUtils {

         schema.foreach(field => verifyType(field.name, field.dataType))
       }
    +
    +  def strToStructType(schemaAsJson: String): StructType = Try {
    +    DataType.fromJson(schemaAsJson).asInstanceOf[StructType]
    +  }.getOrElse {
    --- End diff --

okay, I'll fix
[GitHub] spark pull request #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistr...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16981#discussion_r104091422

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -3007,7 +3008,7 @@ object functions {
        * @since 2.1.0
        */
       def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column =
    -    from_json(e, DataType.fromJson(schema).asInstanceOf[StructType], options)
    +    from_json(e, JacksonUtils.strToStructType(schema), options)
    --- End diff --

okay, I'll do.
[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17144 **[Test build #73814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73814/testReport)** for PR 17144 at commit [`9ec5caf`](https://github.com/apache/spark/commit/9ec5cafb32a8137645dda50c958d95c26f3948bc).
[GitHub] spark pull request #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistr...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16981#discussion_r104091265

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -174,4 +174,22 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
           .select(to_json($"struct").as("json"))
         checkAnswer(dfTwo, readBackTwo)
       }
    +
    +  test("SPARK-19637 Support to_json in SQL") {
    +    // to_json
    --- End diff --

Nit: remove this comment.